Microsoft Research Podcast - 115 - Diving into Deep InfoMax with Dr. Devon Hjelm

Episode Date: May 13, 2020

Dr. Devon Hjelm is a senior researcher at the Microsoft Research lab in Montreal, and today, he joins me to dive deep into his research on Deep InfoMax, a novel self-supervised learning approach to training AI models – and getting good representations – without human annotation. He also tells us how an interest in neural networks, first human and then machine, led to an inspiring career in deep learning research. https://www.microsoft.com/research

Transcript
Starting point is 00:00:00 The key thing that we walked away with, with Deep InfoMax, was that we don't really care about estimating mutual information. We don't care about the number that corresponds to how dependent things are. We just want a model that understands whether or not there is more or less mutual information so that we could use that number as a learning signal to train the encoder. You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research
Starting point is 00:00:28 and the scientists behind it. I'm your host, Gretchen Huizinga. Dr. Devon Hjelm is a senior researcher at the Microsoft Research Lab in Montreal. And today, he joins me to dive deep into his research on Deep InfoMax, a novel self-supervised learning approach to training AI models and getting good representations without human annotation. He also tells us how an interest in neural networks,
Starting point is 00:00:57 first human and then machine, led to an inspiring career in deep learning research. That and much more on this episode of the Microsoft Research Podcast. Devon Hjelm, welcome to the podcast. Thank you. Glad to be here. So you're a senior researcher who's deep into deep learning at the MSR Lab in Montreal. So I've had several of your colleagues on the show over the last couple years, and we've talked about different flavors and different approaches to machine learning and learning machines.
Starting point is 00:01:36 But today I want to hear your take on what you're all up to there. What's the big goal of your lab, and has it changed over the past couple years at all or grown more nuanced given new discoveries and advances in the research? Well, yes, the lab is relatively new. It's only been under Microsoft or MSR for like two or three years now. And the lab is also fairly diverse. It started from a background of like machine reading comprehension and language understanding, trying to build tools
Starting point is 00:02:05 based on language and knowledge graphs and stuff like that for people. Moving to Montreal and just basically becoming part of the ecosystem there, incorporating more deep learning, incorporating things like fairness and FATE. Its mission is very much still focused on empowering people through research and compute and stuff like that. Right. So how would you define sort of the big audacious goal of the work you're doing in Montreal? So the team that I'm part of, we're kind of like the deep learning camp, I guess; we're the people who really focus on using these very large deep neural networks. And so the core idea that we're kind of like really focused on is how do we use these big things to help empower people, give them really interesting, new, useful tools that improve their lives. Almost everything that we're assuming here is that we're going to be using deep learning or deep neural networks to do this because over the last decade or so, we've seen tremendous kind of like explosion of utility on models that are based off of deep learning.
Starting point is 00:03:04 And we anticipate that to continue to be the case. Well, let's talk more specifically about what you're investigating personally and why you think it's important. Give us the Virtual Earth 3D snapshot of the research interests you have and what they bring to the broader field of machine learning. What gets Devon Hjelm up in the morning? When you look at when you're using like a large scale model to produce something useful for people in the world, you're kind of talking about the models taking some data, usually complex and high dimensional that's coming from the real world, it's transforming it in some way. And then from that transformation, it's kind of producing
Starting point is 00:03:42 utility. So for one example, you can imagine like a self-driving car, it's exposed to camera video feed. And then from that camera video feed, it builds an understanding of kind of like all the different objects that it sees in its view, for instance, like different cars, different people and stuff like that. And then from there, it like makes a decision where to drive so that it successfully navigates you down the road without any catastrophic accidents. So the intermediate step in between that is like, what is the product of sort of like the processing of that big network that leads to the good
Starting point is 00:04:15 performance? So in the case of self-driving car, you need a visual system that's able to identify what all the objects are, what they're doing, what their velocities might be, so I can make good decisions on whether or not I want to turn or go straight or slam on the brakes or something like that. I'm really interested in sort of like how do we arrive to those good, what we call representations of the world from high-dimensional data. Well, let's rewind for a minute because where you've been has influenced where you are today. You did a postdoc under Yoshua Bengio, who's a bona fide deep learning luminary. Tell us about what you were working on during your postdoc days and how that work has evolved and informed what you're working on today. Yeah, so I've always been like strongly influenced by Yoshua and also the general
Starting point is 00:05:17 camp that he kind of is centered on, which is the whole deep learning camp. Yoshua has always been sort of like really strongly involved with generative models and representation learning and unsupervised learning. So it was just kind of like a natural fit for me to do a postdoc over there. So while I was there, I focused on generative adversarial networks, also called GANs. And this work kind of like naturally led into mutual information estimation because there's a lot of parallels or kind of similarities between how a generative adversarial network learns to generate data and how you might estimate mutual information. And then this ultimately led into stuff having to do with learning representations using mutual information estimation. All right.
Starting point is 00:06:03 Well, I want to go back a little bit because you mentioned a camp. And if I understand that, it's like people getting together and saying, this is how I believe, this is my worldview of deep learning, as opposed to another worldview of deep learning. So can you kind of differentiate what the difference is there? I mean, ultimately, everybody's sort of interested in this general problem space that I described initially, which was like, how do you take complex real-world data and do useful things with it? How do you plan? How do you reason? Stuff like that. But a key component of this is how do you process or how do you perceive the world? Up until deep learning appeared, the fields weren't having tremendous success on how to process very, very large dimensional data, those coming from
Starting point is 00:06:45 vision or natural language. And so when you look at the kind of high-level view of what it means to process complex data and to do useful things on it, different people focus on different parts of that. So for instance, there's a whole field of people who basically focus on features that are given from very complex neural networks and just figure out how to reason on top of those. But then there's also people who believe that, you know, however we perceive the world, they should be packed into symbols that resemble a form of logic. And we need to be using these sorts of things if we really want to be talking about solving these really hard problems. And so the deep learning camp kind of sort of defaults to the idea, well, we're just going to throw the whole thing end-to-end at
Starting point is 00:07:33 the problem and train the whole thing end-to-end and do back propagation, throwing as much data as possible at it. And, you know, it's worked really, really well. And it continues to sort of be one of the factors that drives us forward. Well, I think it's interesting because it does affect, you know, your choice in research on what direction you're going and how you're going to run at the hill. Yeah. And one of the consequences of having to do things end-to-end is it's extremely expensive. So it's actually becoming more and more difficult. Back in the day, people were working with these static image datasets that were small, like 32 by 32 pixels. And this has slowly expanded and exploded to very, very large datasets. People are working with video. And with that, if you want to do things end-to-end,
Starting point is 00:08:21 the compute cost goes up. It becomes more difficult to run thousands of trials of similar models to see which ones work better. And it becomes a thing that's harder for smaller researchers to do and more of a thing that's done by the very biggest players. Well, we've been, and by we I mean you, have been working on AI for more than 50 years now. Right. And there's been some amazing progress in the field, especially in deep learning,
Starting point is 00:08:48 when it comes to performing easily definable and discrete tasks. But when it comes to performing tasks in complex real-world situations, we are, as you say, still very far from solving AI. So in broad strokes, and I want you to stay kind of high level here, what's the big problem and what's holding machines back IRL? So, I mean, the way that I see it, one of the biggest challenges that we're facing right now after sort of like the surge of deep learning is generalization. So this is the ability for a model, given that it's been trained in a certain way, to perform well in a different situation. And this is really important because it's either very difficult to impossible to collect all possible data that you would need that would resemble sort of like the test environment.
Starting point is 00:09:36 So, for instance, you can imagine the self-driving car scenario. It's very expensive for me to try to train a visual system under all possible road conditions at all times of the day, at all locations on earth. And these models, they do have a tendency, if you're not careful, to totally fail when you present them with new combinations of data such as that. So if I only train in Northern California and I transfer to Quebec in the middle of winter, there's things about those systems that will fail. And then in addition to that, I mean, if we want these things to work
Starting point is 00:10:09 with humans who are notoriously good at expressing unique, hard-to-model behavior, our models have to be pretty good at generalizing to that behavior to actually be useful. Otherwise, it'll only be useful to like a subset of the population. And that's not what we really strive for. Well, one of the thorniest longstanding challenges to ML researchers is learning good representations without annotation. And this is part of the expense problem, right? It's the labeling data and so on. So what's wrong, in your opinion, with the annotation model and the learning algorithms behind it? And what kinds of learning algorithms do you think we need to take us into a new ML future? There's a couple different things, I suppose. If you're going to train a
Starting point is 00:10:56 model under sort of like the standard supervised setting, suppose I'm given, like I said before, like self-driving car data and somebody annotates like the position and the class for every single object in the visual scene. And then I'm able to train a model to do this end to end, you know, it can resolve to a pretty good representation that I might be able to do some planning on. But that annotation alone is very difficult to do. But on top of that, it's difficult to say whether or not any particular annotation is useful for a general task. You know, some scene that's happening,
Starting point is 00:11:30 some video scene or something like that, you're trying to describe what's going on. How would a human describe what's going on? It might only capture a fraction of what really is going on. And a model trained in that way might only be useful for certain tasks and not for other tasks. When we're talking about learning good representations, that's sort of one of the
Starting point is 00:12:02 nuggets of what you're after, right? So let's get specific and talk about how you're taking a run at this good representation hill. Last year, you presented a paper at ICLR that outlines an approach you call Deep InfoMax. So tell us about Deep InfoMax and start kind of writ large. What is it? And what are the learning principles on which it's based? We'll get real specific and technical in a second here, but give us the big picture. Sure, sure. So at the high level, it's a type of model that learns representations in an unsupervised way, that is, without labels that a human needs to define ahead of time. And it's also, I guess, what's being called a self-supervised model. And so this is a model that kind of, instead of tasks being designed by a human, in the sense that the labels or targets that it's trying to predict are coming from something like the class cat or dog, it generates its own labels
Starting point is 00:12:59 by basically playing around with the statistics of the data. There's two kind of like core themes behind Deep InfoMax. So one is you're given a bunch of data that has some structure, like there's patches, I can extract patches from images, I can present the model whole or parts of the image. And I can basically ask the model, can you tell whether or not these things go together or not? Yeah, so it's just basically a two-way game, just yes or no. Does it go together? So this is like one part of it. And then the other part is like the actual function that you use to train this thing. So there has to be sort of like a number that the model outputs that tells you how well it's doing. And this thing is the mutual
Starting point is 00:13:41 information estimation maximization thing. When you present a model, say two different sets of pairs, and you ask it to differentiate between these two, this effectively is forcing the model to learn something about the mutual information about the things that go together. Because you're encouraging it to understand the dependencies or the relationships of the things that go together. So for instance, if I give you a bunch of different pictures of patches of the same cat, you have to understand a little bit of the structure of a cat. And so these things are dependent or related in the sense that they all eventually compose the same thing, a cat. But if I present other things, and I just basically say,
Starting point is 00:14:25 you should be able to tell that these things go together, as opposed to, like, say, patches that come from cats and dogs. It forces the model to learn that these things are related. Let's unpack Deep InfoMax technically on several levels and start with a critical question that I think you borrowed from Sesame Street. Can you tell which thing is not like the other? So you're addressing this with the question, does this pair belong together? How do you do that by both borrowing from and diverging from the technical approaches in the mutual information neural estimator, as you call it, or MINE? So MINE at its core is meant to estimate mutual information.
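[Editor's note: Before the MINE discussion continues, here is a minimal sketch, assuming PyTorch, of the two-way "does this go together?" game described above. The toy encoder, the fixed corner crops, and the in-batch shuffle used to manufacture negative pairs are illustrative assumptions, not the exact setup from the Deep InfoMax paper.]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoder: maps a 32x32 image patch to a 128-dim feature vector.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
    nn.Flatten(), nn.Linear(64 * 8 * 8, 128),
)

# Discriminator: scores whether a pair of patch features "goes together".
discriminator = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1),
)

def pair_game_loss(images):
    """images: (batch, 3, 64, 64). Self-supervised two-way game."""
    a = images[:, :, :32, :32]  # top-left patch of each image
    b = images[:, :, 32:, 32:]  # bottom-right patch of each image
    za, zb = encoder(a), encoder(b)
    # Positive pairs: two patches from the same image (label 1).
    pos = discriminator(torch.cat([za, zb], dim=1))
    # Negative pairs: patches from different images, made by shifting
    # the second patch's features by one within the batch (label 0).
    neg = discriminator(torch.cat([za, torch.roll(zb, 1, dims=0)], dim=1))
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```

The labels here cost nothing: they come entirely from how the data was carved up, which is the sense in which the model "generates its own labels."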
Starting point is 00:15:04 And so mutual information is this quantity that expresses how related two different random variables are or how related two different sets of random variables are. And so it's an extremely important quantity because being able to tell how related things are can help with all sorts of things like prediction, all sorts of other sort of important downstream tasks. But it's also a very notoriously difficult quantity to estimate. So if you have very high dimensional data that's continuous, say like images or language, there traditionally hasn't been any straightforward way to estimate it because you have to do this sort of like infinite
Starting point is 00:15:45 integral over distributions that you don't necessarily know. This is where neural networks come in. Neural networks, and in particular GANs, have this ability to estimate log ratios of probabilities without actually needing to know what the structure of that distribution is. And what I mean by that structure of that distribution is like, we don't know if it's Gaussian or Poisson distributed or whatever. But GANs, they estimate these log ratios, and then they use them to train their generator function. So if you look at the mutual information, it's just like a divergence.
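[Editor's note: Written out, the quantity under discussion is the KL divergence between the joint distribution and the product of marginals, and MINE tightens a lower bound on it with a neural network. A sketch in standard notation:]

```latex
% Mutual information as a divergence:
I(X;Y) = D_{\mathrm{KL}}\big( p(x,y) \,\|\, p(x)\,p(y) \big)

% MINE's Donsker-Varadhan lower bound, where T_\theta is a neural
% network that scores pairs (the "discriminator"):
I(X;Y) \ge \mathbb{E}_{p(x,y)}\big[ T_\theta(x,y) \big]
         - \log \mathbb{E}_{p(x)\,p(y)}\big[ e^{T_\theta(x,y)} \big]
```

An optimal discriminator between the two distributions recovers their log ratio, which is why the classifier described next is implicitly estimating mutual information.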
Starting point is 00:16:28 It's a difference, like a difference between two different distributions. One is the joint distribution between two variables, and one is the product of marginals. And so the joint distribution is just basically the probability that these two things co-occur. And then the product of marginals is their probabilities that they occur independently of each other. And so the way that GANs do this estimation is you just draw samples from two different distributions and you train a discriminator. And a discriminator is just a classifier. So you just present samples from one distribution, present samples from the other distribution, and you ask, does it belong to one distribution or the other distribution? So this is sort of like the technical thing. So like if, for instance, if I'm trying to train a model that's able to distinguish between cats and dogs, I present it with cats, and I tell it,
Starting point is 00:17:17 hey, this is label zero. I present it with dogs. This is label one. And at the end of the day, if you train like a standard deep network classifier, it's learning to estimate the log ratio of the probability of a cat versus the probability of a dog. So the mutual information estimator, what it essentially comes down to is it's training a classifier between samples that go together and don't go together. So Deep InfoMax is very much based on our work on the mutual information neural estimator, or MINE. What Deep InfoMax basically does is it takes like a full image and it presents it through a deep neural network. And when it gets processed through this deep neural network,
Starting point is 00:18:00 if you look at different layers of this network, this is a convolutional neural network, so different locations of this convolutional neural network have been processed by different patches of the image. So you can think of these features at these different levels, at these different locations, being part of the input. So what Deep InfoMax basically does is it says, well, all those features basically go together. So I'm going to group them all together and present them to a classifier and say, well, tell me that these go together.
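[Editor's note: A minimal sketch of the local objective being described, assuming PyTorch; it also includes the negative pairs that come up next in the conversation. The dot-product score function and the use of other images in the batch as negatives are simplifications of the discriminator architectures explored in the paper.]

```python
import torch
import torch.nn.functional as F

def local_dim_loss(local_feats, global_feat):
    """Simplified local Deep InfoMax-style objective.

    local_feats: (B, C, H, W) feature map from an intermediate conv
                 layer; each of the H*W locations was computed from a
                 patch of the input image.
    global_feat: (B, C) summary vector for the whole image.
    """
    B, C, H, W = local_feats.shape
    locals_flat = local_feats.view(B, C, H * W)
    # scores[i, j, k]: image i's global vector paired with image j's
    # k-th local feature (a simple dot-product discriminator).
    scores = torch.einsum('ic,jck->ijk', global_feat, locals_flat)
    pos = scores[torch.arange(B), torch.arange(B)]   # same image: "together"
    mask = ~torch.eye(B, dtype=torch.bool)
    neg = scores[mask].view(B, B - 1, H * W)         # other images: "not together"
    # Binary-game objective: push positive scores up and negative
    # scores down (softplus(-x) is the logistic loss for label 1).
    return F.softplus(-pos).mean() + F.softplus(neg).mean()
```

The same-image pairs play the role of samples from the joint distribution, and the mixed-image pairs stand in for the product of marginals, mirroring the MINE setup.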
Starting point is 00:18:35 Classify it as zero or one, whatever you call "together." And then take combinations of those patch representations with images that came from somewhere else, put those together and say, these don't go together. And so that process is actually very similar to what we did in mutual information neural estimation: the things that go together, these are really like samples from the joint distribution. The things that don't go together, well, this resembles something like samples from the product of marginals. And so when you train a classifier to distinguish between these two, you're training the model in a similar way that you are in MINE to interpret the dependencies between all the things that go together that make them go together. Like,
Starting point is 00:19:15 why do they go together? And that's sort of like encoded into the idea about the joint distribution. So when you do that, you really are estimating something like the mutual information. But the key thing that we walked away with, with Deep InfoMax, was that we don't really care about estimating mutual information. We don't care about the number that corresponds to how dependent things are. We just want a model that understands whether or not there is more or less mutual information so that we could use that number as a learning signal to train the encoder. Well, let's talk a bit more about some of the problems that arise when you aim for, as you call it, pure mutual information maximization.
Starting point is 00:19:58 You've said in the past, that's not actually what we're aiming for here. So what do you do with the issues of noisy information? And what do you really want to aim for here? I guess there's like two different ways to answer that. So one is that, I mean, when we do this Deep InfoMax-style learning on images, while it does resemble something like mutual information maximization, there's a caveat in the sense that MINE is only an estimator. It's sort of like a lower bound to the mutual information. It can only learn the number of dependencies that it's capable of based on the capacity
Starting point is 00:20:30 of the neural network and the number of samples from the world that it's received. So the lower capacity of the model, the less stuff it'll be able to learn. The less samples it's exposed to, the less stuff it will be able to learn. But it has to learn something. So Deep InfoMax is based on this sort of like structural thing where we patch things
Starting point is 00:20:50 in different ways. This kind of biases the model to learn things that are expressible structurally. So for instance, because I'm effectively doing a comparison kind of game between different patches of the same image, it needs to understand why those patches are related. And it maybe doesn't need to understand something more nuanced, like the texture of one of the patches compared to the other. And the reason why it doesn't need to understand that is because it's maybe not learnable. It's a much harder problem than just understanding whether or not the shape goes together or the general color goes together or something like that. So the model will focus on those things that are easy to pick up. And a lot of times, how we design these tasks, this way of breaking up the data in a particular way, when we apply it to mutual information neural estimation-style learning, or Deep InfoMax, matters more than the
Starting point is 00:21:46 actual objective that we use. So is there anything different or anything sort of noteworthy about any of the technical aspects of Deep InfoMax that sort of say, hey, stand up and take notice, this is a new approach to solving some of these problems? So the main nuanced thing about the Deep InfoMax model was that, so as I mentioned before, we were taking these local representations, these features that corresponded to a patch of the input. And it's important in Deep InfoMax, if you do things that way, that those features actually correspond to a patch. What's interesting about a lot of the convolutional models that are used in the wild,
Starting point is 00:22:27 like the very popular ones like ResNet, while they have a spatial extent as you progress through the network, very, very quickly, those locations cover the whole input. So even though it has a spatial extent, there are locations that are spatial in the neural network. But if you back project and look at the stuff that is processed, it's actually processing the whole input. So that's not actually a different view of the input anymore. It's good and bad in the sense that
Starting point is 00:22:57 it's nice that at some point, the architectures are mixing everything they can from the input to try to infer whether or not something belongs to a class in the case of supervised learning. But in the case of self-supervised learning, if you want to really leverage these locations from the natural architecture, then you need to be a little bit more careful about how you apply architectures. So if you have these architectures that quickly expand over the full receptive field of the input, then you run into trouble. And so Deep InfoMax is kind of particular in that it really tried to leverage the internal structure of the model over just, say, like pure messing with the input data and then
Starting point is 00:23:40 designing losses on top of it. At this point in the podcast, Devon, I always ask my guests what could possibly go wrong, so I'll ask you too. And I do this because I want to address some of the elephants in the room where, you know, this is a powerful technology. Is there anything about your work that keeps you up at night metaphorically? And if so, how are you addressing it? I do think about whether or not our work is more directly related to things like fairness and privacy and nefarious agents than we might realize. So some people, particularly in like the FATE team, focus more on like sort of the reasoning aspects of our models, like how do they take data presented to them and produce results that are fair or retain privacy and stuff like that. But kind of like the
Starting point is 00:24:43 general trend that we're seeing is that we're using for our reasoning, we're using more and more deep models that produce features that people use to do that reasoning on top of. And so I'm very much interested in how the quality of those features of that model impact these, I don't know what you call them, moral metrics? So for instance, is it possible that my model, if it's presented with a face of a person, also encodes their identity perfectly or something like that? That my features either do or don't make it easy for someone to infer where those features came from and their identities that they might want to keep hidden. So yeah, I mean, like in particular, when you're talking about like the mutual information stuff, like we're working really hard on maximizing mutual information all over the place. So we're just trying to capture as much information about the data as possible. But you could also imagine
Starting point is 00:25:40 using the same techniques to minimize mutual information. So you just basically flip the sign on some of the objective functions and say, okay, there's these properties that I really don't want in this representation. Whatever you do, whatever this representation looks like, minimize it. And you can use the exact same objective function to try to do the exact same thing. It's a little bit trickier because you're dealing with this min-max, but you can imagine doing stuff like that. So it's like another sub-part of that whole question about what a good representation is, is sometimes you don't really want all the underlying things in that representation. You maybe want to actually hide things. Yeah, yeah, yeah. I want to drill in just a little bit on how you can control at the outset, keeping a lid on the things that you could see going wrong with machines that act in the world like humans do and pass the
Starting point is 00:26:35 Turing test on a grand scale. So when we talk about whether or not a representation is good or not or useful, which I guess is like sort of like the core of what I'm focused on, it's important that among the collection of things that we use to evaluate our models, we keep in mind metrics that evaluate things like fairness and privacy. So one thing that we're seeing as we progress in representation learning is that it's not just like one metric that really matters as far as like whether or not the representation is going to be good for deployment on some complicated downstream task. It's not going to be just classification. There's like a suite of things that we have to evaluate models on, and the suite of things kind of like provides a better story,
Starting point is 00:27:22 like a fine-grained story as far as like whether or not this representation truly is going to be useful. And one of those dimensions of usefulness is things like privacy and fairness. Speaking of stories, tell us about yourself, Devon, and your path to machine learning research at MSR, and how's the ML game better since you joined the team? I mean, I guess I've always been pretty interested in representation learning, like understanding representations of the world, like deriving them. So I started in physics, which is about learning about representations of the world, which have to do with like dynamics and all sorts of quantities, physical quantities. And then I got interested in languages,
Starting point is 00:28:05 because I was interested in how people represent the world through their language, through words, and their relationships, and stuff like that. And so I went from physics to linguistics, and did a stint there. So I quickly realized that, like, at least where I was at, they didn't have the tools necessary to really solve the types of problems that I was interested in, which was understanding from language how humans represent the world. So I started getting really interested in using computers to help solve these problems, like models and stuff like that. So I got involved with some people over at the CS department and then joined the PhD program in computer science at the University of New Mexico, where probably one of the stronger sort of like groups that were focusing on like modeling complex data
Starting point is 00:29:00 and learning representations was on the neuroimaging side. So there was like a big research institute called the Mind Research Network. So I talked to Vince Calhoun, who's still chief officer over there. And I said, hey, I'm interested in these deep neural networks. I think they might be useful for neuroscience stuff. They were looking at like sort of like brain imaging data, like fMRI, EEG, and some other related data sets and modes. And he said, okay, well, here's some public data we have available, try your model on it and see how it works. And I did it and it produced something interesting. And then he said, okay, well, I'll take you on as a grad student. So I lived over
Starting point is 00:29:43 there for a couple of years and I was kind of like the black sheep who was using the deep neural networks while everybody else was using these more linear models like ICA and PCA and stuff like that. So me and my sort of like unofficial advisor, Sergey Plis, were kind of like the deep learning nerds. And we put out some nice papers that use deep learning
Starting point is 00:30:04 and showed that it worked with fMRI data. But through that whole process, because Mind Research Network was such a good research institution with good connections and grant money and stuff like that, they were able to connect our small group to a bunch of really big names in deep learning like Russ Salakhutdinov and Kyunghyun Cho. Russ was at the time like a professor at University of Toronto. He's since moved on to CMU. And Kyunghyun Cho was a postdoc at the time with Yoshua. You know, this entire time, I'm like pushing on the whole representation learning stuff. But in the context of neuroimaging, learning how to use new models, even my stuff with generative models, it was all about learning good representations because you can use the intermediate states of generative models as representations as well. And just through those connections, I was able to get more deep learning papers into NeurIPS.
Starting point is 00:31:00 And through that connection, I was able to reach out to Yoshua at the end of my PhD and was able to connect up with them there. And so you were at Mila for a while, the Montreal Institute for Learning Algorithms. Yeah, I'm still an adjunct professor there. So I still co-supervise students and help them with the research. What's one thing we don't know about you, Devon? Something interesting that may have impacted your career, sort of personal, maybe it's a side quest or a personality trait or something interesting to sort of give us context about who you are a little deeper in. Well, I played bass, you know, more or less professionally for three years in grad school. You had a band?
Starting point is 00:31:48 Yeah, I'd have gigs like three, sometimes four days a week playing salsa. Are you kidding? Yeah, I'm not. So yeah, we would play everywhere from casinos to dances to everything like that. I mean, I've played music my entire life, but I was pretty ambitious about being very, very good. And so that group was pretty cool because there were super high skilled salsa musicians from like all over the world.
Starting point is 00:32:15 One person was from like a touring group in Cuba. And then there was another guy who joined our group who was on some Grammy albums. And so I got to do that for a little while. It was really good. So I was working as a musician on top of doing my PhD. Do you still play? I don't play bass anymore because I got tired of spending all my Fridays and Saturdays playing the same music all the time and not being able to like sit and like enjoy watching someone
Starting point is 00:32:47 else play. Right now I'm learning how to play the mandolin because it's easy to play by yourself. Let's get some parting thoughts and shots in. As we close, I want to give you the chance to think ahead and dream about the future. Let's say you're wildly successful, Devon. What will the world look like at the end of your career? What will you have accomplished in your field? And what will we be able to do that we hadn't been able to do before? I can imagine all sorts of things that we could do with models that we can't do today. For instance, present a model to a brand new environment. It's able to navigate and explore this environment on its own
Starting point is 00:33:27 with very little help from human experimenters and learn everything it needs to learn to do useful stuff. I firmly believe that it's important that our ultimate goal for all of this AI effort is to arrive to models and algorithms and agents that are useful to human beings in the real world so that they can do things that they couldn't do before more easily, sort of like to empower the more general population. On top of that, if I was wildly successful, it would mean continuing an ongoing, exciting community of people working on really difficult problems because they're
Starting point is 00:34:05 passionate about it. Everybody has sort of like their own ideas as far as like what would be useful or good for people. And as long as I'm part of that, someone who gets to interact with that community and help build and shape those things, I think that's the best possible thing that I can hope for. And so I'm just hoping that like I can be part of that community and it continues to thrive. Devon Hjelm, thank you for joining us today. It's been really fun. Thank you. To learn more about Dr. Devon Hjelm and the very latest in deep learning research, visit microsoft.com slash research.
