Microsoft Research Podcast - 101 - Going meta: learning algorithms and the self-supervised machine with Dr. Philip Bachman
Episode Date: December 4, 2019
Deep learning methodologies like supervised learning have been very successful in training machines to make predictions about the world. But because they're so dependent upon large amounts of human-annotated data, they've been difficult to scale. Dr. Phil Bachman, a researcher at MSR Montreal, would like to change that, and he's working to train machines to collect, sort and label their own data, so people don't have to. Today, Dr. Bachman gives us an overview of the machine learning landscape and tells us why it's been so difficult to sort through noise and get to useful information. He also talks about his ongoing work on Deep InfoMax, a novel approach to self-supervised learning, and reveals what a conversation about ML classification problems has to do with Harrison Ford's face. https://www.microsoft.com/research
Transcript
Training a machine to look at a large amount of unannotated data and point to specific examples and say,
well, I think if a human comes in and tells me exactly what that thing is, I'll learn a lot about the problem that I'm trying to solve.
So this general notion of carefully selecting which of those examples you want to spend the money or spend the time to get a human to go in and provide the annotations for those examples. That's this idea of active learning.
You're listening to the Microsoft Research Podcast, a show that brings you closer to
the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga.
Deep learning methodologies like supervised learning have been very successful in training
machines to make predictions about the world.
But because they're so dependent on large amounts of human-annotated data, they've been
difficult to scale.
Dr. Phil Bachman, a researcher at MSR Montreal, would like to change that, and he's working
to train machines to collect, sort, and label their own data so people don't have to.
Today, Dr. Bachman gives us an overview of the machine learning landscape
and tells us why it's been so difficult to sort through noise and get to useful information.
He also talks about his ongoing work on Deep InfoMax, a novel approach to self-supervised learning,
and reveals what a conversation about ML classification problems
has to do with Harrison Ford's face. That and much more on this episode of the Microsoft Research
Podcast. Phil Bachman, welcome to the podcast. Hi, thanks for having me.
So as a researcher at MSR Montreal, you've got a lot going on.
Let's start macro and then get micro.
And we'll start with a little phrase that I like in your bio that says you want to understand the ways in which actionable information can be distilled from raw data.
Unpack it for us. What big problem or
problems are you working on? What gets you up in the morning? So I'd say the key here is to sort
of understand the distinction between information in general and let's say information that might
be useful. So for example, if the images are coming from the camera that you're using to pilot a
self-driving car,
then low-level sensor noise probably doesn't provide you useful information for deciding whether to stop the car or whether to turn or make other sorts of decisions that are useful
for driving. So what I'm interested in, sort of this phrase actionable information here,
it's referring specifically to trying to focus on getting our
models to capture the information content in the data that we're looking at that is actually going
to be useful in the future for making some sorts of decisions. So if we're training a model that's
processing the video data that's being used to drive this car, then perhaps we don't want to
waste the effort of the model on trying
to represent this low-level information about small variations in pixel intensity.
And we'd rather have the model focus its capacity for representing information on the information
that corresponds to sort of higher-level structure in the image, so things like the presence
or absence of a pedestrian or another car in front of it.
So that's kind of what I mean with this phrase actionable information.
So this distillation from raw data is about learning from data that hasn't been manually curated or that doesn't have a lot of information injected into it by a human who's doing the
data collection process.
So going back to the self-driving car example, I'd like to have a system where we could allow the computer just to watch thousands of hours
of video that's captured from a bunch of cars driving around. Then what I want to be able to
do is have a system that's just watching all of that video and doesn't require that much input
from a person who's pointing to the video and saying specifically what's going to be interesting
or useful in the future. So this information that's going to be useful for
performing the types of tasks that we want our model to do eventually. Before we get specific,
give us a short historical tour of the deep learning methodologies as a level set,
and then tell us why we need a methodology for learning representations from unlabeled data.
Okay, so in the context of machine learning, people often break it down into three categories.
So there will be supervised learning, unsupervised learning, and reinforcement learning.
And it's not always clear what the distinctions between the methods are, but supervised learning is sort of what's had the most immediate success and what's driving a lot of the deep learning-powered technologies that are being used for doing things like speech recognition in phones or doing
automated question answering for chatbots and stuff like that. So supervised learning refers to
kind of a subset of the techniques that people apply when they have access to a large amount of
data and they have a specific type of action that they want a model to perform when it processes
that data.
And what they do is they get a person to go and label all the data and say, okay, well, this is the input to the model at this point in time.
And given this input, this is what the model should output.
So you're putting a lot of constraints on what the model is doing and constructing those constraints manually by having a person looking at a set of a million images.
And for each image, they say, oh, this is a cat, this is a dog, this is a person, this is a car. So after having done that for thousands of hours, you now have a large data set where you have a bunch of different images
and each of those images has an associated tag. And so now the kind of techniques that we work
with and the optimization methods that we use for training our models are very effective at fitting really large, powerful models
to large amounts of this sort of annotated data.
So that's kind of the traditional supervised learning.
But the major downside there is that the process of providing all of those annotations
can be very expensive.
So that process of supervised learning has a lot of issues with scalability. What we'd like to do ideally is make use of a lot of that unlabeled data and figure out what kinds of information are actionable. So finding the information that seems like it will
be useful for making decisions. So that's getting into a contrast between supervised learning and
unsupervised learning. Then there's also reinforcement learning, which is a slightly
different set of techniques where you actually allow a model to go out and kind of perform
experiments or try to do things. And then somehow it receives feedback about the things that it's
doing that says, oh, what you just did, that was a good thing or that was a bad thing. And then it
learns by kind of a process of trial and error. So that's a general idea of reinforcement learning.
Okay. We mentioned two flavors of this, unsupervised and then self-supervised. Is that another differentiation there? So the self-supervised learning, it's not a completely
different thing, but it's a sort of subset of those types of techniques. So the general idea
behind self-supervised learning is that we try
to design procedures that will generate little supervised learning problems for a model to solve,
where the process of generating those little supervised learning problems is kind of automatic.
And the hope here is that the kind of procedurally generated supervised learning problems that our
little algorithm is generating based on the unlabeled data will force the model to capture some useful information about the structure of that data
that will allow it to answer more sort of human-oriented questions more easily in the future.
So just to clarify this concept of procedurally generating supervised learning problems,
one really simple example would be that you could try to train a model to have some understanding of the statistical
structure of visual data by showing a model a bunch of images. But what you do is you take
each image and you split it into a left half and a right half. So now what you do is you take your
model and all the model is allowed to see is the left half of the image. And then you have another model that sort of tries to form a
representation of the right half of the image. And so the model that looked at the left half
of the image, you present it with representations of the right halves of like, let's say 10,000
images, one of which is the image that it looked at. So it's kind of got like a partner that it's looking for
in this big bag of encoded right halves of images.
And the job of the encoder that's processing the left half of the image
is to be able to look in that bag and pick out the right half
that actually corresponds to the image that it originally came from.
So in this case, we're taking something that looks like unsupervised learning,
but instead here what we're doing is treating it more like a supervised learning problem.
So the model that looks at the left half of the image,
its task is to solve something that looks like just a simple classification problem, in this case something like a 10,000-way classification problem.
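To make that split-image setup a bit more concrete, here is a minimal sketch in PyTorch of the kind of procedurally generated classification problem being described. The encoder architectures, image sizes, and batch size here are placeholder assumptions for illustration, not the exact models Bachman uses; the point is just that each batch of unlabeled images yields its own matching problem, where the left-half encoding has to pick out its true right-half partner from a bag of candidates.

```python
import torch
import torch.nn.functional as F

# Hypothetical encoders: any networks mapping image halves to vectors would do.
left_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 32, 128))
right_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 32, 128))

def split_halves(images):
    # images: (batch, 3, 64, 64) -> left and right halves, each (batch, 3, 64, 32)
    return images[..., :32], images[..., 32:]

def matching_loss(images):
    left, right = split_halves(images)
    z_left = left_encoder(left)        # (batch, 128)
    z_right = right_encoder(right)     # (batch, 128) -- the "bag" of candidate right halves
    # Score every left half against every right half in the batch.
    scores = z_left @ z_right.t()      # (batch, batch)
    # For row i, the correct "partner" is column i: a batch-way classification problem.
    targets = torch.arange(images.size(0))
    return F.cross_entropy(scores, targets)

# Usage sketch: each batch of unlabeled images generates its own supervised problem.
images = torch.randn(256, 3, 64, 64)   # stand-in for real unlabeled images
loss = matching_loss(images)
loss.backward()
```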
The other thing that comes to my mind is there's this weird thing on the internet where,
like Harrison Ford, you see half of his face and the other half of his face, and they're
completely different. Like, if you put two of the same half together, it wouldn't look like Harrison Ford, but together, the different halves look like him. So, that would really trick the machine, I would think. Actually, I wouldn't be so confident about that.
I would think. Actually, I wouldn't be so confident about that.
Really? Yeah. The question that you're sort of training the machine to answer is,
which of these possible things do you think is most likely associated with the thing that you're
currently looking at? So, unless there was somebody else's right face half that looked
significantly more Harrison Ford-ish than his own right face half,
then the model actually could do pretty reasonably, I'd expect.
That's hilarious.
So unless you had somebody where it was like this really strict dichotomous separation of
the halves of their face, like Two-Face from Batman or something.
Right. That's another one.
In which case maybe the model would fail. But if it's within the standard realm of human
variability, I think it would be okay.
Well, that's good. So let's move ahead to the algorithms that we're talking about here. And
you call them learning algorithms. And you've described your goal for learning algorithms in
some intriguing ways. You want to train machines to go out and fetch data for themselves and
actively find out about the world. And you want to get the machine
to ask itself interesting questions. So it begins to build up its own knowledge base.
Tell us about these learning algorithms for active learning and what it takes
to turn a machine into an information seeking missile.
Yeah. So this kind of overall objective there that you've described
is targeted at kind of expanding the scope
of which parts of the problems that we're currently trying to solve are solved by the
machine rather than by a person who is acting as a shepherd for the machine or as a teacher
or something along those lines. So right now, the machine learning component of most systems is
a very important part of the system, but there's a whole lot of human effort that surrounds the production and use of something
like a practical image classifier or a practical machine translation system. So that's one part of
the effort that's required for getting an automated system out there in the world. So part of the
process is just the initial decision, like the thing that we want to do is machine translation. Here's a way of formalizing that problem and specifying it such that we can
go out and now perform another part of the process. So this other part of the process
is data collection. So you'd have to go out and you'd have to explicitly collect a lot of data
that is relevant to the task that you're trying to solve. And then you have to take that data and you maybe have to have somebody curate it
to make that data more directly useful
or more immediately useful for the kinds of algorithms
that we tend to use right now.
So a lot of the work that I want to do
is about trying to reduce the amount of human effort
that's required on those two fronts
and trying to get as much of those two parts of the problem
automated and built into the models that we're training
so that we don't have to go out and manually annotate all the data.
Talk to me about the technical end of that. Our listeners are pretty sophisticated
and you're talking about algorithms that are training a machine to do something for itself.
Go a little deeper there. Okay, yeah. I'll kind of jump into the learning algorithms for active learning part, which I guess
I actually completely skipped over as I was answering the question before. So, training a
machine to go out and collect its own data and point to specific examples and say, well, I think
if a human comes in and tells me exactly what that thing is, I'll learn a lot about the problem that I'm trying to solve.
So this general notion of carefully selecting which of those examples you want to spend the money
or spend the time to get a human to go in and provide the annotations for those examples,
that's this idea of active learning.
So rather than just assuming that you have a huge batch of data and all of the data is labeled, a lot of practical problems are structured more like you have a lot
of unlabeled data and you have to decide how to collect data and apply labels to it so that you
can then train a model. So to do this efficiently, you take some of the data, you train a model,
and then you look at what the model is doing and you try to figure out where it's weak and where
it's strong. And based on where it's weak and where it's strong, you use that to try and decide how to go out and pick other examples specifically so that you can minimize the amount of data that you have to collect and provide annotations to such that you end up with a model that makes good predictions at the end.
So that's active learning. A lot of existing approaches to it revolve around assumptions about what kind of classifier
or what kind of decision function you're going to train on that data that you're collecting the
labels for. So there might be assumptions that all of the data already has some sort of fixed
representation, and then you're going to feed that representation into a linear classifier,
for example. And if you make that kind of assumption, then there might be very good
heuristics for going out and deciding which particular sets of examples you want to apply labels to, so that you can minimize the uncertainty and minimize the number of errors that's made by this linear classifier.
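As a rough illustration of that select-label-retrain loop, here is a minimal uncertainty-sampling sketch, one common active learning heuristic rather than any specific method discussed here. The toy dataset, the logistic regression model, the label_by_human stand-in, and the query budget are all hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder pools: a few labeled seed examples plus a large unlabeled pool.
rng = np.random.default_rng(0)
X_labeled, y_labeled = rng.normal(size=(20, 5)), rng.integers(0, 2, size=20)
X_unlabeled = rng.normal(size=(1000, 5))

def label_by_human(x):
    # Stand-in for the expensive step: asking a person to annotate one example.
    return int(x.sum() > 0)

for _ in range(10):                           # query budget: 10 human annotations
    model = LogisticRegression().fit(X_labeled, y_labeled)
    probs = model.predict_proba(X_unlabeled)  # model's confidence on the unlabeled pool
    # Pick the example the model is least sure about (closest to 50/50).
    uncertainty = 1.0 - probs.max(axis=1)
    idx = int(uncertainty.argmax())
    x_new = X_unlabeled[idx]
    y_new = label_by_human(x_new)             # the only human effort in the loop
    X_labeled = np.vstack([X_labeled, x_new])
    y_labeled = np.append(y_labeled, y_new)
    X_unlabeled = np.delete(X_unlabeled, idx, axis=0)
```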
But for working with more complicated data, or in scenarios where you also want to learn a powerful representation of the data at the same time that you're collecting the data and applying labels, you might want to sort of transform this process, where you decide on what the model is going to be and then you sit down for weeks or years and come up with a very clever heuristic for how to collect data efficiently to make that model succeed when it has a small amount of labeled data. And you'd like to replace some of those more effort-intensive parts of the
process with a machine that can kind of train itself to learn what kinds of data
it's going to need at the same time that you're also training the model that's
making the prediction. Let's spend some time talking about your current research, and there's a lot of flavors to it.
Let's start with what you're calling Deep InfoMax, or DIM. But I want to point out, too,
that in addition to Deep InfoMax, you have augmented multi-scale Deep InfoMax or AMDIM,
spatio-temporal Deep InfoMax, Deep Graph InfoMax. There's a lot of sort of offshoots, I guess you
might call it. So I'm going to go sort of free range here, because you'll be able to give us a better guided tour of the main idea and all the offshoots than I will. Tell us about the Deep InfoMax research family
and what you're up to. Okay. So the kind of higher level idea that ties these things together
is the idea that we want to learn to represent the data that we're looking at. So sometimes that data
might be text, sometimes it might be images, or in the case, for example, of the Deep Graph InfoMax, it might be a graph.
So the overall higher level idea of Deep InfoMax is that we want to form representations that
act a little bit like an associative memory.
Kind of going back to what I was saying about the thing with the split faces before, we
can think of the left half of a face and the right half of
a face sort of as random variables. So you can think of just sampling the left half of a face
and there might be slightly different versions of the right half of that face that are all sort of
valid. So looking at the left half, I guess as you're getting at with the Harrison Ford thing,
the right half isn't always perfectly determined.
But you can think of the distribution
of all possible right half faces,
and the variability there is much broader
than the variability that you have
if you're just looking at
what is the right half of Harrison Ford's face
given that we're looking at the left half.
So the mutual information between our representation of the left half of the face and the right half of the face is high, when our ability
to predict what the right half of the face looks like is very good relative to how well we could
predict what the right half of the face looks like in the case where we didn't get to see the left
half. If we were just looking at a bunch of different images that had the same shape as the images of the right halves of a face,
these images have a lot of variability in their structure.
Like some of them, it might be the back half or the front half of a car
or something like that, and it looks very different from faces.
So in principle, we can sort of make a reasonable prediction,
for example, of whether or not the image that we're looking at right now
encodes the right half of a face, but there's still some uncertainty there.
And then when we add in the left half of Harrison Ford's face, and we're trying to say, okay,
well, out of the distribution of things that look like the right halves of faces, which
ones correspond to Harrison Ford, the more precisely we can make that guess, the higher
the mutual information is between our representation of the left half and the right half of the
face.
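To put that intuition in symbols, using the standard definition of mutual information (not anything specific to Deep InfoMax), with L and R standing for the representations of the left and right halves of a face:

```latex
I(L; R) \;=\; H(R) - H(R \mid L) \;=\; \mathbb{E}_{p(l,\, r)}\!\left[\log \frac{p(l, r)}{p(l)\, p(r)}\right]
```

The mutual information is high exactly when seeing the left half sharply reduces our uncertainty about what the right half looks like, relative to guessing from the broad distribution over all plausible right halves.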
Well, let me ask you to go a little deeper on the technical side of this.
You sent me a slide that has a lot of algorithmic explanation of Deep InfoMax, and then how you kind of take that further with augmented multi-scale Deep InfoMax?
So the actual mutual information aspect sort of formally, the way it shows up here is that we sample this kind of true pair of corresponding image and audio sample.
And then we have a distribution from which we can sample just another random audio sample.
And we can sample maybe, say, 1,000 of those other random audio samples,
and we can encode them with our audio encoder.
And then we can sort of present a little classification problem
to the model that looked at the image,
where that classification problem is telling the model
that looked at the image to identify which among,
let's say, 1,001 audio recordings
is the audio recording that comes
from that same point in time. So the mutual information here, what we're doing kind of
more technically is we're constructing a lower bound on the mutual information between the
random variables corresponding to the representation of the image and the representation of the audio
modality. So we first draw a sample from the joint distribution
of the representations of those two modalities.
And then we also have to sample a lot of samples
from what's called the marginal distribution
of that second random variable,
which is the representations of the audio modality.
So we draw, say, a thousand samples
from that marginal distribution.
And then we construct this little classification problem
where the model is trying to identify
which of the audio samples
was the sample from the true joint distribution
over audio and visual data,
and which of the samples just came from
random samples from the marginal distribution.
So this is a technique called noise contrastive estimation
that's been developed and applied
in a lot of different scenarios.
So a good example of this is techniques that have been used for training word vectors.
But in the case where we're using it, it's a technique that can be used for constructing kind of a formally correct lower bound on the mutual information between these two random variables, one of which corresponds to samples of visual data and one of which corresponds to samples of audio data.
And the joint distribution over those two kind of random variables is constructed by
just going around the world with a camera and a microphone and just taking little snippets
of visual and audio information from different points in time and in different scenes.
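Here is a minimal sketch of that noise contrastive construction in PyTorch, with one sample from the joint distribution of co-recorded image and audio plus 1,000 samples from the audio marginal. The encoders and feature sizes are placeholder assumptions; the comment notes the standard InfoNCE-style relationship between this classification loss and the mutual information lower bound.

```python
import torch
import torch.nn.functional as F

# Hypothetical encoders; architectures and feature sizes are placeholders.
image_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
audio_encoder = torch.nn.Linear(160, 128)   # assumes 160 audio features per clip

def nce_bound_loss(image, true_audio, marginal_audio):
    # image, true_audio: one sample from the joint distribution (same time and scene).
    # marginal_audio: (K, 160) audio clips drawn independently, i.e., from the marginal.
    z_img = image_encoder(image.unsqueeze(0))                          # (1, 128)
    candidates = torch.cat([true_audio.unsqueeze(0), marginal_audio])  # (K + 1, 160)
    z_aud = audio_encoder(candidates)                                  # (K + 1, 128)
    scores = (z_img @ z_aud.t()).squeeze(0)                            # (K + 1,) compatibility scores
    # A (K + 1)-way classification problem: index 0 is the clip from the joint distribution.
    loss = F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))
    # Noise contrastive estimation gives I(image repr; audio repr) >= log(K + 1) - loss,
    # so driving this classification loss down pushes up a lower bound on the mutual information.
    return loss

# Usage sketch with stand-in data: one true pair plus 1,000 marginal audio samples.
image = torch.randn(3, 32, 32)
true_audio = torch.randn(160)
marginal_audio = torch.randn(1000, 160)
nce_bound_loss(image, true_audio, marginal_audio).backward()
```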
All right.
Well, as you describe Deep InfoMax, and then you have augmented multi-scale Deep InfoMax, you call that improving Deep InfoMax based on some limitations in the prior. How would you differentiate how the augmented multi-scale Deep InfoMax is better than the original? So the original idea, depending on specifically how you implement it, has some significant downsides in some sense.
The original Deep InfoMax was just looking at a single version of a single image.
And in this case, there's sort of an issue where if you're just looking at this single image
and let's say encoding all of the little patches in the image,
the way that the original Deep InfoMax presentation kind of goes is that you
take that image, you encode each of the patches, and you also encode the whole image. And so here
we're going to sort of train the representation of the whole image such that it can look at all
of these patches and say that, oh, those are patches that came from my image. So this is a
little bit like that idea of associative memory, but it's applied on sort of a single input.
So kind of procedurally how you would do this
is that you would take an image, you would encode it,
you get representations of all the little patches,
and you get a representation of the whole image.
And now you're gonna construct
a little classification problem
where you take a thousand other images
and you also encode their patches.
And you sort of mix them into a bucket
with all the patches from the original image
that you computed a full image encoding for.
And the job of the full image encoding
is to look in that bucket
and essentially pick out all the patches
that are part of its image.
So one of the difficulties here,
like one of the shortcomings
of that particular way of formulating it,
if you take that more restrictive interpretation,
the main downside is that the encoder that's processing the full image can basically just memorize the content that's there. And it's fairly
easy for the model to just copy that information into the representation of the whole image.
And essentially, it's just a memory that stores the representations of all the little patches.
There might be some areas in which this is useful, but for some types of predictive tasks, it might not be so useful, because you're not really asking the representation of the whole image to answer sort of interesting predictive problems about what kinds of other things you might see that weren't explicitly in the image that you're looking at now. So going back to the left halves and right halves of faces, if instead of looking at the left half and the right half, all you did was you showed your encoder
this left half of the face, you encoded it to a small vector, and then you showed it the same
left half again. And you said, is this the one that you looked at before? The model might not
actually have to learn that much to be able to solve that task really well. But if you take it
and you change it to a task where the kinds of predictions that you're forcing that representation to make are a little bit more interesting, you can ask a more
interesting question, which is like, did this eye come from the right half of the face whose left
half you looked at? So here now the model is answering kind of a more challenging question.
This is one of the main changes that we make when we go from the original formulation of Deep InfoMax to this augmented Deep InfoMax. So this is the augmented part, not the multi-scale part. That's another thing, where we're also looking at multiple scales of representation. But if we just look at the
augmented part, kind of the big improvement there is that we're forcing the model to answer questions
or form an associative memory where the associations that we're forcing it to make
are more challenging to make so that the model has to put a little more effort into how it's going to represent the data.
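As a rough sketch of that augmented change (the multi-scale part is omitted), the key difference is that the global summary comes from one randomly augmented view of an image while the local patch features come from a second, independently augmented view, so the model can no longer solve the task by simply copying pixels into memory. The toy augmentation, the single convolutional encoder, and the choice of one positive patch per image are all simplifying assumptions for illustration, not the actual AMDIM architecture.

```python
import torch
import torch.nn.functional as F

def augment(images):
    # Toy augmentation: a random horizontal flip plus slight intensity jitter, so the
    # two views of each image are never identical (real AMDIM uses heavier augmentation).
    if torch.rand(1) < 0.5:
        images = torch.flip(images, dims=[3])
    return images + 0.05 * torch.randn_like(images)

# Hypothetical conv encoder producing a grid of local features plus a global summary vector.
conv = torch.nn.Conv2d(3, 128, kernel_size=8, stride=8)        # (N, 128, 8, 8) local features

def encode(images):
    local = conv(images)                                        # (N, 128, 8, 8)
    global_vec = local.mean(dim=(2, 3))                         # (N, 128) global summary
    return local.flatten(2), global_vec                         # locals as (N, 128, 64)

def amdim_style_loss(images):
    view_a, view_b = augment(images), augment(images)           # two different views
    _, global_a = encode(view_a)                                 # global code from view A
    local_b, _ = encode(view_b)                                  # local patch codes from view B
    n, d, p = local_b.shape
    # Score the global code of each A-view against every patch of every B-view.
    scores = torch.einsum('nd,mdp->nmp', global_a, local_b).reshape(n, n * p)
    # Positive target: a patch belonging to the matching image; for simplicity we use
    # just patch 0 of the corresponding B-view as the correct answer.
    targets = torch.arange(n) * p
    return F.cross_entropy(scores, targets)

images = torch.randn(32, 3, 64, 64)    # stand-in for a batch of unlabeled images
amdim_style_loss(images).backward()
```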
I like to explore consequences, Phil, both intended and otherwise, that new technologies inevitably have on society.
And this is the part of the podcast where I always ask, what can possibly go wrong?
So you're working in a lab that has a stated aim of teaching machines to read, think, and communicate like humans.
Is there anything about that that keeps you up at night?
And if so, what is it?
And more importantly, what are you doing to address it?
So we do have a group here that's working on what we call
fairness, accountability, transparency, and ethics.
So it's the FATE group.
So they're working on a lot of questions that are, let's say,
immediately relevant as opposed to questions that are kind of long-term relevant or irrelevant, depending on your perspective.
So there's this idea of existential risk, which is more of a long-term question.
So this is the kind of question like, well, if we develop a superhuman AI, is it going to care about us and take care of us or is it going to consume us in its quest for more resources?
So we'll set that aside, and so the more immediately salient one is the kinds of things that the FATE group is looking at. And so these are things like, well, if we're training a system that's going to sit at a bank and analyze people's credit history, are there historical trends in the
data that might be due to systemic discrimination
or systemic disadvantaging of particular groups of people that are going to be reflected in the
data that we use to train our system such that then when the system goes to make decisions,
it's kind of implicitly or accidentally discriminating against these groups just due
to the fact that they were also historically discriminated against, and that's reflected in the data that we're using to train the system. So for me personally, a great thing that I could do would be to create something that's like the internal combustion
engine of machine learning or even like the steam engine. Those things have had an incredible effect
on society, and that's been very empowering and it's helped with a lot of progress, but it also makes it easier for people to do bad things at scale. So I'm kind of more worried about that type of problem, and I think that that type of problem isn't necessarily a technological problem. It's a little bit more of a system or social problem, because I think the technology is going to happen. And so kind of the things that
worry me there are along the lines of like seeing the technology and the way in which it increases
people's leverage over the world and ability to affect it kind of at scale. I guess for me on a
day-to-day basis, like I don't think about it too much as I'm doing research because to me, again,
it's not really so much of a technical problem. I think it would be very hard to design the technology so that it can't do
bad things. Well, listen, I happen to know you didn't start out in Montreal. So tell us a little
bit about yourself. What got a young Phil Bachman interested in computer science and how did he land
at Microsoft Research in Montreal? I kind of always grew up with a computer in the home.
I was fortunate in that sense
that I was always around computers
and I could use them for playing games
and I could do a little bit of programming.
And I'm not old,
but I'm not in the youngest demographic
that you would see working in tech.
And one of the things that I really liked
when I was in high school,
I started playing a lot of these first-person games where you kind of run around and you shoot things.
For better or worse, it was fun.
So one of the things that was a challenge at first for me was I didn't have great internet.
So what I would do is go to the school library and look around.
And it turned out that you could download some bots that people had made.
So you could sort of fake the multiplayer kind of experience.
So I thought that was really cool, and one of the things I had, you know, started thinking about there was, okay, well, you know, what is it that these bots are actually doing? So I'd been doing a little bit of coding and, like, making some little simple games. So thinking about that, like, how would we automate
this little thing that kind of is fairly simple at its core, but that when we let it loose in this
environment, so like when we let it run around and compete with the other players, it does something
interesting and fun. And so that was sort of always at the back of my mind a bit, I guess.
And I bounced around a little bit academically and started doing research in a slightly different
field. But then eventually I kind of sat around and watched a bunch of online
lectures. And there were a couple areas of machine learning, like reinforcement learning, for example,
that really started to click with me and that I was excited about because it was getting back to
those kinds of questions I'd asked myself about before, like, how do we get this little bot to
do interesting things? So that brought me from Texas, because I was in grad school in Texas after having done my undergraduate studies in New York. But then I found this group that was in Montreal doing reinforcement learning. So I came and I worked with that group, and then instead of the jobs that were available elsewhere, an exciting opportunity popped up here. There was a startup called Maluuba that was based out of Waterloo, and it was developing kind of technology and software for doing virtual personal assistants, and the company wanted to sort of start getting more aggressive about pushing their technology forward. So they came to Montreal, because there was a lot of cool machine learning stuff happening in Montreal, and then opened a research lab. And basically, as those
lab doors were opening, I walked in and joined the company. And then about a year later,
we were actually acquired by Microsoft. So that's how I ended up in MSR.
Well, at the risk of heading into uncomfortable icebreaker question territory, Phil, tell us one interesting thing about yourself that people might not know and how has it influenced your career as a researcher?
And even if it didn't, tell us something interesting about yourself anyway.
Personally, I'd say one thing that I've always enjoyed is being fairly involved in at least one type of, let's say, goal-oriented physical activity. That's a
super weird sounding description. But for example, as an undergrad, I did a lot of rock climbing.
So having that as a thing where I could just really be focused and apply myself to solving
problems in some sense. A lot of climbing is about kind of planning out what you're going to do. And
it's a little bit like solving a puzzle sometimes. And having that as a thing that's sort of separate
from the work I do, but that still is kind of mentally and also physically active and being
able to kind of apply myself to that strongly. I don't do the rock climbing specifically anymore,
but what I do now is I play a lot of soccer. So I really enjoy the combination of the physical
aspect as well as the mental aspect of the game.
So there's a lot of extemporaneous kind of inventive thinking, and it can be very satisfying
when you kind of do something that's exactly right at exactly the right time, especially when you
realize later that you didn't really even think about it. It just sort of happened. I guess that
might be related to some of the better moments as a researcher that you have when you're trying to solve a problem
and you're just kind of messing around and then something just sort of clicks
and you just kind of see how you should do it.
At the end of every podcast, I give my guests the proverbial last word.
So tell our listeners, from your perspective, what are the big challenges out there right now, and research directions that might address them, when we're talking about machine learning research,
and what's hype and what's hope and what's the future? I guess one that I would say is filtering
through all the different things that people are writing and saying and trying to figure out which
parts of what they're saying seem new, but they're really just kind of a rewording of some concept that
you're familiar with. And you just kind of have to rephrase it a little bit and then see how it
fits into your existing internal framework and being able to use that ability to figure out
what's new and what's different and figure out how it differs from what people were trying before. And that allows you to
be kind of more precise in your guesses about what is actually important. But a lot of that
sort of washes out in the end and it doesn't really survive that long. Sort of at the beginning
as a researcher, you have to rely on other people because you don't really know where you're going
yet. But over time, taking those training wheels off a little bit
and developing your own personal internal framework
for how you think about problems
so that when you get new information,
you can kind of quickly contextualize it
and figure out which are the new bits
that are actually going to change the way that you look at things
and which bits are sort of just a different version
of something that you already have.
Phil Bachman, thanks for joining us from Montreal today.
Yeah, thanks for having me.
To learn more about Dr. Phil Bachman and the latest research in machine learning,
visit Microsoft.com slash research.