Microsoft Research Podcast - AI Frontiers: A deep dive into deep learning with Ashley Llorens and Chris Bishop
Episode Date: December 18, 2023

Powerful large-scale AI models like GPT-4 are showing dramatic improvements in reasoning, problem-solving, and language capabilities. This marks a phase change for artificial intelligence, and a signal of accelerating progress to come.

In this Microsoft Research Podcast series, AI scientist and engineer Ashley Llorens hosts conversations with his collaborators and colleagues about what these models, and the models that will come next, mean for our approach to creating, understanding, and deploying AI, its applications in areas such as healthcare and education, and its potential to benefit humanity.

This episode features Technical Fellow Christopher Bishop, who leads a global team of researchers and engineers working to help accelerate scientific discovery by merging machine learning and the natural sciences. Llorens and Bishop explore the state of deep learning; Bishop's new textbook Deep Learning: Foundations and Concepts, his third and a writing collaboration with his son; and a potential future in which "super copilots" accessible via natural language and drawing on a variety of tools, like those that can simulate the fundamental equations of nature, are empowering scientists in their pursuit of breakthroughs.

Learn more:
Deep Learning: Foundations and Concepts | Textbook, 2023
Pattern Recognition and Machine Learning | Textbook, 2006
Neural Networks for Pattern Recognition | Textbook, 1995
Transcript
I'm Ashley Llorens with Microsoft Research.
I've spent the last 20 years working in AI and machine learning,
but I've never felt more excited to work in the field than right now.
The latest foundation models and the systems we're building around them are exhibiting
surprising new abilities in reasoning,
problem-solving, and translation across languages and domains.
In this podcast series, I'm sharing conversations with fellow researchers about the latest developments in AI models,
the work we're doing to understand their capabilities and limitations,
and ultimately, how innovations like these can have the greatest benefit for humanity.
Welcome to AI Frontiers.
Today, I'll speak with Chris Bishop. Chris was educated as
a physicist, but has spent more than 25 years as a leader in the field of machine learning.
Chris directs our AI for Science organization, which brings together experts in machine learning
and across the natural sciences with the aim of revolutionizing scientific discovery.
So Chris, you have recently published a new textbook on deep learning.
Maybe the new definitive textbook on deep learning, time will tell.
So of course I want to get into that.
But first, I'd like to dive right into a few philosophical questions.
In the preface of the book, you make reference to the massive scale of state-of-the-art language models, generative models comprising on the order of a trillion learnable parameters. How well do we really understand systems at that scale?

In one sense, we understand the systems extremely well, because we designed them, we built them. But what's very interesting about
machine learning technology compared to most other technologies is that the functionality
in large part is learned, is learned from data. And what we discover in particular with these
very large language models is kind of emergent behavior. As we go up in each factor of 10 in scale, we see qualitatively
new properties and capabilities emerging. And that's super interesting. That was called the
scaling hypothesis, and it's proven to be remarkably successful.

Your new book lays out foundations in statistics and probability theory for modern machine learning. Central to those
foundations is the concept of probability
distributions, in particular, learning distributions in the service of helping a machine perform a
useful task. For example, if the task is object recognition, we may seek to learn the distribution
of pixels you'd expect to see in images corresponding to objects of interest, like a teddy bear or a race
car. On smaller scales, we can at least conceive of the distributions that machines are learning.
What does it mean to learn a distribution at the scale of a trillion learnable parameters?
Right, it's really interesting. So first of all, the fundamentals are very solid. The fact that we have this
foundational rock of probability theory on which everything is built is extremely powerful. But then these emergent properties that we talked about are the result of extremely complex
statistics. What's really interesting about these neural networks, let's say in comparison with the
human brain, is that we can perform perfect diagnostics on them. We can understand exactly what each neuron is doing at each moment of time.
And so we can almost treat the system in a somewhat experimental way.
We can probe the system.
You can apply different inputs and see how different units respond.
You can play games like looking at a unit that responds to a particular input
and then perhaps amplifying that response,
adjusting the input to make that response stronger, seeing what effect it has, and so on.
So there's an aspect of machine learning these days that's somewhat like experimental neurobiology,
except with a big advantage that we have sort of perfect diagnostics.
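To make this concrete, here is a minimal sketch of the kind of probing Bishop describes: gradient ascent on the input to amplify one unit's response. The two-layer model, layer choice, and unit index are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

# Toy network; in practice this would be a trained model of interest.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
model.eval()

x = torch.randn(1, 64, requires_grad=True)   # candidate input to probe with
target_unit = 5                              # hidden unit whose response we amplify

for _ in range(100):
    hidden = model[1](model[0](x))           # activations after the ReLU layer
    response = hidden[0, target_unit]
    response.backward()
    with torch.no_grad():
        x += 0.1 * x.grad                    # adjust the input to strengthen the response
        x.grad.zero_()

print("amplified response:", response.item())
```

Reading off activations and gradients this way is the "perfect diagnostics" advantage: every internal quantity is exactly observable, unlike in a biological network.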
Another concept that is key in machine learning is generalization.
In more specialized systems, often smaller systems, we can actually conceive of what we might mean by generalizing.
In the object recognition example I used earlier, we may want to train an AI model capable of recognizing any arbitrary image of a teddy bear.
Because this is a specialized task, it is easy to grasp what we mean by generalization.
But what does generalization mean in our current era of large-scale AI models and systems?
Right. Well, generalization is a fundamental property, of course. If we couldn't generalize,
there'd be no point building these systems. And again, these foundational principles apply equally at a very large scale as they do at a smaller scale.
But the concept of generalization really has to do with modeling the distribution from which the data is generated.
So if you think about a large language model, it's trained by predicting the next word or predicting the next token.
But really what we're doing is creating a task for the model that forces it to
learn the underlying distribution. Now that distribution may be extremely complex, let's say
in the case of natural language, it can convey a tremendous amount of meaning. So really the system
is forced, in order to get the best possible performance, in order to make the best prediction for the next word, if you like, to effectively understand the meaning of the content of the data, in the case of language, the meaning of what's being said.
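As a concrete illustration of that training task, here is a hedged sketch of the next-token objective in PyTorch. The toy sizes are assumptions, and this stand-in model conditions only on the current token, where a real language model (a transformer, say) would condition on the whole prefix.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32

# Deliberately tiny stand-in for a language model: embed the current token,
# map it to logits over the next token. A real model would see the full context.
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)

tokens = torch.randint(0, vocab_size, (1, 16))   # a toy token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each next token

logits = model(inputs)                           # (1, 15, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # minimizing this cross-entropy fits p(next token | context)
```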
And so from a mathematical point of view,
there's a very close relationship
between learning this probability distribution
and the problem of data compression.
Because it turns out if you want to compress data in a lossless way,
the optimal way to do that is to learn the distribution
that generates the data.
We show that in the book, in fact.
And so the best way to...
Let's take an example of images, for instance.
If you've got a very, very large number of natural images
and you had to compress them, the most efficient way to compress them would be to understand the mechanisms by
which the images come about. There are objects, you could pick a car or a bicycle or a house,
there's lighting from different angles, shadows, reflections, and so on. And learning about those
mechanisms, understanding those mechanisms will give you the best possible compression,
but it will also give you the best possible generalization.
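The compression link Bishop mentions can be stated in one line of standard information theory; this is a sketch of the argument, not the book's derivation. If data comes from a distribution p and we compress with an optimal lossless code built on a model q, the expected code length per symbol is

```latex
\mathbb{E}_{x \sim p}\left[-\log_2 q(x)\right] \;=\; H(p) + \mathrm{KL}(p \,\|\, q) \;\ge\; H(p),
```

with equality exactly when q = p. So driving the code length down to its minimum forces the model to learn the true data distribution, which is the same thing that drives generalization.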
Let's talk briefly about one last fundamental concept, inductive bias. Of course, as you
mentioned, AI models are learned from data and experience. And my question for you is,
to what extent do the neural architectures underlying those models represent an inductive
bias that shapes the learning? This is really a good question as well,
and it sort of reflects the journey that neural nets have been on in the last 30,
35 years since we first started using gradient-based methods to train them.
So the idea of inductive bias is that actually you can only learn from data in the presence of assumptions.
There's actually a theorem called the no free lunch theorem, which proves this mathematically.
And so to be able to generalize, you have to have data and some sort of assumption, some set of assumptions.
Now, if you go back 30 years, 35 years, when I first got excited about neural nets, we had very simple one and two layer neural nets.
We had to put a lot of assumptions in.
We'd have to code a lot of human expert knowledge into feature extraction.
And then the neural net would do a little bit of the last little bit of work
of just mapping that into a sort of linear representation
and then learning a classifier or whatever it was.
And then over the years, as we learned to train bigger and richer neural nets, we can allow
the data to have more influence. And then we can back off a little bit on some of that prior
knowledge. And today, when we have models like large-scale transformers with a trillion parameters
learned on vast data sets, we're letting the data do a lot of the heavy lifting. But there always
has to be some kind of assumption.
So in the case of transformers, there are inductive biases related to the idea of attention.
So that's a specific structure that we bake into the transformer.
And that turns out to be very, very successful.
But there's always inductive bias somewhere.
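For readers who want to see that inductive bias concretely, here is a minimal sketch of scaled dot-product attention, the structural assumption baked into transformers; the shapes and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # pairwise query-key similarities
    weights = F.softmax(scores, dim=-1)            # each position attends over all positions
    return weights @ v                             # attention-weighted mix of values

q = k = v = torch.randn(1, 10, 64)            # (batch, sequence length, feature dim)
out = scaled_dot_product_attention(q, k, v)   # same shape as v
```

The bias is that every output is a data-dependent weighted combination of the inputs; which positions matter is learned, but the combination rule itself is baked in.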
Yeah, and I guess with these new generative pre-trained models,
there's also some inductive bias you're imposing
in the inferencing stage,
just with the way you prompt the system.
And again, this is really interesting.
The whole field of deep learning has become incredibly rich
in terms of pre-training, transfer learning,
the idea of prompting, zero-shot learning.
The field has exploded, really, in the last 10 years, the last five years,
not just in terms of the number of people and the scale of investment,
number of startups and so on, but the sort of the richness of ideas
and techniques like auto-differentiation, for example,
that mean we don't have to code up all the gradient optimization steps
and allow us to explore a tremendous variety of different architectures very easily, very readily. So it's become just an amazingly exciting field in the last decade.
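As a small illustration of what auto-differentiation buys you: the gradient of an arbitrary composition of operations comes for free, with no hand-coded calculus. A minimal example using PyTorch's autograd:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.sin(x) * x ** 2   # any composition of differentiable operations
y.backward()                # reverse-mode automatic differentiation
print(x.grad)               # dy/dx = cos(x) * x**2 + 2 * x * sin(x), evaluated at x = 2
```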
And I guess we've sort of intellectually pondered here in the first few minutes the current state of the field. But what was it like for you when you first used, you know, a state-of-the-art foundation model? What was that moment like for you?

Oh, I can remember it clearly.
I was very fortunate because I was given, as you were, I think, very early access to GPT-4 when it was still very secret.
And I describe it as being like the five stages of grief.
It's a sort of an emotional experience, actually.
For me, it was like a sort of first encounter
with a primitive intelligence
compared to human intelligence, but nevertheless it felt like this was the first time I'd ever engaged with an intelligence that was sort of human-like and had those first sparks of human-level intelligence. And I found myself going through these various stages, of first of all thinking, no, this is sort of a parlor trick.
This isn't real.
And then it would do something or say something that would be really quite shocking and profound, in terms of how clearly it was understanding aspects of what was being discussed.
And I had several rounds of that.
And then the next day I think, was that real?
Did I imagine that?
And go back and try again.
And no, there really is something here. So clearly, we have quite a way to go before we have systems that really match the incredible capabilities of the human brain. But nevertheless, I felt that, you know, after 35 years in the field, here I was encountering the first sparks, the first hints of real machine intelligence.

Now let's get into your book. I believe this is your third textbook. You contributed a text called Neural Networks for Pattern Recognition in '95,
and a second book called Pattern Recognition and Machine Learning in 2006, the latter still being
on my own bookshelf. So I think I can hazard a guess here,
but what inspired you to start writing this third text?
Well, actually, the story really begins with the COVID pandemic and lockdown. It was 2020. The 2006 Pattern Recognition and Machine Learning book had been very successful, widely adopted,
still very widely used,
even though it predates the deep learning revolution,
which, of course, is one of the most exciting things
to happen in the field of machine learning.
And so it's long been on my list of things to do
to update the book, to bring it up to date,
to include deep learning.
And when the pandemic lockdown arose in 2020, I found myself sort of in prison,
effectively, at home with my family, a very, very happy prison. But I needed a project. I thought
this would be a good time to start to update the book. And my son, Hugh, had just finished his
degree in computer science at Durham and was embarking on a master's degree at Cambridge
in machine learning. And we decided to do this as a joint project during the lockdown.
And we had a tremendous amount of fun together. We quickly realized, though,
that the field of deep learning is so rich and obviously so important these days,
that what we really needed was a new book rather than merely, you know, a few extra chapters or an update to a previous book.
And so we worked on that pretty hard for nearly a couple of years or so.
And then the story took another twist, because Hugh got a job at Wayve in London,
building deep learning systems for autonomous vehicles.
And I started a new team in Microsoft called AI for Science.
We both found ourselves
extremely busy and the whole project kind of got put on the back burner. And then along came GPT
and ChatGPT. And that sort of exploded into the world's consciousness. And we realized that if ever
there was a time to finish off a textbook on deep learning, this was the moment. And so the last year
has really been absolutely flat out
getting this ready, in fact, ready in time for launch at NeurIPS this year.
Yeah, you know, it's not every day you get to do something like write a textbook with your son.
What was that experience like for you?
It was absolutely fabulous. And I hope it was good fun for Hugh as well.
One of the nice things was that it was a kind of a pure collaboration.
There was no divergence of agendas or any sense of competition.
It was just pure collaboration, the two of us working together to try to understand things,
try to work out what's the best way to explain this.
And if we couldn't figure something out, we'd go to the whiteboard together
and sketch out some maths and try to understand it together.
And it was just tremendous fun, just a real pleasure, a real honour, I would say.
One of the motivations that you articulate in the preface of your book
is to make the field of deep learning more accessible for newcomers to the field,
which makes me wonder what your sense is of how accessible machine learning actually is today
compared to how it was,
say, 10 years ago. On the one hand, I personally think that the underlying concepts around
transformers and foundation models are actually easier to grasp than the concepts from previous
eras of machine learning. Today, we also see a proliferation of helpful packages and toolkits that people can pick up and use.
And on the other hand, we've seen an explosion in terms of the scale of compute necessary to do research at the frontiers.
So what's your sense of how accessible machine learning is today?
I think you've hit on some good points there. I would say that the field of machine learning has really been through three eras. The first was the focus on neural networks. The second was when neural networks sort of went onto the back burner. As you hinted there, there was a proliferation of different ideas, Gaussian processes, graphical models, kernel machines, support vector machines, and so on, and the field became very broad. There were many different concepts to learn. Now, in a sense, it's narrowed. The focus really is on deep neural networks.
But within that field, there has been an explosion, not only in terms of the number of different architectures; the sheer number of papers published has literally exploded. And so it can be very daunting, very intimidating, I think, especially for somebody coming into the field afresh. And so really the value proposition of this book is to distill out the 20 or so foundational ideas and concepts that you really need in order to understand the field.
And the hope is that if you've really understood the content of the book, you'll be in pretty good shape to pretty much read any paper that's published.
In terms of actually using the technology in practice, yes, on the one hand, we have these wonderful packages. Auto-differentiation, which I mentioned before, is really quite revolutionary. And now you can put things together very, very quickly; there's a lot of open-source code that you can quickly bolt together and assemble lots of different things, try things out very easily.
things, try things out very easily. It's true, though, that if you want to operate at the very cutting edge
of large-scale machine learning, that does require resources on a very large scale,
so that's obviously less accessible.
But if your goal is to understand the field of machine learning,
then I hope the book will serve a good purpose there.
And in one sense, the fact that the packages are so accessible
and so easy to use
really hides some of the inner workings,
I would say, of these systems.
And so I think it's, in a way,
it's almost too easy
just to train up a neural network on some data
without really understanding what's going on.
So the book is really about,
if you like, the minimum set of things
that you need to know about in order to understand the field, not just to sort of turn the crank on a package, but really understand what's going on inside.
One of the things I think you did not set out to do, as you just mentioned, is to create an exhaustive survey of the most recent advancements, which might have been possible, you know, a decade or so ago.
How do you personally keep up with the blistering pace of research these days?
Yes, it's a challenge, of course. So my focus these days is on AI for science,
AI for natural science. But that's also becoming a very large field. But, you know, one of the
things about being at Microsoft Research
is just having fantastic colleagues with tremendous expertise. And so a lot of what I learn is from
colleagues. And we're often swapping notes on, you know, you should take a look at this paper.
Did you hear about this idea? And so on and brainstorming things together. So a lot of it is,
you know, just taking time each day to read papers. That's important. But also just conversations with colleagues.
Okay, you mentioned AI for science.
I do want to get into that.
I know it's an area that you're passionate about
and one that's become a focus for your career in this moment.
And I think of our work in AI for science
as creating foundation models that are fluent,
not in human language, but in the language of nature. And earlier in this conversation,
we talked about distribution. So I want to kind of bring it back there. Do you think we can really
model all of nature as one wildly complex statistical distribution?
Well, that's really interesting. I do think I could imagine a future, maybe not
too many years down the road, where scientists will engage with the tools of scientific discovery
through something like a natural language model. That model will also have understanding of concepts around the structures of molecules and the nature of data, will read scientific literature and so on, and be able to assemble these ideas together.
But it may need to draw upon other kinds of tools.
So whether everything will be integrated
into one overarching tool is less clear to me
because there are some aspects of scientific discovery
that are truly being revolutionized right now
by deep learning.
For example, our ability to simulate
the fundamental equations of nature is being transformed through deep learning. And the nature of that transformation,
on the one hand, it leverages architectures like diffusion models and large language models,
large transformers, and the ability to train on large GPU clusters. But the fundamental goals
there are to solve differential equations
at a very large scale.
And so the kinds of techniques we use there are a little bit different
from the ones we'd use in processing natural language, for example.
So you could imagine, maybe not too many years in the future,
where a scientist will have a kind of super co-pilot
that they can interact with directly in natural language.
And that co-pilot or system of co-pilots can itself draw upon various tools.
They may be tools that solve Schrodinger's equation to predict the properties of molecules.
It might call upon large-scale deep learning emulators
that can do a similar thing to the simulators but very much more efficiently.
It might even call upon automated
labs, wet labs that can run experiments and gather data and can help the scientist marshal
these resources and make optimal decisions as they go through that iterative scientific
discovery process, whether inventing a new battery electrolyte or whether discovering
a new drug, for example.

We talked earlier about the no free lunch theorem and the concept of inductive bias.
What does that look like here in training science foundation models?
Well, it's really interesting.
And maybe I'm a little biased because my background is in physics.
I did a PhD in quantum field theory many decades ago.
For me, one of the reasons that this is such an exciting field is that, you know, my own career
has come full circle. I now get to combine machine learning with physics and chemistry and biology.
I think the inductive bias here is particularly interesting. If you think about
large language models, we don't have very many sort of fundamental rules of language. I mean,
the rules of linguistics are really human observations about
the structure of language, but neural nets are very good at extracting that kind of structure
from data. Whereas when we look at physics, we have laws which we believe hold very accurately,
for example, conservation of energy or rotational invariance. The energy of a molecule in a vacuum
doesn't depend on its rotation in space,
for example. And that kind of inductive bias is very rigorous. We believe that it holds exactly.
And also, very often we want to train on data that's obtained from simulators.
So the training data itself is obtained by solving some of those fundamental equations.
And that process itself is computationally expensive.
So the data can often be in relatively limited supply.
So you're in a regime that's a little bit different from the large language models.
It's a little bit more like, in a way, how machine learning was, you know, 10 or 20 years ago, as you were talking about.
Data is limited, but now we have these powerful and strong
inductive biases. And so it's a very rich field of research, how to build those inductive biases into the machine learning models, but in a way that retains computational efficiency.
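To make that concrete, here is a hedged sketch of one common way to build rotational invariance into a model: compute the energy from pairwise interatomic distances only, so that rotating or translating the molecule cannot change the prediction. The tiny network and sizes are illustrative assumptions, not a production force field.

```python
import torch
import torch.nn as nn

class InvariantEnergy(nn.Module):
    """Toy energy model: a sum of learned pairwise terms over interatomic distances."""
    def __init__(self):
        super().__init__()
        self.pair_net = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))

    def forward(self, positions):                 # positions: (n_atoms, 3)
        diffs = positions[:, None, :] - positions[None, :, :]
        dists = diffs.norm(dim=-1, keepdim=True)  # distances are rotation-invariant
        n = positions.size(0)
        mask = ~torch.eye(n, dtype=torch.bool)    # exclude self-pairs
        return self.pair_net(dists[mask]).sum()

model = InvariantEnergy()
pos = torch.randn(5, 3)
rot, _ = torch.linalg.qr(torch.randn(3, 3))          # a random orthogonal matrix
print(model(pos).item(), model(pos @ rot.T).item())  # equal up to float error
```

Because the network only ever sees distances, the symmetry holds exactly by construction rather than being approximately learned from data.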
So I personally actually find this one of the most exciting frontiers, not only of the natural
sciences, but also of machine learning.

Yeah, you know, physics and our understanding of the natural world has come so far over the last centuries and decades.
And yet our understanding of physics is evolving.
It's an evolving science.
And so maybe I'll ask you somewhat provocatively if
baking our current understanding of physics into these models as inductive biases
is limiting in some way, perhaps limiting their ability to learn new physics?
It's a great question. I think for the kinds of things that we're particularly interested in
in Microsoft Research, in the AI for Science team, we're very interested in things that have real
world applicability, things to do with drug discovery, materials design. And there, first of all, we do have a very good understanding of the fundamental equations,
essentially Schrodinger's equation and fundamental equations of physics, and those inductive
biases such as energy conservation.
We really do believe they hold very accurately in the domains that we're interested in.
However, there's a lot of scientific knowledge that represents approximations to that because you can only really solve these equations exactly for very small systems.
And as you start to get to larger, more complex systems, there are, as it were, laws of physics that aren't quite as rigorous, that are somewhat more empirically derived, where there perhaps is scope for learning new kinds of physics.
And certainly as you get to larger systems, you get emergent properties.
So conservation of energy doesn't get violated,
but nevertheless, you can have very interesting
new emergent physics.
And so, from the point of view
of scientific discovery,
I think the field is absolutely wide open
if you look at solid state physics, for example,
and device physics.
There's a tremendous amount of exciting new research to be done over the coming decades.

You alluded to this. I think maybe it's worth just double-clicking on for a moment, because there is this idea of compositionality and emergent properties as you scale up. And I wonder if you could just elaborate on that a little bit.

Yeah, it's good to have that sort of picture: a hierarchy of different levels
and the way they interact with each other.
And at the very deepest level, the level of electrons,
you might even more or less directly solve Schrodinger's equation
or do some very good approximation to that.
That quickly becomes infeasible.
And as you go up this hierarchy of, effectively, length scales, you have to make more and more approximations in order to be computationally efficient, or even computationally practical. But in a sense, the previous levels of the hierarchy can provide you with training data, and with validation and verification of what you're doing at the next level. And so the interplay between
these different hierarchies is also very, very, very interesting. So at the level of electrons,
they govern forces between atoms, which governs the dynamics of atoms. But once you look at
larger molecules, you perhaps can't simulate the behavior of every electron. You have to
make some approximations. And then for larger molecules still, you can't even track the
behavior of every atom. You need some sort of coarse-graining, and so on. So you have this hierarchy of different length scales, but every single one of those length scales is being transformed by deep learning, by our ability to learn from simulations, learn from those fundamental equations, in some cases learn also from experimental data, and build emulators, effectively, systems that can simulate that particular length scale and the
physical and biological properties, but do so in a way that's computationally very efficient. So
every layer of this hierarchy is currently being transformed, which is just amazingly exciting.
You alluded to some of the application domains
that stand to get disrupted by advancements in AI for science. What are a couple of the
applications that you're most excited about?

There are so many, it would be impossible to
list them, but let me give you a couple of domains. I mean, the first one is healthcare
and the ability to design new molecules, whether it's small-molecule drugs or more protein-based therapies. That whole field is rapidly shifting to a much more computational domain,
and that should accelerate our ability to develop new therapies, new drugs. The other class of
domains has more to do with materials, and a lot of the applications that we're interested
in relate to sustainability, things to do with capturing CO2 from the atmosphere,
creating, let's say, electricity from hydrogen,
creating hydrogen from electricity.
We need to do things both ways around.
Just storing heat as a form of energy storage.
Many, many applications relating to sustainability
to do with protecting our water supply, to do with
providing green energy, to do with storing and transporting energy. Many, many applications.
And at the core of all those advancements is deep learning, which is kind of where we started.
And so maybe as we close, we can kind of come back to your book on deep learning. I don't have the physical book yet, but there's a spot on my shelf next to your last book that's waiting for it.
But as we close here, maybe you can tell folks where to look for or how to get a copy of your new book.
Oh, sure. It's dead easy.
You go to bishopbook.com, and from there you'll see how to order a hardback copy if that's what you'd like, or there's a PDF-based ebook version, and there'll be a Kindle version, I believe. But there's also a free-to-use online version on bishopbook.com. It's available there, sort of PDF style and fully hyperlinked, free to use. And I hope people will read it and enjoy it and learn from it.

Thanks for a fascinating discussion, Chris.

Thanks, Ashley.