Sean Carroll's Mindscape: Science, Society, Philosophy, Culture, Arts, and Ideas - 336 | Anil Ananthaswamy on the Mathematics of Neural Nets and AI

Episode Date: November 24, 2025

Machine learning using neural networks has led to a remarkable leap forward in artificial intelligence, and the technological and social ramifications have been discussed at great length. To understand the origin and nature of this progress, it is useful to dig at least a little bit into the mathematical and algorithmic structures underlying these techniques. Anil Ananthaswamy takes up this challenge in his book Why Machines Learn: The Elegant Math Behind Modern AI. In this conversation we give a brief overview of some of the basic ideas, including the curse of dimensionality, backpropagation, transformer architectures, and more. Blog post with transcript: https://www.preposterousuniverse.com/podcast/2025/11/24/336-anil-ananthaswamy-on-the-mathematics-of-neural-nets-and-ai/ Support Mindscape on Patreon. Anil Ananthaswamy received a Master's degree in electrical engineering from the University of Washington, Seattle. He is currently a freelance science writer and feature editor for PNAS Front Matter. He was formerly the deputy news editor for New Scientist, a Knight Science Journalism Fellow at MIT, and journalist-in-residence at the Simons Institute for the Theory of Computing, University of California, Berkeley. He organizes an annual science journalism workshop at the National Centre for Biological Sciences at Bengaluru, India.

Transcript
Starting point is 00:00:52 Hello, everyone. Welcome to the Mindscape podcast. I'm your host, Sean Carroll. We all know that artificial intelligence in various forms has been exploding over the last
Starting point is 00:01:10 couple of years. After decades of effort and various summers and winters in AI research, we clearly have crossed some threshold where AI is being put into use in all sorts of different places. Now, we can debate the words artificial intelligence, right? Is it really intelligence? You know, that's the large language models, which are a particular approach to AI, which have really gotten all the attention lately. They're based on a broader idea called neural networks or deep learning, which has been all over the place for a long time. Something like Google Maps uses this kind of technology.
Starting point is 00:01:48 But now with the more human-like behavior of AI in the form of large language models, they've become much more ubiquitous, and there's been a wild range of reactions to what is going on. Some people saying that maybe they'll become super intelligent and take over the world and that's a danger. Other people just complaining that they can't download new software or an app without it being infused with AI that they don't really want. So I'm not myself very sure what the long-term impact of AI is going to be, at least in this sort of large language model incarnation. A couple years ago, when it first became a big thing, I said that it's probably somewhere between the impact of cell phones and electricity. And still, that's a lot of impact one way or the other, right?
Starting point is 00:02:39 Cell phones have had a lot of impact on our lives, but not really completely changing the way we live. I think that's a minimal expectation for the impact that AI will have for better or for worse, whereas the larger thing of, you know, the impact level of electricity is maybe the upper level of where it could possibly reach. None of us knows. Like, that's my personal guess. Various people have very strong opinions one way or the other. Many of them are more educated than mine.
Starting point is 00:03:06 But we all should be a little bit educated about what this technology is that is affecting us so much. So that's what we're here to do today. Today's guest is Anil Ananthaswamy, who is a science writer, actually, a former editor at New Scientist and now mostly a freelance science writer, but he got his start in engineering and computer engineering in particular. And when large language models came along, he became sort of re-fascinated by this aspect of technology and dived into it. So in addition to being a science writer, he's actually thinking at quite an advanced level
Starting point is 00:03:43 about AI and what it is. And the book that has resulted from this thinking is called Why Machines Learn: The Elegant Math Behind Modern AI. It's actually, it came out last year. This is my fault for waiting so long to talk about it. Anil's a person, a friend of mine that I've known for a long time. But it's a book chock full of big ideas and mathematics. It's exactly in the spirit of my own Biggest Ideas books, etc. That really doesn't just say, well, you know, AI is going to take over the world. It says, here is why you need to understand how to diagonalize a matrix to understand what AI is really telling us.
Starting point is 00:04:21 Now, there's a lot more in the book than we could possibly cover in a one-hour podcast, so we hit some highlights. But I think that for me, this conversation was extraordinarily helpful in clarifying which advance in AI technology came first, what the importance of it was, what it led to later. The actual math that is used is on the one hand fascinating. On the other hand, as we say in the podcast, it's mostly classical, in the sense that it's not like you're developing new math in order to be able to do this.
Starting point is 00:04:50 It's not like you're using the most advanced reaches of modern category theory or topology to figure things out. You're applying math in very, very, very large dimensional spaces that only computers can really handle. And that's enough to do something that is very, very different than anything that's been done before. So trying to understand it the best we can is a worthwhile endeavor. Let's go. Anil Ananthaswamy, welcome to the Mindscape Podcast. Thank you, Sean. It's my pleasure. So you've written a book about AI. That doesn't single you out. There's lots of people who've written books about AI. You've decided to write a book about the mathematics behind AI, which is an interesting choice. What is it that led you to that?
Starting point is 00:05:49 Oh, it began well before the AI craziness came about. I think sometime in 2016, 2017, I started noticing a whole bunch of stories that I was beginning to do as a journalist that had a machine learning component. Right. You know, and when I became a journalist, this was when I transitioned from being an engineer to being a journalist, I was writing mostly about physics and neuroscience. And when I would write about those subjects like particle physics, I was happy just doing as much research as I could and understanding it to the best of my ability and then writing about it. I never had any illusions about being able to do particle physics or do neuroscience. But when I
Starting point is 00:06:38 started encountering stories in machine learning, I think when I would talk to the researchers explaining their algorithms, their machine learning models, I think I certainly felt, hang on, this is something I could do. I could do this. Not in the way they were doing it, but I could certainly get my hands dirty because of the software background, because of my engineering background. And so what happened was I got a fellowship at MIT, the Knight Science Journalism Fellowship. Right. And as part of that fellowship, we had to do projects. And the project I took on was essentially teaching myself deep learning.
Starting point is 00:07:17 So the project question was, could a deep learning or a deep neural network do what Kepler did? So it was trying to build a neural network that would try and predict the future positions of planets, given the kind of data Kepler had access to. Oh, okay. And the short answer very quickly I found out, absolutely not. There's no way a neural network would do what Kepler did. Kepler had access to literally a few tens of positions of the orbits of Mars and Jupiter. Very, very little data that Tycho Brahe had collected.
Starting point is 00:07:57 And so then I ended up writing a simulation to generate loads and loads of data and learned how to, you know, train very simple deep neural networks to, you know, make predictions about planetary positions given, you know, years and years of data. And that was still just empirical stuff. I had, you know, gone back to CS 101, taught myself coding in Python. It had been maybe almost 20 years or so since I'd done any coding, so I had to sit in a class with teenagers teaching myself coding. That was fun. And I think at some point, what happened in that towards the end of that
Starting point is 00:08:38 fellowship was COVID happened. And we all got kind of, you know, locked up in our apartments. And my interest shifted to wanting to understand more about machine learning. It wasn't enough to just sit and do some coding. I felt like I needed to, you know, get under the skin of this thing. So I just started watching, you know, lectures from Cornell, from MIT. There was one professor at Cornell, Kilian Weinberger, whom I discovered. It was a 2018 class that he gave, which is online even today.
Starting point is 00:09:11 And it's just him giving his talks to his students. It is not produced for YouTube. It's amazing. There's nothing slick about it. It's just a professor and the students. Fantastic stuff. I got sucked into that. And kept learning more and more of the math.
Starting point is 00:09:29 And at some point I think the journalist and the storyteller in me woke up again saying, hang on, this math is actually quite lovely, that there are stories here to be told. But because I was so steeped in the math by then, I mean, again, I don't want to make it sound as if I did a lot of math. I was steeped in the math relative to the kind of math that I was steeped in before, which was not very much. So this was more for me. For actual machine learning practitioners, this is pretty simple math. But for me, it was a fair bit. And I think it was that desire to communicate the beauty of the math that I was encountering
Starting point is 00:10:09 and to tell the stories about it. So some combination of getting really stuck into the math. So I remember when I made my book proposal to my editor, and you know, you and I share an editor, full disclosure. We share. And I was pretty sure that he was going to say no, given how much math I was proposing to put into the book. But I made it very clear that I was going to put the math in there. I softened him up for you.
Starting point is 00:10:43 Yes, exactly. That's how I realized that later. Absolutely. So, yeah, that's how it came about. It wasn't a book that was written with a desire to ride the AI wave because this was proposed in 2020 well before all of the craziness came about. And I just wanted to share in the beauty of the math that I was encountering. And I need to dig into that Kepler story a little bit because I think it's secretly profound.
Starting point is 00:11:15 I mean, the idea, would you still think that it's true that a machine learning or some sort of deep learning algorithm given the data that Kepler had would not be able to come up with Kepler's laws? Somehow, it seems like that must depend on the space of possible theories that the LLM or whatever it is has access to, but I'm not quite sure what's going on there. Yeah. So, well, first of all, it's unlikely that we would use an LLM, a large language model, to solve this problem, because the data that I was using in this case is just orbital positions of planets. And I was teaching a neural network to learn about the patterns that exist in this data.
Starting point is 00:12:01 And then it was like a time series. Then it would just try to predict into the future where those planets were. And if you have enough data, you can train deep neural networks to learn the time series and make predictions going into the future. But what they don't do is they will not give you a symbolic form of the equation. The equation might be there in the system, but it has no way of spitting out some symbolic form of Kepler's laws. But the network is embodying in its weights something that is very similar to what Kepler would have figured out. The question is,
Starting point is 00:12:43 how do we extract that out of that network? And so that's one problem. The second problem is what you were alluding to is that the amount of data that Kepler had access to, there's no way. Today's neural networks are extremely sample inefficient. So they require too much data to do what they need to do. And so we're certainly missing something in our AI models in terms of being able to learn the way humans do. It's also true that Kepler came with a whole bunch of prior knowledge. He was a smart fellow. So he was obviously coming with a whole bunch of inbuilt knowledge about geometry, about calculus,
Starting point is 00:13:22 and all of these things. And so we have to take that into account also. So maybe large language models, you know, which have been trained on a whole bunch of human text, have that kind of prior knowledge built in. And it would be interesting to see how one would solve this problem using a large language model. I wasn't using large language models.
Starting point is 00:13:42 This was well before LLMs came on the scene, and I was just simply training these things called LSTMs, which are recurrent neural networks, which are good for time series. I went to a seminar recently. It was supposed to be an intro to the power of AI for physicists and cosmologists in particular. And the speaker started the seminar. He had a data set from LIGO, from a particular gravitational-wave black hole inspiral.
Starting point is 00:14:14 And he basically had cooked up ahead of time a one page long prompt. and he fed the data set and the prompt to, I think it was a large language model, certainly a deep learning thing. And basically the LLM wrote the paper. So it wrote a bunch of Python scripts to analyze the data. It made the figures. It wrote the paper. It embedded the figures. It found the references and everything.
Starting point is 00:14:41 And it was finished by the time the seminar had finished an hour later. And I'm open, of course, you'd want to check that it didn't make mistakes, right? You definitely would have to check it very, very carefully. But I'm open to the possibility that that kind of science is doable by deep learning methods, whereas Kepler's kind of science where you're literally coming up with a new conceptualization of what's going on seems to be much harder. Yes, and I think I would very much agree with that. There was someone I was listening to just recently about this exact issue,
Starting point is 00:15:19 and he was pointing out that, you know, if we had an LLM that had data about physics that happened until 1915, right, an LLM, and then could it then come up with, you know, Einstein's theory of relativity, you know, without having anything in the data about relativity, very, very, very unlikely. I mean, I wouldn't even say unlikely. I would say impossible at this point. Right. It's just a different kind of thing.
Starting point is 00:15:48 Yeah.
Starting point is 00:16:18 Well, okay, good. So we're here to learn about what it is actually able to do and why it's able to do it. Should we just start with going back to the beginning and the perceptron and the first neural networks? Yeah, absolutely. The perceptron, you know, the first neural networks started in late 1950s.
Starting point is 00:16:54 So the perceptron was designed by Frank Rosenblatt. He was a Cornell University psychologist. And the perceptron is essentially a single-layer artificial neural network. And the artificial neural network is simply a network of artificial neurons. An artificial neuron is very simply a computational unit. It takes in a bunch of inputs and does some sort of weighted sum of those inputs, adds a bias term, and then if that weighted sum plus bias exceeds some threshold, it will produce a one, otherwise it produces a minus one.
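The unit described here can be sketched in a few lines of Python. This is only a minimal illustration of the weighted-sum-plus-bias-then-threshold idea; the function name `neuron`, the default threshold of zero, and the example numbers are all assumptions made for the sketch, not anything taken from the conversation or the book.

```python
import numpy as np

def neuron(inputs, weights, bias, threshold=0.0):
    """Threshold unit: +1 if weighted sum plus bias exceeds the threshold, else -1."""
    activation = np.dot(weights, inputs) + bias
    return 1 if activation > threshold else -1

# Hypothetical weights and bias, chosen only for illustration.
w = np.array([0.5, -0.25, 0.5])
b = -0.75

print(neuron(np.array([1.0, 0.0, 1.0]), w, b))  # 0.5 + 0.5 - 0.75 = 0.25 > 0, so 1
print(neuron(np.array([0.0, 1.0, 0.0]), w, b))  # -0.25 - 0.75 = -1.0, so -1
```

Learning, in this picture, amounts to adjusting `weights` and `bias`; the unit itself never changes.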
Starting point is 00:17:33 That was, you know, in essence, Rosenblatt's artificial neuron. And he showed how you could use a series of such artificial neurons sort of lined up vertically, so one layer of them, to do some sort of linear classification. So if you had, for instance, images of one kind of digit, let's say the digit nine, and images of another kind of digit, let's say the digit, I don't know, something that is very different, four,
Starting point is 00:18:04 and if these were, let's say, 20 pixels by 20 pixels, black and white images, then each image can be effectively turned into 400 pixels. And if you were to map each pixel along one axis, then in 400 dimensional space, each of these images becomes a point. And so all images of the digit 9 will be in one location of this 400 dimensional space and the digit fours will be somewhere else. And as long as those things are pretty distinct and you can draw some sort of hyperplane separating out those two regions, the perceptron algorithm was guaranteed to find one such plane.
Starting point is 00:18:53 As long as the data was linearly separable in any kind of dimensional space, the perceptron would find it. And this was a big deal. This was a very big deal. In fact, you know, when we were talking earlier of the math that inspired me to write the book, it was actually the perceptron convergence proof, which was, you know, a few years after he came up with the algorithm. So he first comes up with the algorithm empirically, how to do this. And then people get into the act of trying to figure out mathematically, you know, the properties of this algorithm.
Starting point is 00:19:27 And there was something called the perceptron convergence proof, which basically said that if the data is linearly separable, then the algorithm will find it in finite time. And this was a huge statement to make in computer science terms back in the 1950s, that an algorithm is guaranteed to work. And you put a lower bound in saying this will definitely work. And it's a very, very simple proof that uses just basically linear algebra and nothing more. And if you look at the proof, it's so lovely that I think, you know, I did put it in the coda of one of my chapters.
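To make the discussion concrete, here is a hedged sketch of the classic perceptron update rule on toy data. The two synthetic clusters in 400-dimensional space merely stand in for "nines" and "fours"; the data, the function names, and the epoch cap are assumptions for illustration. For reference, the standard textbook form of the convergence result bounds the number of updates by (R/γ)², where R is the largest input norm and γ is the margin of some separating hyperplane.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=1000):
    """Perceptron rule: on every mistake, nudge the hyperplane toward that point."""
    w = np.zeros(X.shape[1])
    b = 0.0
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # wrong side of (or on) the hyperplane
                w += yi * xi
                b += yi
                mistakes += 1
        updates += mistakes
        if mistakes == 0:  # a full pass with no mistakes: a separating hyperplane
            return w, b, updates
    raise RuntimeError("no separating hyperplane found within max_epochs")

def predict(w, b, x):
    """Classify a point by which side of the hyperplane it falls on."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Hypothetical data: two well-separated clusters of 400-dimensional points.
rng = np.random.default_rng(0)
center = rng.normal(size=400)
nines = center + 0.1 * rng.normal(size=(20, 400))   # label +1
fours = -center + 0.1 * rng.normal(size=(20, 400))  # label -1
X = np.vstack([nines, fours])
y = np.array([1] * 20 + [-1] * 20)

w, b, updates = train_perceptron(X, y)
print("updates needed:", updates)
print("all training points correct:", all(predict(w, b, xi) == yi for xi, yi in zip(X, y)))
```

The loop stops at the first hyperplane that makes no mistakes, which is why the result is a separating plane but not necessarily the best one.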
Starting point is 00:20:09 And beginning of that coda, I tell the reader that you really don't have to read this to read the rest of the book. But I should also tell you that if it weren't for this proof, I would not have written the book. It's a way of teasing them into reading that book. Good. I would have, sorry, I just wanted to say that this trick is actually due to a British novelist called Somerset Maugham. He had a novel called The Razor's Edge. And it's a very interesting book.
Starting point is 00:20:42 And there's a chapter somewhere in the middle where he says to the reader, he addresses the reader directly saying, Dear reader, you don't have to read this chapter because it won't change the rest of the book. But I should tell you that if it weren't for this chapter, I wouldn't have written the book. You steal from the best. That's what we all do as we as writers.
Starting point is 00:21:01 I want to dwell on this, not the proof, but the linear separability idea, because it is kind of deep but also hard to visualize. So you're saying that I have a 20 by 20 grid of pixels, and I can think of that as a single point in a 400 dimensional space, right? which, you know, once you're, once you've done a certain amount of math in your life, that's obvious. Before you've done that amount of math, that's almost impossible to quite wrap your head around. But then the point is that all of the nines kind of cluster in a group, hopefully, in that 400-dimensional space, and all of the fours cluster somewhere else. And so if a new data point comes in, that might not be any of the existing data points, but you can say it's closer to one cluster than the other one, right? Right. So in the case of the perceptron, it will find a hyperplane. And so this will be a 399 dimensional plane that will separate out the two classes of data. And it doesn't guarantee that it
Starting point is 00:22:11 will find an optimal hyperplane. It will just find something. And because if there is a gap between the two clusters of data, in principle, there's an infinity of hyperplanes that will pass through that. So it'll find one. The first one that it finds, it stops. And it might not be an optimal one, but then what happens is then when you give it a new digit saying, okay, tell me whether this is a nine or a four, it actually doesn't matter whether it's a nine or a four. All it's going to do is say, is it to this side of the hyperplane or is it to that side of the hyperplane? And, you know, it'll classify it as such based on, you know, which side of the hyperplane
Starting point is 00:22:48 the new digit falls on. So, you know, you already identified a problem with the algorithm that you could have a data that is, you know, it's much easier to think of this now. You know, the same thing applies to images of cats and dogs. You could have, you know, a thousand by thousand image of a cat and of a dog. So this would now be dots in million dimensional space, right? And the hyperplane will separate the two sets of images, the cats on one side, dogs on the other. Now you bring in an image of a horse. The classifier has no idea. All it does is say which side of the hyperplane this image falls on, and it's going to call the horse either a cat or a dog. But again, going back to the late 1950s,
Starting point is 00:23:35 early 1960s, this was a very big deal because, you know, you are basically saying we can recognize images, we can classify images. And classification is the first step towards recognition. And I presume that it's not that hard to let the computer know that we're going to introduce a new category called horses, and then we're going to separate that space into three subsets. Yeah. So, you know, then you can have classifiers that are multi. They're, you know, they're classifying into multiple categories. Yes, very much so.
Starting point is 00:24:07 But the basics began with that first linear classifier. And at the same time, there was another researcher who doesn't actually get talked about as much, but whose work was just as seminal. and this was Bernie Widrow, who was at Stanford. And he was somebody who had been working on designing digital filters, adaptive digital filters. So filters that learn about the characteristics of the signal that they're processing and learn how to separate noise from signal adaptively. So they're learning on the fly.
Starting point is 00:24:44 And he was very much steeped in that and then realized that the techniques that he was using to build his adaptive digital filters were actually exactly the same techniques he needed to build an artificial neuron that learned, a linear classifier, exactly like what Rosenblatt was doing, but Widrow's approach was very different. And in a very fundamental sense, the algorithm that Widrow came up with to build his linear classifier is actually the true precursor to today's backpropagation algorithm, which is used to train artificial neural networks. And there's actually, you know, talking about stories, there's an amazing story about how the
Starting point is 00:25:27 Widrow, it's called the Widrow-Hoff least mean squares algorithm, came about. And Widrow was an assistant professor at Stanford in the late 1950s, and there's a knock on his door. A young student comes in wanting to see if he can do a PhD with Widrow. And so Bernie Widrow starts scribbling some stuff on the blackboard, trying to tell him about adaptive filters. And in the course of two hours of discussing what a PhD project would be like, they end up designing what's today called the least mean squares algorithm.
Starting point is 00:25:59 They realized that they have designed an algorithm to train a simple artificial neuron. And then the duo walk across the room, there's an analog computer out there that Lockheed has donated to Stanford. And Hoff, who's a kid looking for a PhD thesis project, he goes and programs the analog computer to simulate the algorithm, shows that it works. And then now it's Friday evening by then at Stanford and the supply rooms are closed and they want to build this thing in hardware. So they walk across to Zach's electronics, buy all the stuff that they want, go over to Ted Hoff's apartment and over the course of the weekend, build the world's first hardware artificial neuron. Monday morning they have it working, right? And they have it. And they
Starting point is 00:26:49 have a very, very crude, so you're, you know, most people who are following AI by now will know of this algorithm called back propagation and this idea of doing gradient descent and stochastic gradient descent as, you know, these are algorithms that are used for optimizing the parameters of your model. And what Widrow and Hoff had done was an extremely noisy version of stochastic gradient descent. They had used, they had come up with an algebraic formulation instead of using any calculus or anything, they just came up with a very straightforward sort of algebraic formulation that they could then implement in hardware.
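A rough sketch of the least mean squares update follows, in its usual textbook form: after each sample, the weights move by a small multiple of the error times the input, which is a one-sample (and therefore noisy) estimate of the gradient of the squared error. The learning rate, the number of passes, and the three-weight linear system being identified are hypothetical choices for the sketch, not details of what Widrow and Hoff built.

```python
import numpy as np

def lms_train(X, d, lr=0.1, epochs=100):
    """Widrow-Hoff style LMS: adjust weights after every single sample."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            error = target - np.dot(w, x)  # how far off the linear output is
            w += lr * error * x            # purely algebraic update, no explicit derivative
    return w

# Hypothetical task: recover the weights of an unknown linear system
# from input/output pairs, as an adaptive filter would.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
X = rng.uniform(-1.0, 1.0, size=(50, 3))
d = X @ w_true

w_hat = lms_train(X, d)
print(np.round(w_hat, 4))
```

Because each step uses one sample rather than the average over all the data, the path the weights take is jittery, but for a small enough learning rate it still settles on weights that reproduce the targets.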
Starting point is 00:27:27 So that Monday morning they had an artificial neuron on the desk working. I want to make sure that all prospective graduate students out there know this is not usually what happens. This is a rare story. And I think Ted Hoff's story is pretty amazing because once he finished his PhD, he gets an offer from a startup in the Bay Area. And he comes to Bernie Widrow to ask,
Starting point is 00:27:53 should he join the startup? And Widrow tells him, yes, you should. And the startup turns out to be Intel. Good. Smart. And Ted Hoff goes on to become one of the first designers of the world's first microprocessor. Wow, okay, very, very good.
Starting point is 00:28:09 I want to get into the actual neuron just a little bit. This idea of a threshold is apparently very, very important here. So your little artificial neuron is taking in some signals, and rather than just sort of adding them together, it says, I'm going to hold off until it crosses some threshold, and then I'm going to fire. Is that the basic idea? Yes.
Starting point is 00:28:31 And it's inspired from our understanding of biological neurons, right? That's what biological neurons are doing. In a very simple way of looking at what biological neurons do, you have all these signals coming in through the dendrites of the neuron into the cell body. And the cell body is kind of accumulating the signals that are coming in. And when it crosses some threshold, both in terms of the strength of the signal and the timing of the signals, it then fires a signal on its own axon, which then travels to the dendrites of other neurons. And this model was already known.
Starting point is 00:29:07 And so the simple kind of artificial neuron is a very basic approximation of this biological neural mechanism. And then the real fun- The thresholding is [unclear], yes. The real fun comes in when we start adding them together in layers. Yes. So it's, and this was actually not possible in the 1950s and 1960s, because what they had was just a single layer of neurons, which means that the inputs are coming in from one side into the neurons. The neurons do this weighted sum and then add a bias term, and then if that weighted sum plus
Starting point is 00:29:48 bias exceeds some threshold, they fire. I mean, they're always producing an output. If the threshold exceeds a certain amount, then the output becomes one, otherwise it's minus one, right? Or zero or one. Choose your output. But that's it. So on the output side, you have either one or minus one, and on the input side, you have these inputs coming in.
Starting point is 00:30:12 The neuron does this computation. The moment you add another layer of neurons so that the outputs of the first layer of neurons go in as inputs to the second layer, then the training algorithms that Rosenblatt had and Bernie Widrow had, you know, the Widrow-Hoff LMS algorithm, the least mean squares algorithm, they didn't work. And there's a very interesting story as to why this was a very big deal in the 1960s, because we had people like Marvin Minsky and Seymour Papert who wrote this extraordinary book that was published in 1969. It was called Perceptrons in honor of Rosenblatt. And it was a mathematical analysis of these kinds of machines that learn. And in it it had the perceptron convergence proof and a whole bunch of other things.
Starting point is 00:31:03 But they also had a very important proof that showed that a single-layer neural network of the kind that Rosenblatt had and of the kind that Bernie Widrow had could not solve something called the XOR problem. So if you imagine four points on the x-y plane, so at the origin, at the point (0, 0), you have a circle.
Starting point is 00:31:31 At the point (1, 0) on the x-axis, you have a triangle. Then at the coordinate (1, 1), you have another circle. And then on the y-axis, at coordinate (0, 1), you have a triangle. So you have two triangles
Starting point is 00:31:45 and two circles, but they're on the diagonals of the square. Good. Right? You cannot draw a straight line to separate the circles from the triangles. It's true. And so this was the XOR problem.
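The four-point layout described above is small enough to check directly. The sketch below does two things under stated assumptions: it searches a coarse grid of weights and finds no single threshold unit that gets all four points right (an illustration only, not a substitute for a proof), and it shows a two-layer network with hand-picked weights that does. The 0/1 encoding of circles and triangles, the grid range, and the hidden-unit weights are all choices made for this example; nothing here is trained.

```python
import itertools

# Circles are labeled 0 and triangles 1 (an arbitrary, assumed encoding).
POINTS = {(0, 0): 0, (1, 0): 1, (1, 1): 0, (0, 1): 1}

def step(z):
    """Threshold activation: fire (1) only if the input is positive."""
    return 1 if z > 0 else 0

def single_unit(x1, x2, w1, w2, b):
    """One threshold neuron: a single straight line through the plane."""
    return step(w1 * x1 + w2 * x2 + b)

def two_layer(x1, x2):
    """Two hidden threshold units feeding one output unit, weights set by hand."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # "or, but not and"

# Coarse grid of candidate weights and biases from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
single_unit_solves = any(
    all(single_unit(x1, x2, w1, w2, b) == label for (x1, x2), label in POINTS.items())
    for w1, w2, b in itertools.product(grid, repeat=3)
)
two_layer_solves = all(two_layer(x1, x2) == label for (x1, x2), label in POINTS.items())

print(single_unit_solves, two_layer_solves)  # False True
```

The hidden layer re-describes each point in terms of "or" and "and", and in that new description the circles and triangles can be split by a single line.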
Starting point is 00:32:00 And Minsky and Papert had a very elegant proof saying that single-layer neural networks will never solve this problem. And this was a huge knock because this was such a simple problem that anyone looking at it can solve it. But, you know, this incredible thing that people had been going on and on about couldn't solve it. And then what they did, which was kind of underhanded, was, you know, they also insinuated without giving any mathematical proof that even multilayer neural networks will not solve this problem. Oh, yeah. And this effectively killed, you know, well, the legend is that this effectively killed research into neural networks and led to the first AI winter. So sometime in the 1970s, research into neural networks just died because people didn't think that these things were good for anything if they couldn't solve something as simple as an XOR problem,
Starting point is 00:32:51 except that they had no proof that multilayer neural networks couldn't do it. And no one had yet come up with an algorithm to train multilayer neural networks. So the only people who kept the faith were people like Geoff Hinton, who persisted. Hinton, I remember talking to him, and he was completely convinced that, you know, multilayer neural networks will solve the problem. He thought that Minsky and Papert had just kind of pulled a fast one on everyone. So basically, the argument there was Minsky and Papert were very interested in another form of AI called symbolic AI. Sure. And they wanted research funding to go into that.
Starting point is 00:33:39 And this approach of connectionism and neural networks was against what they were looking at. I don't know how true that is, whether there was any underhandedness here, but certainly the first AI winter was in part influenced by Minsky and Papert's proof. So math played a big role. Yeah, no, I mean, the human beings, they're endlessly fascinating. But at some point, one way or the other, the multi-layer neural networks did start gaining traction. Yes, so for that we would have to wait until the 1980s. So I think the first big change that happened in the early 1980s was John Hopfield coming up with Hopfield networks, which were a different kind of neural network.
Starting point is 00:34:24 They were essentially neurons that were fully interconnected, which meant that the output of a neuron would go as input to every other neuron in the network except itself. So it couldn't influence itself, but its output could influence every other neuron. And so these were fully connected networks, and Hopfield networks were essentially networks that were used to store memories. And they were very deeply inspired by condensed matter physics, the Ising model of sort of ferromagnetic materials and spin glasses. And what Hopfield was after was basically, he was formerly a condensed matter physicist who was looking to do something in computational neuroscience or biology,
Starting point is 00:35:15 and he was looking for a problem to solve, and he figured out that he could solve the problem of how the brain stores and retrieves what he calls associative memories by designing this kind of neural network where, given this fully connected neural network, he designed an equation which allowed you to calculate the, quote-unquote, the energy of the network. This was modeled after the Hamiltonian of a material.
Starting point is 00:35:45 And here the energy of the system would be at a minimum when you stored a memory, and any time you corrupted the network, it would enter into a higher energy state. And because the neurons were all connected to each other, the moment you corrupted one of them, in terms of changing the output of a neuron, you would end up setting up a dynamic that made the network just traverse the energy landscape and come all the way down to a minimum. And when it came and settled into a minimum, then you would just read off the outputs of the neurons, and they would be exactly the memory that you had stored previously.
Starting point is 00:36:21 So the way you set the coefficients of your network, which in this case are the strengths of the connections between the neurons, depended on the memory you wanted to store. So given that you wanted to store, let's say it's a 10 by 10 image, black and white image that you want to store. And so again, 10 by 10 is 100 pixels and you want 100 neurons in your network so that each neuron is responsible
Starting point is 00:36:49 for one of those pixel values. So you take those 100 pixel values that you want to store and Hopfield had an equation that told you that given that I want to store these 100 pixels, what should be the strength of the connections between the neurons? And you would set
Starting point is 00:37:07 the strength of the connections of the neurons such that when the memory is stored, the network is at an energy minimum, so it is stable. It won't do anything. But the moment you perturb it, the moment you add some noise, let's say you want to corrupt your image, and corrupting an image simply means that some of the outputs are now changed. Let's say there were zeros and ones in certain neurons. You just flipped them. And now the dynamics of the network take over, the network finds itself at a higher position in the energy landscape. And because the neurons are influencing each other, they will just start flipping,
Starting point is 00:37:41 just like magnetic moments in a ferromagnetic material, right? And then the whole system will traverse its way down the energy landscape and then settle back into its minimum, which by definition is a stable state, the way the network is designed. And when it settles back into a stable state, you just read off the outputs of the neurons and you've got your image back.
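The store-corrupt-recall cycle described here can be sketched in a few lines of Python. This is a minimal illustration, not Hopfield's own formulation as given in the episode, which does not state the equations. It assumes the common convention of neuron states +1 and -1 (rather than 0 and 1), a Hebbian outer-product rule for the connection strengths, and a single stored pattern. All names are hypothetical.

```python
import random

def train(pattern):
    """Hebbian rule for one stored pattern: w[i][j] = s_i * s_j, with no self-connection."""
    n = len(pattern)
    return [[0 if i == j else pattern[i] * pattern[j] for j in range(n)] for i in range(n)]

def energy(w, s):
    """Hopfield-style energy: lower means more stable."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def recall(w, s, sweeps=5):
    """Asynchronous updates: each neuron aligns with the weighted input from the others."""
    s = list(s)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if field >= 0 else -1
    return s

rng = random.Random(0)
memory = [rng.choice([-1, 1]) for _ in range(100)]   # a 10x10 "image", flattened
w = train(memory)

corrupted = list(memory)
for i in rng.sample(range(100), 15):                 # flip 15 of the 100 pixels
    corrupted[i] = -corrupted[i]

restored = recall(w, corrupted)
print(energy(w, corrupted) > energy(w, memory))      # True: corruption raises the energy
print(restored == memory)                            # True: the dynamics settle back
```

With only one stored pattern the recovery is guaranteed as long as fewer than half the pixels are flipped. Storing several patterns in the same weights is where capacity limits start to matter.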
Starting point is 00:38:52 And obviously they're stealing ideas from physics, but we made up for it by giving them the Nobel Prize. So I think that that's a fair trade. So this was 1982, 1984. That's when Hopfield came up with this Hopfield network. But we still didn't know how to train multilayer neural networks.
Starting point is 00:39:20 What was interesting was that the ideas that would eventually influence people like Hinton were already in the air. In fact, going back to the 1960s, people in rocketry had ways of sort of optimizing their models. Like when you think about what happens when you launch a rocket, which is supposed to reach some destination in deep space, your rocket is actually going in space. You have your model of the rocket which is controlling, which is giving you the control system. And at every time step, you have to actually look at where the rocket is and then modify the parameters of your model because your model has to adapt to the fact that the rocket may not be doing exactly what you think it should be doing. So your model is being updated, your model parameters are being updated on the fly, and that in a sense had the beginnings of the back propagation algorithm already built in, except it was not formulated as such. And there were others, you know, people at MIT, there was a grad student at MIT who had done some work in economics, who was also playing around with similar ideas, and many others who had done bits and pieces. But it was Hinton, Rumelhart, and Williams, who in 1986, put out a paper in Nature.
Starting point is 00:40:40 It's just an amazing three-and-a-half-page paper that is essentially the back-propagation algorithm, which shows you how you can train a multilayer neural network using what turns out to be just the chain rule of calculus. It's an extraordinarily simple idea in retrospect. Maybe one thing I should mention here: remember we talked about the fact that the early neurons had a thresholding function. Right. But that thresholding function is not differentiable. So it's, you know, it just transitions steeply at the point when the weighted sum exceeds a certain threshold and then outputs one, otherwise it's zero. So you can't differentiate that. And this turned out to be very, very key to not being able to train the network when you added more layers, because the chain rule requires that the entire computational graph from the output all the way back to the input has to be differentiable. Every computation that you do has to be differentiable so that you can
Starting point is 00:41:45 use the chain rule to kind of back propagate your error from the output side all the way to the input side. But these discontinuities that were there in the way the artificial neurons were designed with a step threshold function kind of ruined that. So one of the things that Hinton and others did was change that step function into a sigmoid. So certainly the sigmoid was differentiable. And so they essentially ensured that the computation from the input all the way to the output, regardless of how many steps there were, were all differentiable.
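To make the chain-rule idea concrete, here is a minimal sketch in Python of back propagation through a tiny network with sigmoid neurons. It is an illustration under assumptions, not the 1986 paper's notation: the 2-2-1 layout, the squared-error loss, the starting weights, and every function name are hypothetical. The last few lines compare one back-propagated gradient against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """2 inputs -> 2 sigmoid hidden units -> 1 sigmoid output."""
    w1, b1, w2, b2 = params
    h = [sigmoid(w1[k][0] * x[0] + w1[k][1] * x[1] + b1[k]) for k in range(2)]
    y = sigmoid(w2[0] * h[0] + w2[1] * h[1] + b2)
    return h, y

def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2

def backprop(params, x, target):
    """Chain rule, applied from the output back toward the input."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    d_out = (y - target) * y * (1 - y)               # dL/dz at the output neuron
    g_w2 = [d_out * h[k] for k in range(2)]
    g_b2 = d_out
    # Push the error back one layer; h * (1 - h) is the sigmoid's derivative.
    d_hid = [d_out * w2[k] * h[k] * (1 - h[k]) for k in range(2)]
    g_w1 = [[d_hid[k] * x[0], d_hid[k] * x[1]] for k in range(2)]
    g_b1 = d_hid
    return g_w1, g_b1, g_w2, g_b2

params = ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2], [0.7, -0.6], 0.05)
x, target = (1.0, 0.0), 1.0

# Check one back-propagated gradient against a finite-difference estimate.
eps = 1e-6
bumped = ([[0.5 + eps, -0.4], [0.3, 0.8]], [0.1, -0.2], [0.7, -0.6], 0.05)
numeric = (loss(bumped, x, target) - loss(params, x, target)) / eps
analytic = backprop(params, x, target)[0][0][0]
print(abs(numeric - analytic) < 1e-5)
```

If the sigmoid were replaced by a hard step, the derivative factors `y * (1 - y)` and `h * (1 - h)` would be zero everywhere except at the jump, and no error signal would reach the earlier layer.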
Starting point is 00:42:23 So the sigmoid, remind us what that is. That's like a smoother version of the kind of step function that turns the neuron on or off? Yes. So the initial neurons had a very sharp transition from, let's say, zero to one. Okay. So the slope of the function at the point of the transition is infinite, right? And the sigmoid is essentially a smoothing of that function. Okay.
Starting point is 00:42:47 So the step is gone. So it's a very smooth transition from zero to one. And let me make sure I have the right mental picture here. When Hopfield has these networks where every neuron talks to every other neuron, in what sense is that multi-layer? Like, what happened to the layers? So, yeah, good question. So that was not a multi-layered network in the way we think of it today.
Starting point is 00:43:12 That was simply a fully connected neural network. I brought that up only to say that the research in neural networks had a resurgence with Hopfield networks, but Hopfield networks are not the kind of networks we use today. Good. They're what are called recurrent neural networks because, you know, the outputs of a neuron can feed back into, then, you know, the rest of the network. And there's a lot of feedback going on, which is not the way today's neural networks work. Today's neural networks are what are just called feed-forward networks.
Starting point is 00:43:44 The computation proceeds from one end, from the input side, and then it goes layer by layer to the output side, and the outputs don't feed back to the input side. Okay, I guess it's interesting. There is this idea of back propagation. Is that part of the feed-forward network or is that excluded? So back propagation is the way you update the weights of the network when you want to train it. So the computation, when the neural network is doing a computation,
Starting point is 00:44:18 when you give it an input and it has to produce an output, That is the feed-forward process. So the input comes in from one side, and each layer does some computation, feeds the result of its computation to the next layer, and the next layer does its computation. And then finally, this exits on the output side as the output that you want. And one way to think about it is a vector of information comes in on the input side, and the vector just propagates, you know, changes size as it goes through the network
Starting point is 00:44:50 because the layers can have different sizes, and then on the output side, you'll get back another vector. And one vector comes in, gets transformed by each layer into a different vector, and then finally on the output side, you have another vector. And that output could be a scalar. It could just be a number, zero or one, saying this is a cat or a dog, or it could be a vector which has more information than just a scalar. But that's the computation that's happening in the feed-forward pass.
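The picture of a vector changing size as it moves through the layers can be sketched as follows. This is an illustrative Python example only: the layer sizes (100, 32, 8, 1), the random weights, and the function names are assumptions, not anything specified in the conversation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def feed_forward(layers, vector):
    """Pass the vector through each layer in turn; its length changes along the way."""
    sizes = [len(vector)]
    for weights, biases in layers:
        vector = [sigmoid(sum(w * v for w, v in zip(row, vector)) + b)
                  for row, b in zip(weights, biases)]
        sizes.append(len(vector))
    return vector, sizes

rng = random.Random(0)
layer_sizes = [100, 32, 8, 1]          # 100 pixels in, a single cat-vs-dog score out
layers = [make_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]

image = [rng.random() for _ in range(100)]
output, sizes = feed_forward(layers, image)
print(sizes)      # [100, 32, 8, 1]
print(output)     # a list holding one number between 0 and 1
```

Because the weights are random and untrained, the output score here carries no meaning yet. Only the shapes are the point.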
Starting point is 00:45:20 But when you're training the network, let's say you're training a network to distinguish images of cats from images of dogs. And let's go back to our example of 10 by 10 images. So that's 100 pixels. So you turn each image into a 100 dimensional vector, right? 100 pixel values. And they are fed into the network on the input side. And then on the output side, after it has gone through a bunch of transformations, you get back either a zero or a one, zero for a cat and one for a dog.
Starting point is 00:45:57 Now, in the beginning, when you initialize the network randomly, all the weights of the network, which are the strengths of the interconnections between the neurons, they are initialized randomly. And so when you feed in some image on the input side, you're just going to get the wrong answer on the output side. Sure. But you know what the right answer should be because you supply the training data, you have human annotated data saying, these are cats and these are dogs.
Starting point is 00:46:25 So on the output side, you know that it should be outputting either a one or a zero, and it maybe does the wrong thing. So you calculate the error now, and the amount of error that it makes is a function of all the parameters of the model, all the weights of the network, or all the strengths of the connections between the neurons. So you now have something called a loss function where the loss is formulated in terms of the parameters of the model. And you can imagine this as some sort of very, very high dimensional surface. And in the beginning, when the network incurs a loss, you will end up on a location in that loss landscape which is pretty high up.
Starting point is 00:47:08 You incurred a large loss. So you now use something like gradient descent to try and work your way down to the point in the landscape where the loss is at a minimum. And that's the part where you have to now figure out how much to update each weight so that your network is slightly better at the same task than it was the first go-around. So let's say you took one image and it made a certain amount of loss. You took that loss, figured out how much you need to tweak all of the parameters so that the next time you feed the same image back, the loss will be a little less.
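The loop of computing a loss, working out how to nudge each parameter, and repeating can be sketched with a single sigmoid neuron. This is a toy Python illustration: the four data points, the cross-entropy loss, the learning rate, and the starting values are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy training set: one feature per example, label 0 ("cat") or 1 ("dog").
DATA = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def loss(w, b):
    """Average cross-entropy of a single sigmoid neuron over the whole data set."""
    total = 0.0
    for x, y in DATA:
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(DATA)

def gradient(w, b):
    """Partial derivatives of the loss with respect to w and b."""
    gw = sum((sigmoid(w * x + b) - y) * x for x, y in DATA) / len(DATA)
    gb = sum((sigmoid(w * x + b) - y) for x, y in DATA) / len(DATA)
    return gw, gb

w, b = -1.0, 0.5              # a deliberately bad starting point
learning_rate = 0.5
history = [loss(w, b)]
for _ in range(50):
    gw, gb = gradient(w, b)
    w -= learning_rate * gw   # step downhill along each parameter
    b -= learning_rate * gb
    history.append(loss(w, b))

print(history[0], history[-1])   # the loss at the end is lower than at the start
```

With two parameters the loss surface is a simple bowl. The same update rule is what gets applied when the parameters number in the billions.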
Starting point is 00:47:53 And if you keep doing that, eventually you'll come down to a point where the loss is at a minimum. But the trick here is that you have to do this for all images, because if we just did that for one image, then you'll be off for all the other images. So you have to do it in parallel for everything simultaneously. So for your entire data set, you want to reach the bottom of the loss landscape. And that part where you are trying to update the weights of the network, that's the back propagation part. It's called back propagation because the loss is calculated on the output side and now you have to propagate that loss all the way back layer by layer so that you can
Starting point is 00:48:29 update the weights of each layer as you go back towards the input side. That's very helpful. Yeah. Because that's where the chain rule comes in, because you calculate the gradient on the output side and then you have to chain all the, you know, differentiable computations together so that you can calculate the gradient for the entire network. It makes sense because there's a difference between training the network where obviously you're going to have to go backwards and fix all the parameters early in the chain of layers versus just doing the calculation once you've trained it,
Starting point is 00:49:04 which is a purely feed-forward mechanism. Yes. And so for people who are, for instance, using something like ChatGPT today, when we use it, we're just using the feed-forward part of it. But when you're training it, you have to keep doing this back and forth. And say more about the idea of gradient descent. It seems to me, maybe I'm not sophisticated enough here, it seems to be like it's a very high dimensional version of something that Isaac Newton
Starting point is 00:49:32 would have told us about many years ago. Well, that is the amazing part about this field, is that a lot of these ideas go back all the way to, you know, the invention of calculus and linear algebra. This is all 17th, 18th century math. Very simple stuff in some fundamental sense. And yet, of course, now when we do gradient descent, we are doing it in extraordinarily high-dimensional spaces.
Starting point is 00:49:56 When you think about modern deep neural networks, like the large language models that we have, you know, GPT-4 or, you know, OpenAI's o1 or o3 and Claude, we don't know exactly what number of parameters they have, but they're close to a trillion, right? So the loss, the error that the network is making, has to be formulated as a function of these trillion parameters. Right.
Starting point is 00:50:26 So your loss landscape is in a trillion dimensions. Right? And the other thing that I should have mentioned earlier when we were talking of these artificial neurons is that the early artificial neurons were linear neurons. And, you know, so the subsequent sort of artificial neurons have a non-linearity built into them. The addition of a sigmoid essentially makes, you know, makes the neuron nonlinear. So now your loss function, not only is it a function of a trillion parameters,
Starting point is 00:51:07 but there's a huge amount of non-linearity baked into the whole system. So it is not a convex function. Even if it is in a trillion dimensions, you can imagine a convex function, some trillion dimensional bowl-shaped function that you can just slide down all the way and you're guaranteed to find the global minimum. That's not the case with these modern neural networks.
Starting point is 00:51:31 These loss landscapes are extraordinarily complex. We don't even know for sure that they have a global minimum. But they may have lots and lots of very good local minima. And so the trick is to somehow use gradient descent to find one of these local minima that is satisfactory. And not a trivial task. Yeah.
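A one-dimensional toy can show why a non-convex landscape makes the starting point matter. The function below is an assumption chosen for illustration (it has nothing to do with a real network's loss); it has two valleys of different depth, and plain gradient descent ends up in whichever one it starts near.

```python
def f(x):
    """A non-convex toy loss with two valleys of different depth."""
    return x ** 4 - 3 * x ** 2 + x

def df(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=1000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x

left = descend(-2.0)    # settles in the deeper valley, at negative x
right = descend(2.0)    # settles in the shallower valley, at positive x
print(left, f(left))
print(right, f(right))
```

Both runs stop where the gradient vanishes, but only the left one reaches the lower value. In a real network nobody can plot the landscape to check which kind of valley was found.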
Starting point is 00:51:58 But the math is the same, regardless of whether it's a trillion parameters or four parameters. That's the amazing part. The back propagation algorithm, what Hinton did in 1986 in their paper, Rumelhart, Hinton, and Williams, the same stuff holds true today. And they had, I don't know,
Starting point is 00:52:15 I forget how many parameters they had, but it was tens, 10, 20, 30 or something like that. And today we're talking about a trillion, but the algorithm is the same. And of course, there are all sorts of innovations about how you do the gradient descent.
Starting point is 00:52:32 You can do stochastic gradient descent where you, you know, sort of take a small batch of the data that you're training on at any given time. And so the loss landscape that you calculate is always some approximation of the true loss landscape. And so when you do the descent, you are sort of stochastic in terms of whether you're actually going in the right direction or not. But it turns out stochastic gradient descent works brilliantly. And so there are these kinds of innovations that have happened since the 1986 paper. There are also other tricks about how fast you do the gradient descent. Do you use momentum?
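Mini-batches and momentum can be sketched together in a short Python example. Everything specific here is an assumption for illustration: the synthetic data (y roughly 3 times x), the one-parameter model, the batch size of 10, and the learning rate and momentum values.

```python
import random

rng = random.Random(0)
# Synthetic data (an assumption for the example): y is roughly 3 * x plus a little noise.
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]

def batch_gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x, on this batch only."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
velocity = 0.0
learning_rate = 0.05
momentum = 0.9
for step in range(500):
    batch = rng.sample(data, 10)             # a small random slice of the data
    g = batch_gradient(w, batch)             # noisy estimate of the full gradient
    velocity = momentum * velocity - learning_rate * g
    w += velocity                            # momentum carries over part of the previous step

print(w)   # should land close to 3
```

Each batch gives a slightly different gradient, so individual steps wander, but on average they point downhill. The momentum term smooths that wandering by blending in the direction of recent steps.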
Starting point is 00:53:12 Do you keep track of the gradient at previous steps? When do you slow down? When do you speed up? You know, those kinds of things. And they're really important in terms of engineering. But conceptually, it hasn't changed.
Starting point is 00:54:31 And so just so I think that I'm getting the right visualization here, because I know there's a trillion-dimensional thing, but I'm visualizing a two-dimensional landscape, because that's all I can do in my head. And the impact of the non-linearities, which you're emphasizing, I think would be
Starting point is 00:55:13 that if you just had linear neurons, a tiny change in input would always give you a tiny change in output. And therefore, your fitness landscape that you're trying to find the minimum of would be pretty smooth. It would be gently rolling hills. But now that you have these neurons that can sort of click on and off in a non-linear way, now you have a jagged landscape and it becomes much, much harder to sort of know from where you are and what your local conditions are where your actual minimum is lying. Yeah. And because of the size of these networks and the complexity of the computations that happen layer by layer, it's not even clear that there is a global minimum. These are not, you know, if you can think of a, you know, y is equal to x squared function, which is convex and you have a well-defined global minimum that you can
Starting point is 00:56:05 descend down to and that will represent the lowest loss, we are not guaranteed that these functions are convex. So not only are they jagged, they may not have a global minimum. So of the sort of mathematical work that is going on, a lot of it is in trying to figure out the actual nature of these loss landscapes. But okay, despite the fact that a lot of the math is sort of classical math, there are some more recent developments, right? I don't even, I guess I shouldn't say that because I don't know what dates different developments correspond to.
Starting point is 00:56:46 But you do talk about the curse of dimensionality. The number of dimensions is so large that one of the things you try to do is sort of find a good subspace where interesting things are happening using ideas like principal component analysis. Maybe you could make an effort to explain that to everybody. So the curse of dimensionality actually has been well known for a long, long time. It predates or is almost orthogonal to neural networks in terms of a problem. One of the best ways to understand this is a very classic machine learning algorithm that was developed in the 1960s called the nearest neighbor search algorithm. So here again, we can go back to our images of cats and dogs. Let's say images that are 10 pixels by 10 pixels. So each image is essentially 100 pixels.
Starting point is 00:57:45 So if we map it into 100 dimensional space, cats end up in one location and dogs end up in another location. And now we're given a new image. So the perceptron would have started by saying, oh, it's going to find a hyperplane that separates the two clusters of data and then checks to see which side of the hyperplane the new image falls, and then it classifies it as a dog or a cat. The k-nearest neighbor algorithm does something actually intuitively very simple. It basically says, okay, where does this new image map to in that hundred dimensional space? Is it closest to a dog or is it closest to a cat? If it is closer to a
Starting point is 00:58:23 dog, it's a dog. If it's closer to a cat, it's a cat. Right. And if you're just using one neighbor to make that discernment, then you can end up in all sorts of, you're essentially overfitting the data because you can have noise in your data, original data. And if your new image is closer to a noisy data point or a mislabeled data point, then you will get an error. So you can mitigate that by saying, okay, I'm going to look at three neighbors, or I'm going to look at seven neighbors or whatever, some odd number of neighbors, and then you take a majority vote, right? But that process depends on being able to calculate some sort of distance between these data points in high-dimensional spaces. So 100 dimensions is not high-dimensional for machine learning.
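The nearest neighbor vote is short enough to write out. The Python sketch below uses made-up two-dimensional points instead of 100-pixel images, with one deliberately mislabeled point; the data and the function name `knn_predict` are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Label a query point by majority vote among its k closest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [
    ((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"), ((0.8, 1.1), "cat"),
    ((3.0, 3.0), "dog"), ((3.2, 2.9), "dog"), ((2.9, 3.3), "dog"),
    ((1.1, 1.2), "dog"),   # a mislabeled point sitting inside the cat cluster
]

query = (1.1, 1.25)
print(knn_predict(training, query, k=1))   # dog: the single nearest neighbor is the mislabeled one
print(knn_predict(training, query, k=3))   # cat: a vote among three neighbors corrects it
```

Using an odd k avoids ties when there are two classes, which is the reason for the "three or seven" in the description.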
Starting point is 00:59:08 So you're basically going to use some sort of, let's say, Euclidean distance or, you know, there are various metrics you could use, but let's say you're using some sort of, you know, Euclidean distance is fine. Let's say you use that. And then what happens is as you increase the number of dimensions of your data, then there comes a point where the notion of similarity or dissimilarity that this algorithm depends on, that similar data points are closer together than dissimilar ones, that starts falling apart because in high dimensions, everything is as far away as everything else.
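One way to see distances losing their meaning is to measure, for random points, how much farther the farthest neighbor is than the nearest one. The Python sketch below does this for points drawn uniformly from a unit cube, which is an assumption for illustration (real data with structure can behave better); the function name is hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

In two dimensions the farthest point is many times farther than the nearest. By a thousand dimensions the ratio shrinks toward zero, so "nearest" carries very little information.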
Starting point is 00:59:52 So the notion of similarity and dissimilarity just doesn't work anymore. And that in a very simple way is the curse of dimensionality. And this is a serious problem for machine learning because, you know, the number of dimensions is also the number of features that you have in your data. So in this case, each pixel is a feature of your image. But you could have other kinds of data where, let's say, you're thinking about penguins and you have, you're trying to analyze penguins by looking at the length of their beak,
Starting point is 01:00:25 the depth of their beak, or their flipper length, et cetera, et cetera. A penguin can be characterized by, I don't know, 10 such features, and you have a pretty good idea of what kind of penguin it is. But there will be situations where 10 is not enough, where you probably need 10,000 features to figure out what might be happening in terms of classifying an object. Or even more, and the more and more features you add, you end up with this curse of dimensionality.
Starting point is 01:00:50 You're just not going to be able to, you know, do what you want. So one obvious thing is to do something like PCA, principal component analysis, where you essentially project the data back into lower dimensions and hopefully capture most of the variance in your data along the few lower dimensional axes that you've chosen. Right. So if the data varied equally in all of the higher dimensional axes, then you're in trouble. You'd be stuck.
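Here is a minimal Python sketch of the projection idea, reducing two dimensions to one. The synthetic data (points scattered tightly around the line y = 2x) is an assumption for the example, and the principal axis is found by power iteration on the covariance matrix, one simple method among several; library implementations typically use a singular value decomposition instead.

```python
import math
import random

rng = random.Random(0)
# Synthetic 2-D data (an assumption for the example): points scattered tightly around y = 2x.
points = []
for _ in range(300):
    t = rng.gauss(0, 1)
    points.append((t + rng.gauss(0, 0.1), 2 * t + rng.gauss(0, 0.1)))

# Center the data and build the 2x2 covariance matrix.
mx = sum(p[0] for p in points) / len(points)
my = sum(p[1] for p in points) / len(points)
centered = [(x - mx, y - my) for x, y in points]
cxx = sum(x * x for x, _ in centered) / len(centered)
cyy = sum(y * y for _, y in centered) / len(centered)
cxy = sum(x * y for x, y in centered) / len(centered)

# Power iteration: repeatedly applying the covariance matrix converges to the top principal axis.
v = (1.0, 0.0)
for _ in range(100):
    v = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*v)
    v = (v[0] / norm, v[1] / norm)

projected = [x * v[0] + y * v[1] for x, y in centered]      # 2-D data reduced to 1-D
explained = (sum(p * p for p in projected) / len(projected)) / (cxx + cyy)
print(v, explained)   # axis points along roughly (1, 2); nearly all variance is kept
```

If the cloud were round instead of elongated, `explained` would drop toward one half, which is the "varies equally along every axis" case where the projection throws away too much.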
Starting point is 01:01:25 But if you can, if there is something about the structure of your data such that if you do bring it down into lower dimensions and it still captures most of the variance of your data along those fewer dimensions, then you can take that lower dimensional data and do your classification on that. You could train your machine learning model on the lower dimensional data. So that's one way of tackling. But, you know, oddly, higher dimensions are also really, really useful. So for instance, if you have data that, let's say, is 100-dimensional, and you cannot build a linear classifier because the two clusters of data are not linearly
Starting point is 01:02:07 separable in 100 dimensions. Or let's not even go into 100-dimensional. Let's just take two-dimensional data, right? a smattering of dots on an XY plane. One is colored red and the other is colored green, but the red and green are kind of mixing up such that you cannot draw a straight line to separate the two. Now, one easy trick would be to actually project this data
Starting point is 01:02:31 into higher dimensions, so that, let's say, you go into three dimensions or four dimensions. For instance, if your red-colored dots are centered around the origin and your green-colored dots are an annular ring around the red-colored dots, there's no straight line that's going to separate the two clusters in two dimensions. But you can imagine adding a third dimension where you're just multiplying the X coordinate and the Y coordinate to create a Z coordinate, right? Or rather, will multiplication be enough?
Starting point is 01:03:06 Maybe not. You'll have to square the X coordinate and add it to the square of the Y coordinate. And then when you plot this data in three dimensions, the green dots are going to rise above the red dots. And then you can draw a hyperplane between the two. You can use a linear classifier to separate out the green dots and the red dots in three dimensions.
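The red-dots-and-green-ring example can be checked directly. In this Python sketch the radii (red within 1 of the origin, green between 2 and 3) and the separating height of 2.5 are assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(0)

def ring_point(r_min, r_max):
    """Random point whose distance from the origin lies between r_min and r_max."""
    r = rng.uniform(r_min, r_max)
    angle = rng.uniform(0, 2 * math.pi)
    return (r * math.cos(angle), r * math.sin(angle))

red = [ring_point(0.0, 1.0) for _ in range(100)]     # cluster around the origin
green = [ring_point(2.0, 3.0) for _ in range(100)]   # annular ring around it

def lift(p):
    """Add a third coordinate z = x^2 + y^2."""
    return (p[0], p[1], p[0] ** 2 + p[1] ** 2)

red_z = [lift(p)[2] for p in red]
green_z = [lift(p)[2] for p in green]

# In 3-D the flat plane z = 2.5 separates the classes; back in 2-D that plane
# corresponds to the circle x^2 + y^2 = 2.5.
print(max(red_z) < 2.5 < min(green_z))   # True
```

The separating surface is flat in the lifted space and curved in the original one, which is the whole trick.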
Starting point is 01:03:30 And then once you've figured out what that separation is, when you project it back into two dimensions, you will get a sort of nonlinear curve, which is going to be some sort of oval shape that separates out the annular ring from the dots in the center, right? So this is something we can visualize in two dimensions and three dimensions, but this exact thing is also done using this extraordinary technique called kernel machines or kernel methods where the idea is that you want to project your data into high dimensions. And if you want to find a linear classifier in high dimensions, the algorithm requires you to take dot products of vectors of the data points. And as you go up to higher and higher dimensions,
Starting point is 01:04:14 the dot products are computationally expensive. Okay. Right. I mean, because let's say you started off with a hundred dimensional data, which is technically low dimensional and you project, and you can't find a linearly separating hyperplane and you project it into a million dimensions. Well, you can find that separation, right? But now you've basically moved your computation of dot products from a hundred dimensional space all the way into million dimensional space. And you're essentially creating a computationally intractable problem. Kernel methods are this amazing technique where you have to find a function that takes in two low dimensional vectors and spits out a number that is equal to the dot product
Starting point is 01:04:57 of the corresponding two vectors in the higher dimensions. So let's say there is a vector x in the low dimension, which maps to a higher dimensional vector phi of x. And there's another vector y in lower dimensions which maps to another higher dimensional vector which is phi of y. So in the higher dimensions, the dot product that you would need
Starting point is 01:05:23 is phi of x dot phi of y. And that might be million dimensions, dot million dimensions, very expensive. You have another function called the kernel. It's called k. So you feed in the two low dimensional vectors and it spits out a number that is actually equal to phi of x
Starting point is 01:05:41 dot phi of y, except you've never gone into the million dimensions. You're operating in the 100 dimensional space. Your function just takes in two 100-dimensional vectors and gives you a number, a scalar, which is equal to the dot product in the higher dimensional space. And once you have this, you can now run your linear classifier in million dimensional space
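The identity between the kernel and the high-dimensional dot product can be verified numerically for one simple case. The Python sketch below assumes a particular kernel, the quadratic kernel k(x, y) = (x · y)², whose feature map phi lists every pairwise product of coordinates; the episode does not name a specific kernel, so this choice and the function names are illustrative.

```python
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def phi(x):
    """Explicit map to the higher-dimensional space: every pairwise product x_i * x_j."""
    return [xi * xj for xi in x for xj in x]

def kernel(x, y):
    """Quadratic kernel: computed entirely in the low-dimensional space."""
    return dot(x, y) ** 2

rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(100)]
y = [rng.uniform(-1, 1) for _ in range(100)]

explicit = dot(phi(x), phi(y))   # dot product of two 10,000-dimensional vectors
shortcut = kernel(x, y)          # same number from two 100-dimensional vectors
print(len(phi(x)), abs(explicit - shortcut) < 1e-9)   # 10000 True
```

The shortcut costs about a hundred multiplications where the explicit route costs about ten thousand, and the gap grows with the degree of the kernel. For kernels such as the Gaussian, the corresponding phi is infinite-dimensional and only the shortcut can be computed at all.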
Starting point is 01:06:05 without ever stepping into the million-dimensional space. And the amazing thing about these kernel methods is that you can project it to infinite dimensions where you're guaranteed to find a linearly separating hyperplane. And technically you can never even write down what a million-dimensional, or infinite-dimensional, vector is going to be computationally. But your kernel function will take two lower-dimensional vectors,
Starting point is 01:06:31 give you a scalar value that is the dot product of two infinite-dimensional vectors in infinite-dimensional space. So your linear classifier now is operating in infinite-dimensional space where it finds a hyperplane, and then when it's projected back into your lower-dimensional space, you've found some very intricate nonlinear boundary. It's interesting to me how much of the effort needs to go into these sort of speeding up processes. Like you might find an algorithm that would do amazing things, but if it takes 10,000
Starting point is 01:07:01 years to run, it's not going to be very helpful to you in the real world. No, I think that's what's amazing about this whole enterprise is there's so much engineering chops that is needed and math-informed engineering. And I know that one of the big papers that really made a revolution more recently was the transformer architecture. And I've made a little bit of effort to understand what that means, but I've kind of failed. Is it possible for you to explain why that was important? Yeah, so you're talking of the Attention Is All
Starting point is 01:07:35 You Need paper, 2017. Yes, yes. So, yeah, so I'll do my best to see how to explain that. It is an amazing paper when you think about, you know, one paper which changed the course of AI, right? This is not to say that the paper came out of the blue. I mean, there was a lot of work that was happening that led up to that paper, but it was a very transformational paper because I think maybe it's almost simpler
Starting point is 01:08:06 rather than the paper, right? So the way a large language model works is that let's talk about the training process. You take a sentence. Let's say, this is a sentence I keep taking in my talks. The dog ate my homework. Right? And you blank out the last word, homework.
Starting point is 01:08:35 So you have the first four words, the dog ate my blank, and then you feed it to your model, and you're asking it to predict what is to follow. Now, traditionally when we do next word prediction, we tend to look at, you know, before all this happened with LLMs, we were kind of looking at maybe the previous one word or two words in order to predict what the next word might be. If you just looked at the last word in that sentence that you had, the dog ate my. If you looked at my and you tried to predict the next word, you would be completely wrong because you would be saying something like my poem, my dog.
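The limitation of short contexts described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-sentence corpus is made up, and the helper next_word_counts is a hypothetical name, not something from the episode or the book. It counts which words follow a one-word or two-word context.

```python
from collections import Counter, defaultdict

# Made-up toy corpus, for illustration only.
corpus = [
    "the dog ate my homework",
    "i read my poem",
    "i walked my dog",
    "she ate my lunch",
]

def next_word_counts(corpus, context_size):
    """Count which word follows each context of `context_size` words."""
    table = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            table[context][words[i]] += 1
    return table

one_word = next_word_counts(corpus, 1)
two_word = next_word_counts(corpus, 2)

# Candidates for the next word given only "my", then given "ate my".
after_my = one_word[("my",)]
after_ate_my = two_word[("ate", "my")]
print(dict(after_my))
print(dict(after_ate_my))
```

With only "my" as context, every continuation in the toy corpus is equally plausible; adding "ate" narrows the candidates but still cannot single out "homework" without the word "dog".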
Starting point is 01:09:14 It could be anything, right? Just about anything. So the model will have no idea, you know, how to predict the next word. If you took two words, the dog ate my. And if you looked at ate and my and then said what should be the word to follow ate my, you'll probably say lunch or dinner or something, again, completely wrong in the context of this sentence. It's only when you look at the word dog that you realize,
Starting point is 01:09:40 oh, that should be, we know that this is a very popular sentence for children as an excuse to their teacher about why they didn't do their homework. So the dog ate my homework. So what happens when you feed these words to a large language model is the first thing that the AI does is it turns these words into vectors. So they're called embeddings. So each of these words, each of these four words are turned into vectors in some high-dimensional space, let's say, you know, a thousand-dimensional space.
Starting point is 01:10:16 So and these vectors then just flow through the deep neural network that is the black box that is being called the transformer. And what it has to do is, like, if it was just looking at the final vector, which represented the word my, and using that to predict the next word, it'll probably get it wrong. So as those four vectors flow through the deep neural network, which is the transformer, it has to contextualize it. It has to keep massaging those vectors such that the words start paying attention to each other. So the four vectors are just simply moving through the layers of the network. And at each layer, the four vectors have changed such that they capture something about each other. And so they're kind of, and hence the term attention, they're paying attention to each other.
Starting point is 01:11:13 So after the first transformation, maybe the four vectors have changed enough that you are predicting something that is close to homework, but not quite. And then as you keep going through the transformer layers, in the final analysis, at the very end, you still have four vectors. But now the fourth vector has so much information, contextualized information. It knows that it has paid attention to all the other words, and the vector has changed such that now the LLM can say, oh, I can look at that last vector and I know that the next word should be homework.
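The idea of vectors "paying attention to each other" can be sketched as a single, heavily simplified attention step. The assumptions here are significant: the three-dimensional embeddings are made up, and the sketch omits the learned query, key, and value matrices and the multiple heads that a real transformer uses. It also assumes a causal setup in which each word only looks at itself and earlier words. The function name causal_self_attention is hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def causal_self_attention(vectors):
    """One simplified attention step (no learned matrices): each vector
    becomes a weighted average of itself and the vectors before it."""
    d = len(vectors[0])
    out = []
    for i, query in enumerate(vectors):
        # Similarity of this word to positions 0..i, scaled by sqrt(d).
        scores = [dot(query, vectors[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        mixed = [sum(w * vectors[j][k] for j, w in enumerate(weights))
                 for k in range(d)]
        out.append(mixed)
    return out

# Made-up 3-dimensional embeddings; real models use far more dimensions.
embeddings = {
    "the": [0.1, 0.0, 0.2],
    "dog": [0.9, 0.3, 0.0],
    "ate": [0.2, 0.8, 0.1],
    "my":  [0.0, 0.1, 0.7],
}
sentence = ["the", "dog", "ate", "my"]
vectors = [embeddings[w] for w in sentence]
contextualized = causal_self_attention(vectors)
print(contextualized[-1])  # the vector for "my" now mixes in "the", "dog", "ate"
```

Four vectors go in and four vectors come out, but the last one is no longer just the embedding of "my": it is a blend that carries information from the earlier words, which is the contextualization described above.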
Starting point is 01:11:49 And the attention mechanism is essentially the process that allows the transformers to contextualize these vectors. And it's a whole bunch of matrix manipulations. It's just very, very neat matrix math going on. And then you just spit out these four vectors at the end, you just look at the final vector, which is the vector for the word my. But now it has knowledge about the fact that it had paid attention to ate and dog and all of that, and it can allow you to make the prediction that the next word should be homework. During training, of course, it will make an error because all of the weights of the network
Starting point is 01:12:28 are randomly initialized. The matrix stuff that the transformer is doing has to be learned. It needs to learn what it has to pay attention to given a certain sentence. So in the beginning when it predicts a word, it might predict something completely wrong. In fact, it will predict something completely wrong. But you know what the right word should be. So what the language model is predicting at the very end is a probability distribution over its vocabulary.
Starting point is 01:12:56 It's basically saying, oh, if my vocabulary is a thousand words, then here's the probability distribution over my vocabulary as to what is the most likely next word. And it's going to get it wrong in the beginning. But you know what the correct probability distribution over your vocabulary should be. It should be one for the word homework and zero for everything else. and so you then calculate an error, and that error is a function of all of the, you know, 500 billion or a trillion parameters in your large language model,
Starting point is 01:13:28 and you do back propagation all the way through your network to fiddle with the weights of the network, so that the next time you give the same sentence, it predicts a word that is a tiny bit closer in that probability distribution space to the word that you want. And so as you keep tweaking with every back propagation step, your network will get better and better at predicting the fact that the next word should be homework. But that's just for one sentence. Now, imagine doing this for every sentence that you scraped off the internet.
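The loop described above (predict a distribution, compare it with the one-hot target, nudge the parameters, repeat) can be sketched as follows. This is a toy under stated assumptions: the five-word vocabulary and the starting scores are made up, cross-entropy is assumed as the error measure, and the sketch adjusts the output scores directly rather than backpropagating through billions of network weights.

```python
import math

# Made-up five-word vocabulary; the target is one for "homework", zero elsewhere.
vocab = ["homework", "lunch", "dinner", "poem", "dog"]
target_index = vocab.index("homework")

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Error against a one-hot target: -log(probability of the right word)."""
    return -math.log(probs[target_index])

# Made-up starting scores standing in for an untrained model's output.
logits = [0.2, 1.5, 1.1, -0.3, 0.4]
initial_probs = softmax(logits)
learning_rate = 0.5

losses = []
for step in range(20):
    probs = softmax(logits)
    losses.append(cross_entropy(probs, target_index))
    # Gradient of cross-entropy with respect to the scores: probs - one_hot.
    grads = [p - (1.0 if i == target_index else 0.0)
             for i, p in enumerate(probs)]
    # Small step downhill; a real model updates its weights via backpropagation.
    logits = [z - learning_rate * g for z, g in zip(logits, grads)]

final_probs = softmax(logits)
print(round(losses[0], 3), round(losses[-1], 3))
```

Each step moves the predicted distribution a little closer to the target, so the probability assigned to "homework" rises and the error falls, which mirrors the "tiny bit closer" behavior described for a single sentence.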
Starting point is 01:14:02 And that's why training these language models takes months. Yeah. And that was great that you did it. I think I do understand. I think this is the first time in my life. So thank you very much for that. So we're near the end of the podcast. The final question is going to be a completely unfair one.
Starting point is 01:14:19 So answer it to whatever level you want to answer it. Given that you studied some of the math, some of the history of how these have gone, what is your feeling about the future of progress in these kind of AI landscapes? Is it more just going to be scaling and we have more computing power, more data? or is there some conceptual leap out there remaining to be made that's going to make everything very different? My sense is that scaling, well, scaling what? So currently when we talk of scaling things up, we're talking of scaling up large language models. And large language models, the reason why scaling them up alone will not get us to any kind of generalized intelligence is potentially because,
Starting point is 01:15:09 A, we have no mathematical guarantee that a language model is 100% accurate. You cannot guarantee accuracy, right? Because the output is, it's outputting a probability distribution over its vocabulary. With every forward pass, that's what it produces. It produces a probability distribution over its vocabulary, and then you sample from that distribution. So there is an in-built stochasticity there. And there's no mathematical guarantee that the probability distribution it produces, even if you
Starting point is 01:15:46 sample the most likely next word out of that distribution, is going to be the word that you want, right? Or token. So scaling up alone is not going to get us to a place where we are 100% sure of the accuracy of the model. The other problem with large language models is that they are extremely sample inefficient. They require enormous amounts of data to get to where they are, right? And the reason why scaling has worked so far is because this entire process of training a large language model is more or less hands off in terms of human inputs.
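The point made above about sampling from the output distribution can be sketched as follows. The vocabulary and probabilities are made up for illustration, and the helper names greedy_pick and sample_pick are hypothetical. The sketch contrasts always taking the most likely word with drawing a word at random according to the distribution.

```python
import random

# Made-up output distribution over a tiny vocabulary.
vocab = ["homework", "lunch", "dinner", "poem"]
probs = [0.70, 0.15, 0.10, 0.05]

def greedy_pick(vocab, probs):
    """Always return the single most likely word."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return vocab[best]

def sample_pick(vocab, probs, rng):
    """Draw one word at random, weighted by its probability."""
    return rng.choices(vocab, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_pick(vocab, probs, rng) for _ in range(1000)]
counts = {w: samples.count(w) for w in vocab}

print(greedy_pick(vocab, probs))
print(counts)
```

Sampling returns "homework" most of the time but not every time, which is the built-in stochasticity mentioned above. Greedy picking removes the randomness, but it is only as good as the distribution itself, so it still offers no guarantee that the most likely word is the correct one.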
Starting point is 01:16:21 You know, you just scrape some amount of text information from the internet. You, you know, mask the last word and ask the network to learn how to predict the next word, right? Yeah. With the masked word. And that's a process that can be completely automated. And hence, I mean, they're able to keep scaling up. And they've managed to do that now for a long time, and the results are pretty amazing. But given its sample inefficiency, given that it has no guarantee of correctness,
Starting point is 01:16:51 even though they're getting much better at being correct, but there's no guarantee. It's an asymptotic thing. So we're not going to guarantee 100% accuracy. Given those two things and other concerns, I think most people in the field are expecting some sort of, you know, something similar to what happened with the Attention Is All You Need paper. That paper changed everything. We're probably one or two steps like that away from an AI that is capable of generalizing to questions that it hasn't seen. Answering questions about patterns that don't exist in the training data. So effectively, going back to our early argument, doing what Kepler did.
Starting point is 01:17:34 Yeah. Right? And LLMs are not those kinds of systems, very unlikely. But you never say no with these things. But my hunch is that we're two or three breakthroughs away from something quite transformative. I like that. That gives the youngsters in the audience something to think about and something to try to do. So, Anil Ananthaswamy, thanks so much for being on the Mindscape podcast. This was great. Thank you, Sean. It's my pleasure.
