Microsoft Research Podcast - Will machines ever be intelligent?

Starting point is 00:00:01 This is The Shape of Things to Come, a Microsoft Research podcast. I'm your host, Doug Burger. In this series, we're going to venture to the bleeding edge of AI capabilities, dig down into the fundamentals, really try to understand them, and think about how these capabilities are going to change the world for better and worse. In today's podcast, I'm bringing on two AI research experts: Niccolo Fusi, who is an expert in digital, transformer-based large language model architectures and learning,

Starting point is 00:00:37 and Subutai Ahmad, who is an expert in biological architecture, specifically the human brain. And the question we're going to discuss is, are machines intelligent? And what I mean by that is, are digital intelligences, large language models, on a path to surpass humans, or are the architectures just so fundamentally different that one will do one set of things well and the other will do something else very well?

Starting point is 00:01:04 And so we'll be debating the architecture of intelligence across digital implementations and biological implementations. Because the answer to that question, I think, really will determine the shape of things to come. I'd like to ask each of my guests to introduce themselves, tell me a little bit about your background, and what you're currently working on to the extent you can talk about it in AI. So Niccolo, would you please start? Yeah. Thank you, Doug, for having me here. It's so much fun. So I'm Niccolo Fusi. I'm a researcher at MSR. So Doug is my boss. So I will be very, very, very good to Doug in this podcast. No, but jokes aside, my own background is in Bayesian nonparametrics. That's what I started studying. So Gaussian processes and things like that. And then equally, I would say, in computational biology,

Starting point is 00:02:05 because I found it like one of the most interesting use cases for AI techniques. And that kind of has been true throughout my career. And pretty much like everybody else, eventually I moved away from the kernel methods and the Bayesian nonparametrics, and I started working more on language models, transformer models, with a particular eye towards information theory and the connection between information theory and generative modeling. And that's kind of one of the main things I do today

Starting point is 00:02:36 other than kind of managing the research of people who do much more interesting work than I do. I have to interject there, Niccolo, because you dragged a piece of bait across my path. In Microsoft Research, I have a management rule that I can't tell anyone what to do because we hire some of the best people in the world. You have to trust them.

Starting point is 00:03:01 And everyone is always completely free to call BS on me. And so Niccolo was joking there. He does not have to toe the party line. In fact, I encourage him not to. So, so thank you. I just have to be well-behaved. That's the only thing I will say. Yeah, thank you. Thank you for baiting me because he knew exactly what he was doing. And I love him for it. Subutai, can you tell us a little bit about yourself? Sure. Thank you so much, Doug, for having me. I'm really looking forward to the conversation between us all. So I see myself fundamentally as a computer scientist. You know, I've been studying computer science for longer than I care to admit. But something changed for me during my undergrad years: I decided to minor in

Starting point is 00:03:46 cognitive psychology. And I started to get really interested in how the brain works. And to me, understanding intelligence and implementing intelligence was the hardest problem a computer scientist could ever solve. So I got very, very interested in that. You know, I couldn't see how to really commercialize that. I was very interested in making products and stuff. So I stopped, you know, working on that for a while. I did a number of startups doing computer vision, you know, video processing, a lot of that stuff. And then when Jeff Hawkins started Numenta back in 2005, with the idea of really deeply understanding how the brain works and figuring out how to apply that to AI, for me, it was like all my worlds coming together. This, like, this is what I had to do.

Starting point is 00:04:30 None of us thought it would take as long as it did. We spent the last couple of decades really deeply trying to understand neuroscience from a computer scientist's, from a programmer's standpoint, the underlying algorithms. And that's really what I'm passionate about, just trying to translate what we understand about the neuroscience to today's AI. And in terms of what we're working on today, it's, you know, the human, maybe we'll get into some of this. The brain is super efficient in how it works, power efficient, energy efficient. And we're trying to embody those ideas and trying to make AI a lot more efficient than it is today. Great. I think we'll get into efficiency a little bit later in the podcast because that's a subject that's near and dear to my heart,

Starting point is 00:05:12 being a computer architect originally by training. I want to go back to one of the reasons I got involved with Numenta. Subutai and I have been exchanging emails, like discussing collaborations, you know, visiting each other through the years. And the thing that really stuck with me was when I read one of the earlier books from Jeff, On Intelligence. And there was an example in the book that talked about how, you know, the human brain learns continuously.

Starting point is 00:05:45 I think biological organisms in general learn continuously. And the anecdote that I remember was this anecdote where you're walking down your basement steps, you know, you're walking down the stairs to your basement, and there's one step that's always been a few inches off and you decide to fix it. And so you raise it so it's even with the others. And then the next time you go down the stairs, you don't remember, and you're wildly off and you hit that step, you hit it earlier or later than you anticipated. You go out of balance, you're flailing around, you get all this adrenaline, and you think

Starting point is 00:06:14 you're going to pitch head first down the stairs, hopefully you don't. And then the second time you do it, you're a little off balance, but it's not crazy. And the third time, you maybe notice it a little bit. And the fourth time, it's like it's your basement stairs. And so somewhere between that first time down and the third and fourth times down, there are molecular changes in your brain that have learned the new timing of your basement steps. And I remember just that example vividly from the book, and that got me thinking, wow, this is so different from the way our digital AI works. I just, I'll turn it over to you to comment on that. And I think we'll go into the digital.

Starting point is 00:06:52 Yeah, no, that's a great example. I think it's remarkable how our brain is constantly modeling our entire world at such a granular level. And we're not even aware of it perceptually. Like, you know, that example of the steps is probably not, you wouldn't consciously be aware of it. Yet, if something is different about anything in your world that you're very familiar with, you'll instantly notice it. And then you'll, you know, you'll update your world model. You'll adjust and you'll continue on. It's really remarkable how the brain's able to do that so seamlessly.

Starting point is 00:07:27 And a lot of that is based on neurotransmitters, right? Because there's just a, you know, when you have that physical reaction to I'm about to pitch down the stairs, you get a flood of transmitters that actually changes the way your brain learns, or at least the rate. Yeah, there's a flood of neurotransmitters and neuromodulators as well that invoke change sometimes very rapidly. Another example, you know, if you touch a hot stove, that's the canonical example. You will learn that very, very quickly.

Starting point is 00:07:55 So there's a lot of chemical changes that happen, but it's also really interesting that we can update things and update our world knowledge without impacting everything else that we know. This is something that's very, very different, again, from today's AI models. We're able to make these changes in a very contextual and very sort of fine-grained way. So Niccolo, I want to go and talk a little bit now about transformers.

Starting point is 00:08:20 So I think, you know, you and I and Subutai were all working in the AI field, you know, many years before 2017 when the transformer hit. You know, I was building, you know, with my team, hardware to accelerate RNNs, LSTMs, you know, which had this awful loop-carried dependence, you know, the bottleneck computation. And then the transformer was just much more parallelizable. So what do you think is really going on in these things? And maybe we could start, I know you and I've talked a lot about this. Maybe you should start with the major blocks. You know, you've got, you've got the attention layer, you've got the feed-forward layer, you've got the encoder stack and the decoder stack, and the latent space in between. Can you just kind of walk us through those pieces at a high level

Starting point is 00:09:10 and tell us what you think is going on? Yeah. I mean, I have a very opinionated view of why transformers are so great. That's why you're here. Maybe, like, maybe I'll inject it. I don't know. I don't know if it's a super novel creative opinion, but it is an opinion. So I guess the two main components you already described, the transformer layers and the feed-forward layers.

Starting point is 00:09:34 One way to think about them is how does information in your context relate to each other? And how do I, what is every token referring to? For instance, in the case of transformers in language models. So by context, we mean the information you feed through the model, that the model continues generating and appending to. So like your chat history. Your prompt.

Starting point is 00:09:55 So your chat history or your particular prompt in the chat session. That prompt, which is a sequence of words, gets discretized into a series of tokens. Tokens can be individual words, can be multiple words kind of connected together. The way we go from words to tokens typically is through an algorithm that tries to basically collapse as much as possible multiple words, like "the dog", into maybe just one token, as a first kind of level of compression to feed into the model. So it just tries to bring things together as efficiently as possible. Then there is that, you know, within these models,

Starting point is 00:10:30 there is a transformer layer, this transformer layer, or this attention layer, sorry, tries to basically figure out what the "the" refers to, the term "the" in "the dog", or in "the dog jumps on the table", "jumps" refers to "the dog". So there is this kind of, like, mapping that happens. And then there are, like, feed-forward layers, which in modern large language

Starting point is 00:10:56 models, they store a lot of information. Like that's kind of like where the knowledge typically kind of sits in. The things that the model just knows, you know, that I don't know, that if you, if you slam your arm against a cup of water on your table, that cup of water falls off the table, that's something that the model kind of has baked in through reading a lot about cups falling off of tables when they're hit. So that's kind of, those are, for me, the two fundamental components. And the reason why I have an opinionated view is that, you know, honestly, I do believe that RNNs and, you know, even modern incarnations of state-space models are good enough to learn over these, you know, language data or whatever, vision data or audio data. The good thing about transformers is that they do two things very well. One is they get out of the way. They don't have this notion of everything has to be encoded through a

Starting point is 00:11:56 state like the recurrent networks. And two, they do that very computationally efficiently, as you were saying. There isn't a computational bottleneck. And so they created this nice overhang where they happen to be the right architecture at the right time to unlock enough flow of information through the model that we could get to these amazing things. Let me press you on one thing. In the attention blocks, you can figure out which words or which tokens relate to which tokens. So I put in the prompt, and it's finding all the relations and then feeding those relations up to the feed-forward layer, well, the feed-forward unit within a layer. And you said that knowledge is encoded there, but then what does it really mean for those maps to then access knowledge, but then you project it back into,

Starting point is 00:12:45 you know, the output and then feed it up to the attention block in the next layer. So it seems kind of weird that I'd be accessing knowledge and then taking that knowledge, merging it and going back to another attention map. Well, you can see there's a mixing operation that happens in the feed-forward part of the layer. You know, like you're attending, then you're mixing and kind of like reprojecting to some space with higher information content or like a different level of information extraction. And then you're putting it back into, okay, so let me do another run of processing and kind of attending and then I mix again and then I do it again and then I do it again.
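The attend, mix, repeat loop described here can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any production model is built: it omits the learned query/key/value projections, multiple heads, normalization, and masking, and the weights are random rather than trained. The names `attention`, `feed_forward`, `layer`, and `run_stack` are hypothetical.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax: the weights are positive and sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def attention(tokens):
    # Each token scores every token (itself included) by dot product,
    # then takes a weighted average of all token vectors.
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def feed_forward(vec, w_in, w_out):
    # The "mixing" step: project up, apply ReLU, project back down.
    hidden = [max(0.0, h) for h in matvec(w_in, vec)]
    return matvec(w_out, hidden)

def layer(tokens, w_in, w_out):
    # Results are added to the input (residual connection), so earlier
    # information is carried forward rather than overwritten.
    attended = attention(tokens)
    tokens = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    mixed = [feed_forward(tok, w_in, w_out) for tok in tokens]
    return [[t + m for t, m in zip(tok, mix)] for tok, mix in zip(tokens, mixed)]

def run_stack(tokens, n_layers, hidden=8, seed=0):
    # Random weights stand in for trained ones (assumption for illustration).
    rng = random.Random(seed)
    d = len(tokens[0])
    for _ in range(n_layers):
        w_in = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(hidden)]
        w_out = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(d)]
        tokens = layer(tokens, w_in, w_out)
    return tokens

prompt = [[1.0, 0.0, 0.0, 0.0],   # stand-in vector for "the"
          [0.0, 1.0, 0.0, 0.0],   # stand-in vector for "dog"
          [0.0, 0.0, 1.0, 0.0]]   # stand-in vector for "jumps"
refined = run_stack(prompt, n_layers=3)
```

Each pass keeps one vector per token and the same vector width, which is one way to picture refinement without layer-by-layer summarization.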

Starting point is 00:13:30 So I think that the information that is present in the prompt and that has been baked into the weights gets further and further refined, whether that refinement is extraction of structure, or addition of aggregation into higher level concepts. I'm not sure. I think it's just structure gets extracted and things that are irrelevant get kind of pushed away. But that doesn't necessarily mean that it gets aggregated through the architecture.

Starting point is 00:13:59 So now I'm going to try to restate what I think I hear you saying. So we're adding information and we're kind of adding information at a higher level, but not necessarily throwing away the low-level information, at least that's not relevant. Right, because if the higher level stuff depends on the low level stuff, I have to have that first. And so then you get to the top of the encoder block and you're in the latent space with all of that information kind of maximized. Is that a way to think about it?

Starting point is 00:14:27 And if you agree, can you talk about what the encoder block really is and what the latent space is? I tend to agree. Yes. I mean, you're describing, I think you're describing what I think is happening, which is, there is, given the context in your prompt and given the task the model perceives or like figures out that you're doing, it has to highlight and pull out the relevant information. And it does that not by summarizing layer by layer, but it does it by, you know, increasing the prominence of that information and suppressing other things.

Starting point is 00:15:09 So I think that's ultimately what happens, up to the point where you reach this beautiful space, a point in concept space, which identifies both your intent and the things in the prompt and in the knowledge of the model that are necessary to solve it. And so one last question, and then I want to go to Subutai for a second. So now when we go through the decoder stack, are we just going the other way and stripping out the high-level concepts early

Starting point is 00:15:37 and getting down to the granular tokens? Or, you know, because you go up through the encoder stack, those attention blocks and feed-forward layers, to get to that magical latent space. And now we're going to go the other direction. How do you think about that other direction through the decoder stack, which is the same primitives as the encoder stack?

Starting point is 00:15:55 Same primitives. You can think of it as kind of the inverse operation. Like you never lost information throughout. You just kind of suppressed or privileged different kinds of information. And now you're basically just projecting it back out to a space that is, you know, intelligible. And it's kind of where the model gets, I hesitate to use the term reward because it has a particular implication, but that's kind of where the loss gets computed and then gets pushed back through the model. Right. As you're trying to evolve and train all those

Starting point is 00:16:32 parameters, the relationship between words, the information in the feed-forward layers, the design of that latent space and the extraction of the knowledge from it. That's right. And so in an encoder-decoder model, you push through the whole thing, you decode back to a particular token, which for people who don't know, it's like literally a number out of a vocabulary, like word number 487. And if it was word number 1,500, you get, you know, like, it was a bad reward. Yeah.
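The "word number 487 versus word number 1,500" point can be made concrete with the usual next-token training loss, cross-entropy. This is a minimal sketch under assumptions: the vocabulary size and the score of 5.0 are made-up numbers, and `cross_entropy` is a hypothetical helper rather than any library's API.

```python
import math

def cross_entropy(logits, target_id):
    # Negative log-probability that the model assigns to the correct token.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_id]

vocab_size = 2000                 # hypothetical vocabulary size
logits = [0.0] * vocab_size       # one score per token id
logits[487] = 5.0                 # the model strongly prefers token 487

loss_if_right = cross_entropy(logits, 487)    # correct answer was 487
loss_if_wrong = cross_entropy(logits, 1500)   # correct answer was 1500
```

The loss is small when the preferred token matches the target and larger when it does not. That difference is the signal pushed back through the model during training.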

Starting point is 00:17:03 Yeah. And if you got it right, you get a positive signal, and that just flows back through the model. I'd like to go over to Subutai now. So after hearing this, you've studied, you know, neuroscience and the neocortex and cortical columns and all of this for a long time, and you and I have had lots of debates. Is the human brain doing something different than that?

Starting point is 00:17:26 You know, are we just building latent spaces, then extracting? The architecture is very different, but what's going on under the hood? Yeah, the architecture is very different. You know, as Niccolo was describing what happens throughout a transformer stack, I was trying to relate, you know, what we know in the brain as well. In a typical, you know, transformer model, there is, at the end of the day, a single latent space from which the next token is output. That does not happen in the brain.

Starting point is 00:17:58 There are thousands and thousands of latent spaces that are sort of collaborating together, if you will. You know, a lot of what we publish is under the moniker, the Thousand Brains Theory of Intelligence, and Jeff published a book a few years ago on that. And that kind of dates back to discoveries in neuroscience from the 60s and 70s by the neuroscientist, Vernon Mountcastle, who was a professor at Johns Hopkins. And what he discovered, he made this remarkable discovery that, you know, our neocortex, which is the biggest part of our brain, that's where all intelligent function happens,

Starting point is 00:18:32 is actually composed of roughly 100,000 of what he called cortical columns. And each cortical column is maybe 50,000 neurons, and there's a very complex microcircuit and microarchitecture between the neurons in a cortical column. But then there's 100,000 of them in every part of your brain, whether it's in visual processing, auditory processing, language, thought, motor actions. They're all composed of essentially the same microarchitecture.

Starting point is 00:19:06 And this was a remarkable discovery. It says that there's a universal architecture. It's not a simple one. It's complex, but it's repeated throughout the brain. And that's where the idea of the thousand brains comes from: each of these cortical columns is actually a complete sensory motor processing system. It has inputs. It has outputs.

Starting point is 00:19:28 It's getting sensory input. It's sending outputs to motor systems. And it's building, in our theory, complete world models. So there isn't a single latent space. There's thousands of these latent spaces. And each little cortical column is trying to understand its little bit of the world. You know, one cortical column might be getting at the lowest level, maybe one degree of visual information from the top right hand corner of your retina.

Starting point is 00:19:55 Another one might be focusing on specific frequencies in the auditory range. You know, each one has its own little view of the world, and it's building its own little world model, and then they all collaborate together. There's no top or bottom here. There's no homunculus in the brain. Everything is sort of equal, and they're all simultaneously collaborating and voting and coming up to, you know, what is the, you know, consistent interpretation of all of these sensory inputs that we're getting? What is the single consistent concept, if you will?
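One way to picture many columns voting toward a single consistent interpretation is to let each column report its own belief over candidate objects and then combine those beliefs. The sketch below is an assumption made for illustration: it multiplies the beliefs together (a simple product-of-experts vote), which is not claimed to be the mechanism of the Thousand Brains Theory. The objects, the numbers, and the `vote` helper are all hypothetical.

```python
import math

OBJECTS = ["cup", "bowl", "ball"]

def vote(column_beliefs):
    # Sum log-beliefs across columns, then renormalize to a distribution.
    log_scores = {obj: 0.0 for obj in OBJECTS}
    for belief in column_beliefs:
        for obj in OBJECTS:
            log_scores[obj] += math.log(belief[obj])
    m = max(log_scores.values())
    unnorm = {obj: math.exp(s - m) for obj, s in log_scores.items()}
    total = sum(unnorm.values())
    return {obj: v / total for obj, v in unnorm.items()}

# Each column sees only a small patch of the world, so each is uncertain.
column_beliefs = [
    {"cup": 0.4, "bowl": 0.4, "ball": 0.2},   # touch column: feels a rim
    {"cup": 0.5, "bowl": 0.2, "ball": 0.3},   # touch column: feels a handle
    {"cup": 0.4, "bowl": 0.3, "ball": 0.3},   # vision column: sees a small patch
]

consensus = vote(column_beliefs)
best = max(consensus, key=consensus.get)
```

No single column is confident here, yet the combined vote favors one object. There is no "top" column in this sketch; every belief enters the vote on equal terms.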

Starting point is 00:20:33 And based on that, make the motor actions that are most relevant to that. So it's a sensory motor loop. It's a constantly recurring system. We're constantly making predictions. As we discussed earlier, we are constantly learning. Every cortical column is constantly updating its connections, constantly updating its weights. It's building and incrementally improving its world model constantly. So it's a massively distributed, you know, set of processing elements

Starting point is 00:21:05 that we call cortical columns that are all equal, operating in parallel. So I think there are similarities for sure between them, but at least the way I described it, I think it's very different in its operation than what I understand of today's LLMs. I don't know if you agree with that or not. Yeah, to better understand, I have another question, which is, are these cortical columns relying on the fact that these are essentially multiple views of the same process and those multiple views like the, you know, the part of the sensory input that gets allocated or subdivided, is it happening at the same time point? So in other words, if you could artificially delay by some time T some cortical columns with respect to the rest, would the learning suffer? So in other words, how important is it that it's kind of on the same schedule? Yeah, I mean, that's another, I mean, LLMs today, you know, you get your input, one layer processes it, then the next, then the next, and the other layers are not operating. In the brain,

Starting point is 00:22:06 it's not like that. Everything is working, operating in parallel asynchronously. And this is important. They're constantly trying to make predictions and so on. So if you were to artificially slow down some of your cortical columns, you would absolutely suffer. Your thinking would absolutely suffer. I wanted to interject here just because this is where this discussion is where, you know, I got super

Starting point is 00:22:29 interested in the difference and then spent a bunch of time with Subutai to learn from him. So in the, if I think about my skin, you know, which is an organ, you know, there's, as I understand it, there's a cortical column attached to a patch, each patch of my skin. And the size of that patch kind of corresponds to the nerve density there. So you can think, so in my brain, there is a set of cortical columns that are skin sensors. And I could actually, if I numbered all the cortical columns in the brain, I could draw a map on my skin and say,

Starting point is 00:23:06 this is number 72 in this patch, this is number 73 in this patch. Now, are human cortical columns, like better than say what we see in a mouse? And of course, this is a leading question because I know the answer. Yeah, so, yes, you know, cortical columns in your sensory areas, primary sensory areas, each, you know, pay attention to or get input from, you know, some patch of your skin somewhere on your body. And there's many more cortical columns associated with your fingertips than there are for, you know, a square centimeter of your back, for example. So there's definitely, you know, areas of sensory information that we pay a lot more attention to and devote a lot more physical resources to. In terms of a mouse and humans,

Starting point is 00:23:53 it's pretty remarkable that the cortical columns, so all mammals have cortical columns. All mammals have a neocortex. All mammals have cortical columns from a mouse all the way up to humans. And mice have cortical columns that are very, very similar to what a human has. It's not identical.

Starting point is 00:24:10 There are differences, but by and large, the architecture of a cortical column in a mouse is very, very similar to cortical columns in humans. Human cortical columns are bigger, there are more neurons and there's more detail there, but essentially it's the same.

Starting point is 00:24:27 Yeah, maybe just scaled up a little bit. Yeah, so evolution basically discovered this structure that's really excellent for processing information and dealing with it, and then, very fast in evolutionary time, basically figured out that if you could scale up the number of cortical columns, you get more intelligent

Starting point is 00:24:47 animals, and that's what happened very, very fast evolutionarily. I didn't know about the unevenness of cortical columns present. Like, I'm not a neuroscientist. And so this is interesting because one of the biggest frustrations with many modern architectures of models is that they deploy a constant amount of computation no matter what the input is. So I go through the same number of layers whether I'm trying to predict the word "dog" after "the", or whether I'm trying

Starting point is 00:25:22 to solve, like, give the final answer to a very complicated math question, or, you know, whether a theorem was proven or not in the prompt. And so that's, that's interesting because, like, some current incarnations of modern architectures actually try to cluster things together such that you have a constant amount of information that you then push together through the model. And so maybe, like, on my fingertips, I need more processing than I need on my elbow, and so this kind of makes sense. Niccolo is being humble. He was working on this problem two years ago and told me about it.
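The idea of clustering inputs so that each unit pushed through the model carries roughly the same amount of information can be sketched as a greedy grouping over per-token surprisal. This is an illustration under assumptions: the surprisal values are invented, the greedy rule is one simple choice among many, and `group_by_information` is a hypothetical name rather than a method from any specific architecture.

```python
def group_by_information(surprisals, budget):
    # Greedily pack consecutive tokens into groups whose total surprisal
    # stays within the budget, so predictable tokens share one unit of
    # computation while surprising tokens get a group of their own.
    groups, current, total = [], [], 0.0
    for i, s in enumerate(surprisals):
        if current and total + s > budget:
            groups.append(current)
            current, total = [], 0.0
        current.append(i)
        total += s
    if current:
        groups.append(current)
    return groups

# Hypothetical surprisal (in bits) for eight consecutive tokens:
# low values are easy to predict, high values are hard.
surprisals = [0.2, 0.1, 0.3, 4.0, 0.2, 0.2, 3.5, 0.5]
groups = group_by_information(surprisals, budget=4.0)
```

With these numbers, the three easy tokens at the start share a single group while the hard token at index 3 gets a group to itself, so computation per token is no longer constant.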

Starting point is 00:25:58 And it was one of the things I learned from you that made me think differently. I just like to refer to people who are here. Yes, yes. Random average people who are not all necessarily brilliant AI scientists. So the prediction part of this, though, is really what's fascinating to me, because, again, something else Subutai and I discussed many years ago, you know, if I'm, if I'm, like, moving my finger towards the table, and my brain is making predictions because I have a world model,

Starting point is 00:26:26 it knows a table is there, and the cortical columns representing that patch of skin, as it's getting closer, they're starting to predict that I'm going to feel something that feels like the table. And, oh, there I hit it, prediction met. But if I touched it and it felt really icy, cold or super hot or fluffy, or not there, I passed through it, I'd get a flurry of activity because the prediction wouldn't match the world model, and that's where learning would happen.
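The loop of predict, compare, and learn from the mismatch can be written as a simple error-driven update. This is a toy sketch under assumptions: the temperatures and the learning rate of 0.5 are invented, and a single scalar prediction stands in for what the conversation describes as thousands of simultaneous predictions.

```python
def update(prediction, observation, learning_rate=0.5):
    # Move the prediction part of the way toward what was actually sensed.
    error = observation - prediction
    return prediction + learning_rate * error, error

prediction = 20.0    # expected table temperature in degrees C (hypothetical)
observation = 0.0    # the table turns out to be icy cold

errors = []
for _ in range(4):   # touch the table four times
    prediction, error = update(prediction, observation)
    errors.append(abs(error))
```

The surprise shrinks with each touch (20, 10, 5, 2.5 in this example), echoing the basement-stairs story: a large mismatch the first time, barely noticeable by the fourth.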

Starting point is 00:26:54 Subutai, does that sound like the right model and intuition? Yeah, that's definitely a very important component of it. We're constantly making predictions, and as you said, you know, you're moving your right hand, right fingertip down. You've, you know, perhaps you've never sat in this room before or, you know, seen this table before. You would still have a prediction, a very good prediction of that. Because you know what a table is. You know what a table is.

Starting point is 00:27:18 And if it was different, you would notice it right away. But if your left hand, which you weren't paying attention to, also felt icy cold. And then you would notice that as well. So you're actually making not just one prediction. You're making thousands and thousands of predictions constantly about where you are. Every cortical column. Every cortical column is making predictions. and if something were anomalous, highly anomalous, you would notice it.

Starting point is 00:27:45 So this is something we don't often realize. We're making very, very granular predictions constantly. And when things are wrong, we do learn from it. And the other interesting thing, and this is, again, possibly different from how LLMs work. You know, if I were to tell you to touch the bottom part of the bottom surface of the table, you could, without, again, without looking at the table or opening your eyes, you would be able to move your finger in and touch the bottom of your table because you have a set of reference frames that relate to, there you go,

Starting point is 00:28:24 yep, you're able to do it. Amazing. Even though you've maybe never been in this room, maybe never seen this table before. It doesn't matter. I've been in this room because we had to prep for the podcast series. But I didn't touch the underside of the table, that's for sure. Well, yeah, exactly. Exactly. So, you know, we know where things are in relation to each other, where our body is in relation to everything, and we can very, very rapidly learn. And again, if the bottom part of the table was anomalous, you would know, you would notice it and potentially remember that.

Starting point is 00:28:52 I was expecting you to find something under that table like a talk show. If you reach under the table, you're going to find a copy of my paper. You know, if I was, if I was smarter and better prepared, that's exactly what would have happened. Sorry, guys. I think you told me something, Subutai, you know, that, and I'll give a little bit of preamble. So, you know, the brain has these dendritic networks in each neuron, and they form synapses. And so a neuron fires and that, you know, the axon of the neuron that's firing will propagate a signal through the synapses, which might do a little signal processing, to the dendrites of the downstream neurons.

Starting point is 00:29:36 And those downstream, the dendrites can then prime the neuron to fire. That's one of the fundamental mechanisms. And it's the formation of those synapses, you know, between the upstream and the downstream neurons, the dendrites, that seem to be the basis of learning. And to me, that feels a little bit like an attention map. So maybe the dendritic network is doing something akin to self-attention. We have some work going on in that direction in MSR.

Starting point is 00:30:06 But the question, the thing you told me was that your brain is actually forming an incredibly large number of synapses speculatively, in some sense, sampling the world, when something happens, in case it will recur. You know, it's a more, maybe it's a version of Hebbian learning, right? You know, things that fire together, wire together. Exactly. But then if that pattern doesn't recur, then they get pruned. And I'm just going to, you know, what is the fraction of your synapses that get turned over every three or four days, you know, ballpark?

Starting point is 00:30:41 Okay. Yeah, I remember this. This is an absolutely mind-blowing study in neuroscience. And so, you know, the way a lot of learning happens in the brain is by adding and dropping connections. In AI models, it's usually strengthening, you know, a high-precision floating point number, making it higher or lower, but you're not adding and dropping connections. The connections are always, in fact, everything is fully connected, right, between layers. And so in the brain, you're always adding and dropping connections.

Starting point is 00:31:12 That's a fundamental mechanism by which we learn. And one of the fundamental mechanisms, and this is what I read in a study, is that when they looked at adult mice and adult animals, they would look at the number of synapses that were connected over the course of a couple of months. They were able to trace individual synapses in this particular part of the brain. And what they found is that every four days, 30% of the synapses that were there were no longer there four days later. And there was a new 30%.

Starting point is 00:31:49 And there's a huge number of connections that are constantly being added and constantly being pruned. And my theory of what's going on there is that we're always speculatively trying to learn things. So, you know, there's all sorts of random coincidences and things that we are exposed to on a day-to-day basis. We're constantly forming connections there

Starting point is 00:32:13 because we don't know what's actually going to be required and what's real and what's random. Most of it's random. Most of it's not necessary. And the stuff that actually is necessary will stay on. But we're constantly trying to learn. This is a part of continuous learning that's often not appreciated, I think,

Starting point is 00:32:31 is that we're constantly forming new connections, and then we prune the stuff that we don't need. In an AI model, if you were to do that, it would just go, I don't know, bananas. Well, so let's double-click on that. So when you told me that, by the way, this is mind-blowing, that 30%. It's crazy.

Starting point is 00:32:49 Your brain is going to be totally different a few days from now. It's so mind-blowing. And when you told me that, I spent some time processing it. So a whole bunch of synapses were created and destroyed during that time. But it just made me think that we have, you know, we have all of these columns getting all of this input continuously. You know, eyes, hearing, smell, taste, skin, heat, and then, you know, interactions with people, and then planning and experiences just at every level. and they're constantly sampling all this noise coming in and basically filtering out the noise.

Starting point is 00:33:31 It's kind of like a low-pass filter. But when something statistically significant recurs, it's going to lock and then become persistent. Yeah, yeah, I think so. There's so much that's happening, constantly learning, and when you touch a hot stove or something, there's a flood of dopamine specific to those areas that causes these synapses to strengthen very, very quickly.

Starting point is 00:33:58 You know, most of these synapses that are learned are very, very weak synapses. And so, you know, in this study, they also quantified the turnover in kind of strong synapses versus weak synapses. And it's comforting to know that the strong synapses stay there. It's really these weak synapses that are constantly added and dropped. And then some of them will become strong. Now, I want to go back, return to Niccolo, but with an observation. So when I'm training a transformer, it's also a prediction-based system.

Starting point is 00:34:32 You know, I'm running, I have my input in the training set. I have my masked token or the next token I'm trying to predict. I run it through. I look at how successfully it made that prediction. And the worse it was, the steeper the error, you know, that I drive back through the network. So, you know, if it's spot on, I don't learn very much. But if the prediction is way off, I've got to change a bunch of stuff.

Starting point is 00:35:00 That sounds analogous to what Subutai was just describing with the cortical columns. No, that's right. I mean, I don't know, this is one big pet peeve of mine in pre-training, in particular around pre-training these language models. So, again, for context, language models in particular, but many other instantiations of large models, are trained in a few phases, usually. One of them is pre-training, where you have some ground truth text and you remove, let's say, just the last word, and then you ask the model to predict the last word. And that's when you get

Starting point is 00:35:37 that loss. Do you get the word right? Do you get the word wrong? One of the big problems that I have is that, you know, in human experience, we do not get feedback on every single thought. But language models, the way we're training them, at least in pre-training, is that they do a thing called teacher forcing. So they guess the word,

Starting point is 00:35:59 then they immediately get the signal, and then the right word gets filled in, and then they predict the next one. So when you go through, like, a passage of text, you constantly get this reward, and it's such a bizarre way to train a model. It's necessary because you want a lot of flow of supervision. Like, you want a lot of supervision

Starting point is 00:36:18 to essentially use all the computation available. But at the same time, it actually makes the models, arguably, a little bit worse than what they would be if you had enough compute to train them without this. I went on a tangent just because it's a pet peeve. It's a really important point, though, because your goal when you're training a model is to get to your loss target with the minimal cost and time, or, of course, like, fixed budget and, like, lowest loss target. But, you know, biological systems also, their goal is survival with minimum energy. And so, like, once you've built a world model that works, right, like touching the table,

Starting point is 00:36:55 touching the underside of the table, nope, still nothing exciting there. Like, it takes very little energy to do that. And I think a tragedy is that we have these, we all have these supercomputers in our heads. You know, the neocortex is what, about 10 watts? And it's this amazing thing, right? It can compose symphonies. But once we have a world model, a lot of us just stop learning. Because it's comfortable, right?

Starting point is 00:37:18 You don't have to perturb the state. You can go through and, you know, I mean, how many of us go through every day and all of our predictions succeed? There's no surprises, you know, so all the new synapses get swept away, right? That's not a goal of pre-training because then you're just wasting energy, but we're trying to minimize energy consumption. So it does feel kind of aligned to me in some sense. So I've got a straw man I want to hit you with, but before we do, Niccolo, I want you to talk about your view on compression, like LLMs as compressors, because I know this is something you're very passionate about and opinionated about. And I've learned a lot from you on this too. So,

Starting point is 00:37:57 and then Subutai, after this, I'd like to hear your biological response. I mean, your response from a biological perspective. And yeah, that's right, of course. And then I want to try, like, I want to throw out this hybrid straw man. So, Niccolo, tell us about compression. The view is that, basically, generative models are compressors, in an information-theoretic sense. And so trying to come up with a better generative model is equivalent to trying to find the best compressor for some data. When you say compressor, do you mean lossless or lossy? I mean lossless. You can basically look at, literally, my much-maligned objective function that you use for pre-training,

Starting point is 00:38:46 which is next token prediction. And you can basically draw a complete parallel to what you would do if you were trying to come up with, you know, try to do compression, which is coming up with the shortest possible code for something that you're trying to compress. And so the two things are the same, and it kind of fits into a broader picture

Starting point is 00:39:09 that, you know, like goes back to Occam's Razor and Kolmogorov complexity and so on, the principle of induction, which is you want short descriptions for likely things that happen in the world. And you want your algorithm that produces the short descriptions to be also short. That's the minimum description length principle. And I do feel like it fits in kind of also what you were saying about the concept of, you have a good world model, why look for surprise? Because it simultaneously affects both terms, both the algorithm, like your own world model, but also the loss that you incur when something unexpected happens. And so if I'm an agent in the world trying to

Starting point is 00:39:54 minimize the minimum description length of the world, I'd like to go and seek some in-distribution data such that I don't bump up my surprise term too much. Right. And you said, and I think you said at some point that, you know, when I'm training a model, even though you have the same loss point, you know, between Model A and Model B, if I have a steeper loss curve in Model A than Model B, you know, it's getting to a better sort of compressed base vocabulary faster, which makes it more general. The shape of that curve matters from a compression perspective. Yeah, I mean, I think it would help here to expand on what I was talking about in terms of the minimum description length principle. The minimum description length principle is basically

Starting point is 00:40:37 the loss of the model you're training, that's one component. So it's a sum over the mistakes you make at predicting, you know, the mistake you make at predicting each word. And that's one term. And the other term is how long it takes you, in code, to describe the model and the training procedure to get to that training curve, to produce that training curve. So, yes, if you look at it collectively, one term is kind of fixed. It's a cost, it's the amount of code it will take you to write out a language model, for instance, in code, literally implemented, not the weights, just implement the initialization of it.

Starting point is 00:41:17 And then the training loop. And then on the other side, you have this training loss that gets generated as you start observing data. And of course, because it's a sum, you want to minimize really the area. Like, you want to minimize the sum. And so like a flatter curve is much better

Starting point is 00:41:32 than like a steeper curve, even if it ends up at the end to be slightly better. Yeah, concave is better than convex. Among other things. Sorry. So, you know, I think that we could do a whole episode on this compression view because it's really fascinating.
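To make the compression view above concrete, here is a minimal sketch of the "sum over the mistakes" term. It relies on the standard information-theoretic fact that a token assigned probability p can be losslessly encoded in about -log2(p) bits, so the total code length of a text is the sum of the per-token losses, that is, the area under the loss curve (often called prequential or online coding). The two probability sequences are made up for illustration and do not come from any real training run; the fixed cost of describing the model and training loop is left out because it is the same for both.

```python
import math


def code_length_bits(probs_of_true_tokens):
    """Total lossless code length in bits when each observed token is encoded
    with -log2(p), where p is the probability the model gave that token."""
    return sum(-math.log2(p) for p in probs_of_true_tokens)


# Hypothetical probabilities assigned to the correct next token as training proceeds.
# Both runs end at the same probability (same final loss), but run A improves sooner.
run_a = [0.1, 0.4, 0.6, 0.7, 0.8, 0.8]
run_b = [0.1, 0.1, 0.2, 0.3, 0.5, 0.8]

bits_a = code_length_bits(run_a)
bits_b = code_length_bits(run_b)

print(f"run A total: {bits_a:.2f} bits")
print(f"run B total: {bits_b:.2f} bits")
print("run A compresses the stream better:", bits_a < bits_b)
```

Two runs with the same final loss can therefore have different total description lengths, which is why the shape of the curve matters and not only its end point.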

Starting point is 00:41:55 And the lossless part of it is what blew my mind. And I think, you know, I'm guessing there are multiple camps here and you're squarely in one camp. So I'm guessing we'll get a bunch of feedback from the other camps. So Subutai, you know, can I think of cortical columns as compressors? Yeah, so it's a good question. You know, there's so much in the compression literature that you can draw insight from.

Starting point is 00:42:25 You know, if you look at the representations in cortical columns and that populations of neurons have, you know, some of the things you have to deal with are that the brain doesn't have a huge nuclear power plant attached to it. you know, we only have 12 watts or so to process everything we want to do. And what the representations that evolution has discovered are incredibly sparse. And what that means is that you may have thousands and thousands of neurons in a layer, but only about 1% of them will actually be active at a time. And so it's a very small subset of neurons that are actually active. I don't know about this minimum description length,

Starting point is 00:43:10 whether that applies. I can say a couple of things about that. There's, you know, by and large, the representation are very sparse when you're predicting well. When you see a surprise, there's a burst of activity. When there's something that's unusual,

Starting point is 00:43:25 there's a lot more neurons that fire. That's why learning is tiring. That's what learning. Exactly. No, no, that's right. That's right. And so what we think is happening is that, you know, the actual representation of something

Starting point is 00:43:38 is a very small number of neurons. when you're surprised, there may be many things that are consistent with that surprise. And so your brain represents a union of all of those things at once. And when you have a very sparse representation, you can actually have a union of many, many different things without getting confused. So that's what we think is going on there. So it is a very compressed, very efficient representation. And because it's such a small percentage of neurons that are firing, we are very, very parsimonious in how we represent things.

Starting point is 00:44:09 and extremely energy-efficient metabolically. I wanted to get to the efficiency point, but before I do, you talk about this one to two percent of the neurons firing, but the brain is actually much sparser than that at a fine grain. Right? Because you have 1% of the neurons firing, but they aren't connected to all the other neurons in the region. So really the sparsity should be the product of the connectivity fraction,

Starting point is 00:44:39 times the activity factor. Yeah, yeah. And that's about one out of 10,000, something like that. Exactly, yeah. So something like maybe 1% of the neurons are firing at any point in time and maybe 1% of the connections that are possible are actually there at any point in time.
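The arithmetic behind the "one out of 10,000" figure above can be sketched as follows. The 1% activity and 1% connectivity fractions are the ballpark numbers from the conversation; the layer size of 10,000 units is a made-up value used only to show the scale.

```python
def effective_sparsity(active_fraction: float, connected_fraction: float) -> float:
    """Fraction of all possible connections that exist and are driven by an
    active neuron at a given moment (product of the two fractions)."""
    return active_fraction * connected_fraction


# Ballpark figures from the discussion: ~1% of neurons active, ~1% of connections present.
fraction = effective_sparsity(0.01, 0.01)

# Hypothetical pair of layers with 10,000 units each, for scale.
units = 10_000
dense_connections = units * units
in_use = dense_connections * fraction

print(f"effective fraction: about 1 in {round(1 / fraction):,}")
print(f"dense layer pair: {dense_connections:,} possible connections")
print(f"in use at one moment: about {round(in_use):,}")
```

A fully connected layer pair would touch every one of those connections on every step, which is the contrast being drawn with the small active sub-network in the brain.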

Starting point is 00:44:59 So it's a very, very small sub-network through this massive network that's actually being activated. A tiny percent of neurons going through a very, very tiny piece of the full network. You know, it's common to, you know, some people say, oh, we're only using 1% of our brain. That's not true. It just means at any point in time, you're only using 1%.

Starting point is 00:45:20 But at other points in time, a different 1% is being used. So, you know, the activity does move around quite a bit. But any point in time, it's extremely small. So, okay, the sparsity, I think, you know, the representation, how the brain is doing this compression biologically is super fascinating. And I want to go on a little bit of a detour now to efficiency. So I remember in 2017 when in MSR we were building, you know, hardware acceleration for RNNs and then the transformer hit, and they were optimized, you know, to be highly parallelizable across this

Starting point is 00:45:55 quadratic attention map for GPUs. The way I would describe it is that that transition to semi-supervised training, moved us from an era where we were really data limited. Like you had to have good high-quality labeled data to you were compute-limited. And when that transition happened, we hockey-sticked from, I'm building faster machines,

Starting point is 00:46:20 but I'm limited by data, to: the bigger the machine I can build, as long as I have enough unlabeled data of high quality, the better I can do with a model. And so we went on the supercomputing arms race, and now we're building these, like, just gargantuan machines. And really, we've kind of been brute-forcing it. I mean, we've done a lot of things to optimize, like quantization, you know, and better process nodes, you know, more efficient tensor unit design. But to first order, we've been training bigger models by building bigger systems. And I just wonder, do you think that the brain at this 10 to 12 watts in the neocortex just has a fundamentally more efficient learning mechanism?

Starting point is 00:47:12 Or do we think that, you know, what we're doing in transformers in the most advanced silicon is as efficient, we're just building much larger and more capable models? Oh, I think without a doubt transformers are extremely inefficient and very, very brute force. We touched on this a little bit earlier in the attention mechanism where we're, you know, transformers are essentially comparing every token to every other token. I mean, there are architectures which reduce that for sure, but it's essentially an N-squared operation. And we're doing this at every layer. I mean, there's nothing like that in the brain. Our processing, you know, in some sense, the context for the very next word I'm about to say is my entire life. Right. It's, and the amount of time I take to take the next word doesn't depend on the length of that context at all. It's a constant time dependence on context. So it's a significant, you know, reduction in the compute that's required. You can kind of think about like the brain, I think has somewhere around maybe 70 trillion synapses. When I say the brain, I mean, the neocortex has about 70 trillion synapses and it's using only 12 watts.

Starting point is 00:48:25 And the synapse is roughly equivalent to a parameter. And if you were to take the most efficient GPUs today and try to run a 70 trillion parameter model, it would be something like a megawatt of power. It's orders of magnitude more inefficient than what our brain is doing. So I absolutely believe that. The metric I used, to go back to your point, you know, is

Starting point is 00:48:52 this is something, I think we talked about this back in the day, right? When, you know, after this kicked off for a few years, we were trying to project, like, how far would this go under the current model to inform the research and the directions you took, which is why I got so interested in sparsity and working with you. And we would look at a training run and just say, how many joules did it take to train the whole model? How many parameters do we have? And sort of what's our parameters per joule? And by that metric, you know, we were off by many orders of magnitude from where the brain is. But I don't know that that's the right metric. Any thoughts on that?
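The power comparison above can be worked through with the figures quoted in the conversation: roughly 70 trillion synapses at about 12 watts for the neocortex, versus "something like a megawatt" to run a 70-trillion-parameter model on GPUs. These are the speakers' ballpark numbers taken as-is, with one synapse treated as one parameter, as suggested in the discussion. Note that this compares power draw while running, which is a different quantity from the parameters-per-joule training metric mentioned just above.

```python
import math

# Ballpark figures as quoted in the conversation (not independently verified).
SYNAPSES = 70e12        # neocortex synapses, treated as one parameter each
BRAIN_WATTS = 12.0      # approximate neocortex power budget
GPU_WATTS = 1e6         # "something like a megawatt" for a model of the same size

power_ratio = GPU_WATTS / BRAIN_WATTS
orders_of_magnitude = math.log10(power_ratio)

print(f"brain: {SYNAPSES / BRAIN_WATTS:.2e} parameters per watt")
print(f"GPUs:  {SYNAPSES / GPU_WATTS:.2e} parameters per watt")
print(f"power ratio: about {power_ratio:,.0f}x ({orders_of_magnitude:.1f} orders of magnitude)")
```

On these numbers the gap is between four and five orders of magnitude, consistent with the "orders of magnitude" claim in the discussion.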

Starting point is 00:49:27 Yeah, I mean, in some ways, you know, Transformers, you know, embody more knowledge in them than any human has. Right. It's sort of, it has memorized, you know, the entire internet's worth of knowledge. All scientific papers. All scientific papers, you know, good and bad, whatever, you know, it's memorized everything. So that's something that, you know, humans just cannot do. So there's definitely stuff that that's better in Transformers than humans.

Starting point is 00:49:54 but fundamentally, I think, you know, we're extremely efficient in how we process the next token or the next bit of information that's coming in. And that's something, I think there's a lot we can learn from the brain and apply it to LLMs and future AI models there. I was going to ask a question related to that, because forget memorizing the Internet, but let me give you another example that Transformers do really well. And I'm wondering, like, you know, the human aspect of this and the brain aspect of this, because Transformers, because of the N-square computation,

Starting point is 00:50:26 they're really good at stuff like needle in a haystack. So I can tell you right now, I can speak, I can talk to you, and I can tell you the password is something silly, like podcast microphone, blue, whatever. That's a password. And then I can proceed and read the entire Odyssey or a bunch of other books to you out loud for the next five or six hours. And then I can ask the Transformer what was the password,

Starting point is 00:50:48 and Transformer will do this nice N-square computation many times and will spit out the password. A human, you know, there will be a decay of that password, and then at some point they won't remember. And depending on the human, it may be the first chapter of the Odyssey or like at the end. So fundamentally, the type of computation that is done is very different. So it always makes me wonder about the efficiency, because it's just like it's a different type of computation.
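The needle-in-a-haystack behavior described above can be illustrated with a toy single-query attention step. This is only a sketch under strong assumptions: the two-dimensional keys, the scale factor, and the context size are invented for illustration, whereas real transformers use learned projections, many heads, and many layers. The point is that the query scores every position in the context, so recall does not decay with distance, at the cost of N scores per query and N squared for all queries.

```python
import math


def attend(query, keys, values, scale=20.0):
    """Single-query softmax attention: score every key against the query, then
    return the softmax-weighted average of the values and the number of scores."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, len(scores)


# Hypothetical context: one "password" token first, then thousands of filler tokens.
n_filler = 5000
needle_key, filler_key = [1.0, 0.0], [0.0, 1.0]
keys = [needle_key] + [filler_key] * n_filler
values = [[1.0]] + [[0.0]] * n_filler  # value 1.0 marks the needle

output, n_scores = attend(needle_key, keys, values)

print(f"weight recovered from the needle: {output[0]:.5f}")
print(f"scores computed for one query: {n_scores:,}")
print(f"scores for every token attending to every token: {n_scores ** 2:,}")
```

The needle is recovered almost perfectly regardless of how much filler follows it, but the number of comparisons grows quadratically with context length.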

Starting point is 00:51:14 So the efficiency is kind of like, what are you doing divided by how good are you at doing it. And so when the things we're doing are so incompatible in many ways, that always troubles me a little bit. I don't know. I don't know if there is a question. Yeah, I mean, Transformers can do stuff that humans find very, very difficult to do, absolutely. You know, maybe there's a way to get the best of both.

Starting point is 00:51:43 I don't know. You know, I don't know that it's fundamentally necessary to have such brute-force computation to get all of these features. That's right. Yeah. Yeah, it is a weird thing because, you know, this is why memory palaces work so well. Like, there is a way, though, for a human to remember that my microphone is gray. It's not actually blue, Niccolo.

Starting point is 00:52:05 Mine is blue. You don't see it. It's off camera. It's on camera. Yeah, I know. I was just teasing you. But there's a way, like, if I can just connect it to enough things, get that connectivity graph, then I'll remember it because it's captured the signal out of the noise.

Starting point is 00:52:20 and it's connected to enough things that I can retrieve it. And retrieval will be a whole other topic we don't have time to get into today. But now I want to go to the straw man. So let's take continual learning off the table. Let's imagine that as I go through my day, I'm just saving all of the sensory data to put it in my training set.

Starting point is 00:52:42 And now imagine that I take 100,000 little transformer blocks and I'm training them each with what they're seeing. Okay, I replay the day. So, again, I don't have to worry about continuous learning. And whatever cross-cortical-column, you know, routing of features from the outputs to the inputs, and Subutai, we've talked about this, there's a complex set of wiring there

Starting point is 00:53:07 to bring features from here to there that gets learned. If I replicated that, could a transformer block kind of do what the cortical columns are doing? It's like, could I just instrument all my sensory patches with little transformer blocks and then wire them up in the right way and have it work? I think there's still a couple of things we need. One is that cortical columns are fundamentally sensory motor,

Starting point is 00:53:35 and so they're actually each one, each cortical column is initiating actions as well. So you cannot have a static data set, fundamentally, ahead of time. It's always dynamic because we're constantly making movements to get the next bit of data. And so could I tokenize that though? I mean, you could tokenize the input and you can tokenize the output, but you know, if you were to play the same set of inputs back again to a network, a cortical column that's randomly wired differently, it may make a different set of actions.

Starting point is 00:54:11 And so as soon as it makes the first action that's different, that data set is no longer valid. Right. You know, fundamentally, you have to have a simulation of an environment rather than a static one-way data set, if that makes sense. So I think that's one piece that's, I think, missing in transformers today, this sort of sensory-motor loop. And then the other piece we talked about is continuous learning.
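The point above about static data sets becoming invalid can be sketched with a toy environment. Everything here is hypothetical: a one-dimensional world where the next observation depends on the action taken, and two hand-written policies standing in for two differently wired learners. A log recorded under one policy stops describing what the other would experience from the first step where their actions differ.

```python
def step(position: int, action: int) -> int:
    """Hypothetical 1-D world: the next observation depends on the action taken."""
    return position + action


def rollout(policy, start: int, steps: int):
    """Run a policy in the environment and log (observation, action) pairs."""
    log, position = [], start
    for _ in range(steps):
        action = policy(position)
        log.append((position, action))
        position = step(position, action)
    return log


def first_divergence(log, other_policy):
    """Index of the first logged step where another policy would act differently.
    From that step on, the logged observations are not what it would have seen."""
    for index, (observation, logged_action) in enumerate(log):
        if other_policy(observation) != logged_action:
            return index
    return None


def always_right(observation: int) -> int:
    return 1


def turn_back_at_three(observation: int) -> int:
    return 1 if observation < 3 else -1


log = rollout(always_right, start=0, steps=6)
diverges_at = first_divergence(log, turn_back_at_three)

print("logged (observation, action) pairs:", log)
print("replay stops being valid at step:", diverges_at)
```

This is why a learner that chooses its own movements needs an environment, or a simulation of one, rather than a recording.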

Starting point is 00:54:42 I guess you said, take it off the table, but that is a fundamental difference. Yeah, yeah. And maybe one other difference. We talked much earlier about a single latent space and the prediction that's being made at the top of the transformer, where you compute the loss function and that's back-propagated through the transformer.

Starting point is 00:55:00 That's not how neurons learn. Neurons are making, every neuron is actually making predictions. Every neuron is getting its input and it's learning independent of anything that happens at the top. And so it's a much more granular learning signal. And information does flow from the top to bottom, but there's also many, many other sources of information that it's learning from.

Starting point is 00:55:23 So it's different in that sense as well mechanistically. The reason I ask, and now I'd like to get into some of the fun speculation, because it's been a phenomenal discussion with the two of you. I think we've kind of elucidated the differences. Something I've wondered after I've talked to both of you, and, you know, Niccolo, kind of learning about this compression view of the world, lossless compression, and Subutai, just the, you know, the thousand brains theory in these

Starting point is 00:55:51 cortical columns and the sampling of, you know, the world to capture the signal that you can learn from. So let's say that I was able to design a really small, efficient digital cortical column. Maybe it's transformer-based with some, you know, a sparse representation and some sensory-motor mechanism built in, maybe it's more dendritic-based, you know, mapped into digital hardware. And I put a cortical column on every sensor I have in the world, associate them with every person, and wire them up together with some of this, and then have billions of them that can form higher-level abstractions. Like, what do you think would happen? What could we do?

Starting point is 00:56:39 That's a fantastic thought exercise. I think, you know, again, assuming the cortical column is faithful and can generate, you know, or suggest motor actions as well, I mean, in some sense you could potentially have a superintelligent system, right, that's far more intelligent than anything else on the planet. Now we're scaling the number of cortical columns, you know, not from a mouse to the 100,000 columns that a human might have, but potentially to billions of cortical columns and way more. And there's no reason to think there's any fundamental limit there. This sort of a system is, I think, the way that superintelligent systems

Starting point is 00:57:27 will eventually be built. But this is a very different direction than the one we're currently headed down with these monolithic models where we're doing tons of RL, you know, to capture, you know, to get high value human collaboration in distribution. Yes, it's completely different than the direction where we're proceeding. So I think that, you know, to go down that path, there needs to be a fundamental rethinking of some of our assumptions, potentially even down to the hardware architectures that are necessary to implement it, the fundamental learning algorithms, the fundamental training paradigm.

Starting point is 00:58:07 We talked about, you know, you can't have a static data set. You're constantly moving around in the world and doing things. So it's a very, very different way of going about AI than what we're doing today. Sounds like a great time to be an AI researcher. Absolutely. Niccolo, what was your reaction to that hypothesis? It sounds super interesting. I mean, my brain was churning, you know, my background is very different.

Starting point is 00:58:37 So I'm in a much worse position to answer this question, but I was starting to think, okay, so let's say I do this. What would be my loss function? What, you know, how would information flow through this system? Like, it sounds like cortical columns would each have their own loss, that then I would aggregate, and then I would add a contribution that is, like, higher level. And then back to my question, you know, like, how is the temporal information coordinated? Because one way to see this is that, you know, the way I'm coming to understand this is that it's kind of like a multi-view framework.

Starting point is 00:59:12 You have the same phenomenon represented through multiple independent views at the same time. And so part of it is, like, it feels like you need to tie together these cortical columns in such a way that they all get that gradient feedback, if you're training with gradient-based methods, for instance. And so that's kind of, it feels super, super interesting. It is related, you know, very superficially, to a lot of the discussion in machine learning around, hey, is it better to have one giant super deep network? Is it better to have a bunch of shallow networks? But the difference is also in the way you train them, right? We typically train this bunch of shallow networks on kind of the same objective and the same

Starting point is 00:59:51 data and not typically in an experiential cycle, whereas it sounds like this is a different way to do it. Right, right. I think I want to pull this back around to the title of the podcast. And so I'll share an observation. So I've been using some of the latest models to code. You know, they're getting better really fast. I've been using them to kind of relearn some of the physics that I've never really understood deeply, you know, special and general relativity. Like E equals MC squared. Like, why is C in there at all, right? Just stuff like that, because now it can actually explain it to me and I can keep beating at it until I understand it and then, of course, work. And at some point I asked the model, can you describe

Starting point is 01:00:41 how I think? And I was just curious. And it, you know, it gave me a page description that my jaw dropped because I said, this thing knows me better than I know myself. I don't think any human being, including me could have captured kind of the way my approach to learning and my brain works. And I just read it, like, yep, that's right. And I learned something about myself. So I wouldn't say that it passed the Turing test, because this was way beyond Turing test. This was like, this thing knows me way better, you know, than I thought any machine ever could. I mean, I'm having a conversation with it. It could be human, but it's superhuman. So in some sense, it's like, intelligent beyond human capabilities with its ability to discern patterns in how someone's

Starting point is 01:01:34 interacting. And yet, it's a tool. You know, it's not conscious. It doesn't have agency, embodiment, emotion. It understands a lot of that stuff from the training data. But at the end of the day, it's a stochastic parrot, right? It's got, you know, it's got the weights and I give it a token and it outputs a token. So, like, are these machines intelligent or not? I'll let Subutai answer first. Okay. You know, it's definitely a savant, right?

Starting point is 01:02:07 It knows a huge amount about the world. It's absorbed a lot of stuff, and it can articulate that in ways that are just amazing. And, you know, it's taken your chat history with, you know, presumably thousands of chats and able to summarize that in a way, that that's remarkable. At the same time, I think, you know, transformers are not intelligent in the way that a three-year-old is. A three-year-old human is constant, is very curious,

Starting point is 01:02:38 is constantly learning. It can learn almost anything. And, you know, a three-year-old Einstein was able to learn and eventually come up with theories that shook the world, that, you know, E equals MC squared. And so, you know, could a transformer do that? I don't think so. And so I think there's still a difference. There's things that they can do that are amazing. But there's still basic things that a child can do that transformers cannot do. So I think there's still a gap there. Exactly how to articulate it and how to bridge that gap is, of course, the trillion-dollar question. But it is bridgeable, and there is a gap today. Right. Niccolo?

Starting point is 01:03:24 I think from my perspective, they are intelligent. They are intelligent. And from my perspective, I go back to the definition of intelligence, which is, like, can you achieve your objectives in a variety of environments? It's a very basic, fundamental one, but it's kind of, you know, it can be embodied, a form of embodied intelligence and agentic intelligence. If I plop you in an environment and I give you an objective, can you achieve it? And the wilder the environment, the harder the task is.

Starting point is 01:03:57 And I do think, I agree with Subutai. Like, there is a jaggedness of intelligence we keep describing. Like, these things can be simultaneously super good, you know, Olympiad-level mathematicians and still give you stupid answers when you're trying to, I don't know, figure out which cable goes where in your car's battery, you know, like whatever. Well, then it's better than me. I'm not an Olympiad-level mathematician and I do stupid stuff all the time. I know, exactly. Well, you know, whatever. That was a bad example, but you get it.

Starting point is 01:04:29 But part of me goes back to the compression view. I do believe that intelligence is compression. So the ability to come up with succinct explanations for complex phenomena and even succinct explanations for complex worlds, and then it implies or leads to your ability to operate within them. And the fact that we have these things that can prove crazy theorems, but at the same time fail at fairly rudimentary tasks, is a sign that, yes, transformers are great in terms of the inductive biases they put in the world and computation that are great, but we're ultimately all subject to the no-free-lunch theorem. You know, across the world, the set of tasks that you could be pursuing,

Starting point is 01:05:14 you know, you have certain inductive biases that kind of privilege certain tasks at the expense of others. And there isn't like a thing yet that has expanded our set of tasks that are addressable. And so I do think that it's a matter of rethinking our approach to a few things. I think likely both on the architecture front and on the losses and the way we train these systems front, I think there is an opportunity to expand the intelligence frontier

Starting point is 01:05:44 of these models. But yeah, from my perspective, they are intelligent already, just in a jagged way. It's such an interesting question. And I know a lot of people write a lot about this. So we're not, I don't think, treading any new ground here. But, you know, there's the diversity of the tasks you can excel at. You know, are you able to handle nuance and understand things deeply?

Starting point is 01:06:08 Are you able to learn continuously? Right now, the systems can't, right? Are you embodied? I don't know if that matters. Do you have an objective? Well, we could give them one. Are you conscious? Is that, I mean, that's a whole other thing.

Starting point is 01:06:21 So it just feels like there's a bunch of checkboxes, and we've checked a bunch of them, and a bunch of them are unchecked, and maybe there's no consensus on, like, where that threshold is. Because there are many dimensions of intelligence, some of which humans don't even have. And that's why we have the terms AGI and ASI. And people are debating the G and the S. What is general, what is specialized.

Starting point is 01:06:50 So there is, like, it's a huge discourse, like, for sure. But that's why we had to start characterizing. But if you go back to the definition, you know, going back to my schooling, go back to the definition of intelligence from Plato and Aristotle and Descartes. Like, in some sense, you see the goalposts moving through the centuries around what we define as intelligent. And I feel like we're still doing it. Yeah, we'll be doing it for a long time, you know, which at AI velocity is probably another

Starting point is 01:07:17 like four or five years. Hey, I just want to thank you both for the dialogue. You know, I treasure both of you as, you know, intellects and scholars and friends. And it was just a joy to nerd out with you all. So thank you both for taking the time. Thank you so much, Doug, for having me. Thank you. Thank you for having us.

Starting point is 01:07:41 This is great. You've been listening to The Shape of Things to Come, a Microsoft Research podcast. Check out more episodes of the podcast at AKA.ms slash research podcast or on YouTube and major podcast platforms.
