a16z Podcast - Can AI Advance Science? DeepMind's VP of Science Weighs In
Episode Date: May 1, 2024

In recent years, the AI landscape has seen huge advancements, from the release of DALL-E 2 in April 2022 to the emergence of AI music and video models in early 2024. While creative tools often steal the spotlight, AlphaFold 2 marked a groundbreaking AI breakthrough in biology in 2021. Since its release, this pioneering tool for predicting protein structures has been utilized by over 1.7 million scientists worldwide, influencing fields ranging from genomics to computational chemistry.

In this episode, DeepMind's VP of Research for Science, Pushmeet Kohli, and a16z General Partner Vijay Pande discuss the transformative potential of AI in scientific exploration. Can AI lead to fundamentally new discoveries in science? Let's find out.

Resources:
Find Pushmeet on Twitter: https://twitter.com/pushmeet
Find Vijay on Twitter: https://twitter.com/vijaypande
Learn more about Google DeepMind: https://deepmind.google
Read DeepMind's AlphaFold whitepaper: https://deepmind.google/discover/blog/a-glimpse-of-the-next-generation-of-alphafold
Read about DeepMind's AlphaGeometry: https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry
Read DeepMind's research on new materials: https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/
Read DeepMind's paper on FunSearch, focused on new discoveries in mathematics: https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
AI is not sort of nice to have.
It's basically almost a necessity for us to make sense and reason about any problem that we are now looking at.
I think there's going to be this really fun cultural shift where 10 years ago, people would say,
oh, it's ridiculous.
A computer could try to do these things.
I think 10 years from now, people will be like, oh, it's ridiculous to have a human being do that.
You can't, like, load all these numbers in your head.
Essentially, what we have entered is basically an age where a single human mind cannot comprehend the data that we are gathering about the universe.
One of these structures may have taken the length of a PhD, right, to solve a single structure.
And now we're talking about true scale.
There have been 1.6 or 1.7 million users of the AlphaFold database.
Now, if that is not a positive statement about the planet, then I don't know what it is.
There are 1.7 million people interested in protein structure prediction.
I'm really happy about that.
The last few years have been peppered with AI announcements.
Let's recap a few.
April 2022, DALL-E 2 is released.
Midjourney and Stable Diffusion fast followed that summer.
Then in November, Chattu BT arrives.
Then, 2023 features the release of Claude, Llama, and Mistral 7B, just to name a few models.
And we're only a quarter or so into 2024, and we're already seeing the expansion into AI music and video models, faster than almost anyone could have imagined. And while much of the attention circles around creative tools, there was
an AI unlock in biology that caught much attention in 2021. That was AlphaFold 2. A breakthrough in predicting the 3D structures of proteins was released and open sourced by the DeepMind team in July of that year. Since then, over 1.7 million scientists across 190 countries
have been leveraging the tool.
In the meantime, the DeepMind team has been hard at work, seeing how else machine learning can expand the frontier of science across many areas, from structural biology to genomics, to protein design, to cell genomics, to quantum chemistry, to meteorology, to fusion, to pure mathematics, to computer science.
They've released papers like their high-accuracy weather model GraphCast in November, AlphaGeometry in January, which approached the level of a human Olympiad gold medalist,
and other papers across materials, mathematical functions, and more, including, of course,
continuing to push forward AlphaFold.
And today, we have the pleasure of hearing directly from DeepMind's VP of Research focused on science, Pushmeet Kohli.
Pushmeet sits down with myself and a16z general partner Vijay Pande, who has long been part of this intersection himself.
A longtime professor at Stanford, spanning several departments from computer science to structural biology to biophysics, Vijay was also the founder of the Folding@home project, released in the year 2000.
Together, we reflect on the journey to AlphaFold. But more importantly, where are we in the trajectory of
AI meaningfully impacting the way we perform and unlock new science? From new lab economics to clinical trials,
to drug discovery, and more.
So the question becomes, can artificial intelligence help us uncover fundamentally new science?
And has it already done that?
Let's find out.
As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund.
Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast.
For more details, including a link to our investments, please see a16z.com/disclosures.
So AI has been the talk of the town. A lot of people are familiar with the consumer LLMs, think ChatGPT, maybe Midjourney. But AI has been around for quite some time, and it's also impacting the scientific sphere, which I think is so exciting, and I think both of you do too. So Pushmeet, maybe we could just start there and talk a little bit about your background, how you kind of got into this intersection of science and AI.
And also, you work for DeepMind, which I feel is one of the flagship AI companies. Why have you chosen to focus more there than perhaps some of the others?
Yeah, so I took a very roundabout journey into what I do today at DeepMind.
I'm a computer scientist by background and was hired at Microsoft Research and worked there
for a decade, mostly working on applied mathematics, solving difficult maths problems,
and most of them were encountered in machine learning. So I started with computer vision,
computer graphics, information retrieval. And after having gone through many of these
applications, was very excited about deep learning when it finally sort of emerged. I really thought that this was a game changer in terms of how machine learning is going to impact applications.
Demis Hassabis, who is the CEO and founder of DeepMind, reached out. At that time, DeepMind was a young startup. And he said, well, we know you from some acquaintances, why don't you join us?
And I said, no, you guys are working on games, and I'm into products and applications.
And he said, well, the whole games thing is just phase one. The idea is to eventually impact science and impact applications, which are the biggest challenges in the world. And the level of conviction with which he basically made his case, I was convinced this guy gets it. And so I moved to DeepMind in 2017, and I told him, if you're very serious about real-world applications, we need to make sure that machine learning systems are reliable. So in fact, when I joined DeepMind, I founded the reliability and safety sort of team at DeepMind. And around a year into it, Demis sort of asked me once, look, you're really interested in multidisciplinary research, where you want to apply machine learning to impactful problems. And I think the most impactful area that you could work on is science. And that was a complete left-field sort of suggestion. The last science class I had taken was in school. So I was quite skeptical, to be honest. I told him, like, you've got the wrong guy.
Like, I have no background in biology or physics or chemistry. But he said, no, I mean,
the way you are approaching these things, it's good. Let's sort of give it a try and see
where it goes. And so we started the science program with six or seven people working on two
projects. And now it's almost a 140-person team. And we have 10 different initiatives, spanning
many areas of biology from structural biology to genomics, to protein design, to cell genomics,
to quantum chemistry, to meteorology, to fusion, to pure mathematics, to computer science.
So it's been a long journey, but it started with sort of an accident.
Yeah, and also a very scientific, iterative approach. I love that.
Vijay, before we jump into more of those projects that Pushmeet kind of alluded to there,
I'd love to hear your background and how you got into this intersection of science and AI, because you also have quite the storied history there.
Sure, yeah. So from 1999 to 2015, I was a professor at Stanford. And actually in a variety of departments, my home department was chemistry, but also had appointments in computer science, structural biology, and was also chair of biophysics. And at that intersection, it was clear that machine learning was a very exciting tool to use. I think what really was happening,
early with genomics in the 90s and then just plowed all the way through, was the rise of data in biology and biology becoming very quantitative. And once it starts becoming quantitative, machine learning is very natural. As Pushmeet talked about, I think, where a lot of us, and myself included, got particularly
excited was maybe 2013, 2014, 2015, as deep learning was emerging. And I think machine learning
before deep learning was human beings having to come up with the features, and it was like a little tool. With deep learning, it could be something that replaces more and more of the human part of the thinking. And actually, a lot of the interesting results are emergent after that. And those emergent properties got very exciting. It was clear at the time that we needed a lot of
compute. And so actually, early on in 2000, I founded the Folding@home distributed computing project. And actually, we were some of the first to program GPUs. And so all of that comes together:
data, the compute, and then finally the algorithms, once those three pieces were together,
I think many of us could see that this was taking off and it was time to dive in.
Absolutely. I think that brings us to this question of the why now. So you kind of already
addressed it. But Vijay, what gets you so excited about this intersection? We're recording this
in 2024. AI has really been around since maybe the 50s. Is it just that we have the right
amount of compute? Is it that we have these unlocks when it comes to the modeling? Give us a little
bit of a picture of what gets you so excited about what's to come before we dive into some
of the specific examples. Yeah, if you step back, I think what we're really seeing in biology
is this industrial revolution, that if you look at a biology lab, maybe even today to some
extent versus 10 years ago versus 50 years ago, there'll be benches and people in white coats
and pipetting and so on. And maybe the boxes on the benches are a little different, but it's very, very similar. It's very bespoke and artisanal. What is shifting is that it's becoming industrialized. We're seeing the rise of robotics, and we're seeing with that industrialization
this immense amount of data. And so AI needs data and data needs AI. And so as biology gets all
that data, we can sort of lean into this. And what's most intriguing is that life sciences and
healthcare largely has not been permeated by technology, not by IT to any great degree. And healthcare and life sciences collectively are becoming almost 25% of U.S. GDP. These are trillions and trillions of dollars going through this, and none of it, or very little of it, being sort of revolutionized by tech.
So this revolution, I think, is happening because of AI.
AI is allowing this industrialization to happen, and especially turning these bespoke
artisanal processes into something that is engineered and industrialized.
AI is one aspect of it, and there's many others.
I talked about robotics.
And that's the arc that's, I think, exciting.
And it's something where I think we saw hints of it in 2015.
It's probably a 25-year arc, maybe 30-year arc that we're 10 years into.
And industrial revolutions don't happen overnight.
But when you look back, the whole world's going to be changed.
And so we're living in the middle of it.
And I was actually always jealous of people living in the 1820s, people going from nothing to steam trains and all this stuff.
And actually now we're the ones that I think are in the center of it.
It's such an exciting time.
Right.
You see that picture of, I think it's somewhere in New York, where you have all of these horses lined up, right?
And back then, that just felt like the norm.
And then you see what, like a decade later,
it's all replaced by the equivalent of cars.
And so, Pushmeet, maybe we could use AlphaFold as an example here
because a lot of people listening to the podcast
are maybe most familiar with that paper and that breakthrough,
but maybe also another great example of how that didn't happen overnight.
I think most people noticed it in 2020, but it didn't start in 2020.
And so maybe you could talk about that arc.
What is AlphaFold?
How did it come to be? And then also, where are we today in terms of its impact?
Yeah, so AlphaFold. I was telling you how I started my journey with the science program at DeepMind, and at that time we had these two small-scale sort of projects. One was protein structure prediction and the other one was quantum chemistry. And AlphaFold sort of rose from that protein structure prediction project.
In its simplest form, it's a very simple problem: given an amino acid sequence, which constitutes a protein, you want to understand the 3D coordinates of those amino acids.
And that's pretty important because if you understand the 3D structure of the protein,
that informs and gives you an idea about what the function would be of that protein.
And that has implications for drug discovery, for understanding basic cellular biology, and so forth.
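As a toy sketch of the problem's shape (the sequence and the predictor below are placeholders, not AlphaFold): the input is a string of amino acids, and the desired output is one 3D coordinate per residue.

```python
import numpy as np

# Hypothetical 10-residue amino acid sequence (one letter per residue).
sequence = "MKTAYIAKQR"

def predict_structure(seq: str) -> np.ndarray:
    """Placeholder structure predictor: maps an N-residue sequence to an
    (N, 3) array of 3D coordinates. A real predictor such as AlphaFold
    learns this mapping from deposited experimental structures."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(seq), 3))  # stand-in coordinates

coords = predict_structure(sequence)
print(coords.shape)  # (10, 3): one (x, y, z) position per residue
```

The hard part, of course, is everything hidden inside that function: learning which coordinates a given sequence actually folds into.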
So we started working on this problem because we thought it satisfies one of our key requirements
when we look into problems.
That is, it's a real foundational root node problem. Once you solve it, it has so many different implications in disease understanding and biology and synthetic biology as well.
And not only that, it is a classic sort of machine learning problem.
You require reasoning in this problem because you are working with an expanded solution space, and you also have access to the raw material, which is data.
And the structural biology community had done an amazing job
in sort of curating a very good data set in the form of the PDB.
So scientists all across the world, whenever they found the structure of a protein, which sometimes took almost five years or even a decade in some cases, would diligently deposit that 3D structure in this database.
And so at that time, when we started, there were 150,000-odd structures, both from X-ray crystallography and cryo-EM, and that was like an amazing sort of data set to start with.
And not only that, there's the other big problem in machine learning as to how you evaluate the machine learning model.
Because in machine learning, one of the easiest things that you can do
is basically fool yourself.
These models are extremely good at sort of cheating.
And if you give them any sort of way to cheat, they will cheat.
So the protein folding community and the protein structure prediction community had this biennial sort of competition called CASP, the Critical Assessment of Structure Prediction. And they would run this blind assessment, like an Olympics of protein structure prediction, where people would be given protein sequences whose structure was not known by anyone, only by the one experimentalist who had deposited it. And then they would be tested, and the true generalization ability of the model would be exhibited. So we thought this problem really checked a number of key criteria, which we use for taking up a problem for the very long term.
So we started with a team which investigated how much progress we can make on this.
We were hopeful, optimistic that machine learning could play an important role, but we didn't know.
This was a new problem for us. And we were approaching it with a lot of respect.
And Pushmeet, what year was this when it started?
So we started around 2017. And we took part in the critical assessment at the end of 2018. And when we entered AlphaFold 1 in 2018, we were not really sure, like, where would
it be, like, maybe in the top three.
But it actually performed really well.
It not only was the state of the art, but outperformed the state of the art by a margin.
And that validated our sort of hypothesis.
And the basic research philosophy at DeepMind has been the multidisciplinary nature of
the teams.
So we had brought in some really good structural biologists and biophysics people,
John Jumper, the lead of AlphaFold, was part of the team at that time.
And that gave us a lot of confidence.
Now, we were the best in the world, but the model was still not useful, right?
It was producing good results, but it was nowhere close to solving the problem.
And then we had to sort of make a bet.
Can we really go after it and solve it once and for all, or this is it?
And so the first thing we had to do was start from scratch. We had to throw AlphaFold 1 off the table and say, this approach that we have started is not going to work.
What gave you the indication that AlphaFold 1 couldn't take you to the next level?
Because I think even in the AI space outside of science, there are a lot of questions around, can we just depend on the scaling laws?
Do we need some sort of new unlock to get to, you know, insert problem here?
Could be AGI, could be something else.
What gave you the indication that this is great?
We're so happy with our results, but we actually need to throw this out and start anew?
AlphaFold 1 had adopted a classical two-stage approach, where the machine learning model's job, given a sequence, was not to predict the 3D coordinates of the amino acids directly. What it predicted was basically the distances between amino acids.
And then there's a second stage
which was supposed to take that distance matrix
and recover the 3D coordinates.
So the machine learning neural network's job was restricted to finding the distances between amino acid residues.
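The second stage just described, recovering 3D coordinates from a pairwise distance matrix, can be sketched with classical multidimensional scaling. This is a simplification for illustration; AlphaFold 1 actually optimized coordinates against a predicted distance distribution.

```python
import numpy as np

def coords_from_distances(D: np.ndarray) -> np.ndarray:
    """Recover 3D coordinates (up to rotation, translation, reflection)
    from an n x n matrix of pairwise distances via classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gram matrix of centered coords
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    vals3 = np.maximum(vals[-3:][::-1], 0)  # top 3 eigenvalues
    vecs3 = vecs[:, -3:][:, ::-1]           # matching eigenvectors
    return vecs3 * np.sqrt(vals3)

# Sanity check: distances of the recovered coordinates match the input.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                           # true coordinates
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # 8 x 8 distances
X_rec = coords_from_distances(D)
D_rec = np.linalg.norm(X_rec[:, None] - X_rec[None, :], axis=-1)
print(np.allclose(D, D_rec))  # True
```

Note that this recovery step is a fixed procedure with no learned parameters, which is the limitation discussed next: errors discovered here give no training signal back to the network that predicted the distances.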
And this two-stage sort of model was very effective, but it was not very elegant, in the sense that if you made certain errors, you would not be able to back-propagate to the neural network, because you found the results after the second stage, and the neural network would not get that supervision. So we believed that in order to be able to properly train the model, we needed to go end-to-end. We needed a model which could go directly from the sequence to the structure.
And that was one critical sort of element and a change that needed to be made, but it was a
difficult change to make because you are starting from a much lower baseline when you are
sort of building up that second end-to-end network. So let's fast forward. So you did throw out
AlphaFold 1, and then what happens after that? So with AlphaFold 2, we start this long journey where we start making progress with a much lower sort of performance than even AlphaFold 1.
We have this internal leaderboard
where everyone in the team
can propose ideas and try out their ideas
on the central leaderboard
to see how much of a delta each idea or each change sort of makes.
And we were making steady sort of progress.
And then there were times where progress would stagnate
and sometimes even for months,
it would stagnate, and people would ask the question, well, have we reached the limit?
But over time, and I think around when the pandemic started,
we got some really, really big deltas where we thought we are making real progress.
And if you look at the metric as to how you quantify protein structure prediction accuracy, it's called GDT. And we had crossed that 80 GDT sort of threshold. And that was like unprecedented.
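For reference, GDT_TS scores a predicted structure by the fraction of residues that land near their experimental positions. A simplified sketch (it assumes the two structures are already superimposed; the real metric also searches over superpositions):

```python
import numpy as np

def gdt_ts(pred: np.ndarray, ref: np.ndarray) -> float:
    """Simplified GDT_TS: average, over cutoffs of 1/2/4/8 angstroms, of
    the percentage of residues whose predicted position lies within the
    cutoff of the reference position. Assumes pre-superimposed inputs."""
    dists = np.linalg.norm(pred - ref, axis=-1)  # per-residue error
    return 100.0 * np.mean([(dists <= c).mean() for c in (1.0, 2.0, 4.0, 8.0)])

# Toy example: 4 residues with errors of 0.5, 1.5, 3.0, and 9.0 angstroms.
ref = np.zeros((4, 3))
pred = ref + np.array([[0.5, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [9.0, 0, 0]])
print(gdt_ts(pred, ref))  # 56.25
```

On this 0-to-100 scale, crossing 80 and then 90 GDT, as described here, means the prediction is approaching experimental accuracy.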
And of course, that also motivated us to push it even further, and later on to 90 GDT and beyond, right, which we thought is what we needed to do. And so the pandemic happened, and it really sort of brought home to the whole
team, the actual importance of the problem, because we were all sort of sitting in our homes
sort of shielding. And there were scientists out there who said, if you have the structure of
the different SARS-CoV-2 proteins, it would be really helpful.
Now, the community very quickly found the structure of the spike protein because it was also
very similar to SARS-CoV-1, but the accessory proteins of the virus, the structure for those
was not known.
And so the fact that we could compute these predictions and share them with experts who were trying to deal with the pandemic and think about designing inhibitors and so on, it really brought home to the team the real-world impact and relevance that this fundamental problem has.
And around September 2020, when the CASP14 competition ended, we got this email from the organizers, who wanted to chat.
And that was unprecedented.
We were sort of surprised, like why did the organizers want to sort of chat so early on?
and they were super surprised at how good the predictions were.
In fact, some of them speculated maybe this team has cheated in some way.
How could it be so good?
But apparently there was one particular sort of scientist who had submitted a protein,
but did not know the structure.
They had hoped that the structure would be obtained by the time the competition ended,
but this structure was not known to anyone, literally anyone.
And AlphaFold could give them an initial starting point which could solve the structure for that particular protein. So they were totally amazed that such a system now existed in the CASP competition.
And we later on sort of released AlphaFold.
And not only was it very accurate, it was also very efficient.
So we decided to, in fact, find the structures for almost all the proteins that are known to scientists,
around 250 million of them, and put them in a database with our partners,
the European Molecular Biology Laboratory, EMBL-EBI, and maintain that as a resource that anyone can access.
Yeah, that's amazing.
And I'd love to turn it to you, Vijay.
I mean, you obviously have run a lab for a long time.
And you've been on the other side of this, right?
All these researchers who now have access to this database, which, by the way, for the audience,
one of these structures may have taken the length of a PhD, right, to solve a single structure.
And now we're talking about true scale.
and also, again, this being deployed to all the researchers that can access it.
So, Vijay, maybe you can just speak to what that really means, and also if we can apply this to other areas of science as well.
The impact of this is manyfold.
And I can speak to it both from looking at it from the academic lens, but also from the last 10 years of investing in startups.
And how startups use this as well.
First off, I think maybe it's worth really emphasizing the significance of structure itself.
So the reason why universities like Stanford have a whole department for structural biology is that the structure is typically pretty evocative of function and other biological aspects. Perhaps the most notable example is the DNA structure that Watson and Crick came up with. And by looking at just the structure, you can infer how DNA is replicated and essentially how genetics works to some degree, the very basics of it. And so maybe that's one of the
most sort of dramatic examples, but there's numerous examples where if you have the structure,
you can understand a function.
And so structural biology is a fundamental part of how we understand biology from the molecular scale up.
And also for drug design: often if we understand the structure and its dynamics, we can understand how to drug proteins and come up with therapeutics in a much more sort of engineered fashion.
So the significance of structural biology is huge.
It's also a time where structural biology is in a renaissance because, as you mentioned,
it used to take many years to come up with experimental structures, but now new methods like cryo-EM can come up with structures in much shorter times, or even days.
And so there's a renaissance going there.
And I think for structural biology as a field,
I think we'll see this combination of new experimental methods
and computational methods.
And I think what was most striking to me
is how experimentalists were going to these databases
and looking at them and using it almost like you would use
the human genome database.
That the human genome database takes genomics
and turns it into a database lookup,
so that you basically don't have to do the experiment yourself, you can just do the computational query. To some degree, I think what AlphaFold did is it took
the structural biology of proteins and made it a database lookup. It's not exactly a true database
lookup in the sense that this is a prediction, but as the quality of predictions get higher
and higher, it becomes kind of the same thing. So that's huge. I think the final thing that
was, I think, most striking is that there's always going to be a shift from academia to industry.
And maybe 30 years ago, academics would design computer chips and new types of microprocessors and so on, new architectures.
We don't do that now in academia.
I think that's not something that makes sense to do.
That's much better done in companies, especially given the scale of what's going on.
And I think what was most striking about this is that I think, for multiple reasons,
this is something that DeepMind was perfectly suited to do in a way that academic groups,
I think, really weren't.
And that shift now suggests that now I think it's a really interesting time for this
to sort of leave academia and now be in the industrialized world of startups and companies.
That's really interesting, the relationship you're talking about between academia and industry. Something that people talk a lot
about these days is whether these different AI models can really fundamentally advance science
the way that you typically think of academics as the parties that are facilitating that.
And so I'd love to hear from both of you, maybe starting with you, Vijay, what indications,
whether it's through Alpha Fold or other projects that you're seeing emerge, actually indicate
that yes, these models, these scientific discoveries in a sense, are able to help us actually
push the frontier instead of actually maybe just help us be a little more efficient within
the zone that we're already in. I think Pushmeet said it well, that structure prediction is a foundational problem. But if you take, for instance, just the sort of arc of drug design, first you have to come up with understanding the biology. AI for biology is a very interesting area where we can maybe start to understand the nature of pathways and do this on human biology in ways that don't require experiments on human beings, which has always been one of the biggest limitations. I think we understand mouse biology really well because of all the experiments we can do,
but we could never do that on human beings directly.
But AI models for humans, as they become more predictive,
and especially just more predictive than a mouse is predictive of human,
the mouse is a model in a sense.
That gets super interesting for unraveling biology.
And so AI for biology is a thing.
We could talk about AI for chemistry,
and I think Alpha Fold is in that category,
where now we're trying to understand biophysical chemistry,
you want to try and understand how we can quickly drug undruggable proteins, how we can come up with new antibodies and design proteins. That's a whole
area. And then finally, I think AI for clinical trials is going to be really where maybe the
biggest impact financially will be. Clinical trials could cost hundreds of millions to billions of
dollars. Even a 10% improvement on a billion-dollar enterprise is huge. And that's where maybe some of the toughest problems are to work on. But I think as we make impact there, clinical trials will be better: they'll probably be more easily powered and hopefully more successful, because we'll be picking the right ones to do. And then that turns into eventually AI for personalized medicine, which is in a sense an extension of that trial. And so now, I don't want you to do an experiment on me like a mouse or a rat, but I would love to make sure I get the best drugs for me. And you and I are different and will respond differently to drugs. To be able to have
very beginning. Definitely. We talked about AlphaFold, which is very exciting and maybe the most
familiar to folks. But Pushmeet, your team has also created a bunch of other papers that touch this
intersection of AI and science, or you could say AI in math or AI in physics. And those are things like materials, GraphCast, which has to do with weather forecasting, FunSearch, AlphaGeometry.
And so I'd love to hear from you again on this probing of, are we moving the frontier forward
with these different models, what are you seeing from some of these other projects that your team
is working on in terms of AI helping us actually uncover new science? Essentially, what we have
entered is basically an age where a single human mind cannot comprehend the data that we are
gathering about the universe. And this is true in any field you now encounter. It is true in biology.
No biologist can reason about and analyze all the biological data that has been gathered. No physicist can look at and analyze all the high-energy physics data that is being gathered.
And even mathematicians cannot sort of look and analyze all the large-scale mathematical
simulation data that we can now compute and simulate and find out.
And I think what's happened is AI is not sort of nice to have.
It's basically almost a necessity for us to make sense and reason about any problem that
we are now looking at.
I have examples in pure mathematics from work on topology. You can describe a knot with two different sorts of definitions: there is an algebraic definition and there is a geometric definition. And mathematicians understood these characterizations, but never understood the connections between them.
And what we showed in one of our sort of works is basically we generated a lot of data
for knots in these two characterizations.
And somehow, we asked the neural network,
can you make predictions about one characterization from the other?
And the idea was, well, the answer should be no.
But in fact, it could make predictions.
And when we drilled down, we found a very nice conjecture that nobody had encountered. And we worked with mathematicians who then not only made that conjecture, but actually proved that there was a very elegant, nice relationship between those two characterizations.
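The workflow just described, generate data in two characterizations and then test whether a model can predict one from the other, can be sketched on synthetic data. The "invariants" and the hidden relationship below are invented for illustration; the actual work used real algebraic and geometric knot invariants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "geometric invariants" for 1,000 objects (stand-ins for knots).
X = rng.normal(size=(1000, 4))
# A hidden relationship produces the "algebraic invariant" (plus noise).
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=1000)

X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

# Fit the simplest possible model: least squares.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
pred = X_test @ w

# If held-out prediction beats the trivial mean baseline, some
# relationship between the two characterizations likely exists,
# a cue to go looking for a provable conjecture.
r2 = 1 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(r2 > 0.9)  # True: the hidden relationship is detectable
```

The model itself proves nothing; the predictability is the signal, and the human mathematicians turn that signal into a theorem, as in the collaboration described here.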
So these are, like, completely fundamental discoveries in mathematics that were completely unknown to mathematicians, now being uncovered by a machine learning and AI model. And we are seeing this across the board in any of the scientific areas
that we are looking at. We are discovering new insights, new sort of patterns that were not
expected just because the techniques to analyze the raw scale of data did not exist.
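The knot experiment Pushmeet describes can be sketched in miniature: fit a model to predict one characterization from the other and check whether it beats the trivial constant baseline. Note this is a toy illustration, not DeepMind's actual pipeline; the features below are synthetic stand-ins, not real knot invariants.

```python
# Toy version of "can one characterization predict the other?"
# X plays the role of the algebraic invariants, y a geometric one
# that secretly depends on them (the relationship to discover).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = 2.0 * X[:, 0] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=2000)

# Train/test split, then a linear probe fit by least squares.
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]
A = np.column_stack([X_tr, np.ones(len(X_tr))])   # add an intercept column
coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)

pred = np.column_stack([X_te, np.ones(len(X_te))]) @ coef
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot   # R^2 near 0 would mean "no relationship"

# A large held-out R^2 is the surprising signal that the two
# characterizations are linked -- the cue to hunt for a conjecture.
print(f"held-out R^2: {r2:.3f}")
```

In the real work the probe was a neural network and the surprise was that it predicted anything at all; drilling into which inputs mattered is what pointed the mathematicians toward the conjecture.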
I think amongst biologists, especially maybe 10 years ago and further back, there
was often a belief that biology is just so complex that it's incomprehensible, that
there's no way to even understand it.
The only thing you can do is run the experiment and see what happens.
And I think we're seeing the beginning of a shift where people are starting to think,
well, there are complexities, and there's a lot we don't know, a lot to learn,
but AI actually can gather all that together
and start to decipher it and be a natural language for biology.
And I think there's going to be this really fun cultural shift, where 10 years ago
people would say, oh, it's ridiculous
that a computer could try to do these things.
I think 10 years from now, people will be like, oh, it's ridiculous
for a human being to do that. Like, you can't load all these numbers in your head.
That's just ridiculous to even say. And we've seen this in other places, like chess.
It seemed impossible that a computer could beat a grandmaster. And now it's not even
worth trying. Table stakes. Yeah, yeah. And we saw it with Go. We saw it with all these other things.
So I think that's just the cultural shift. But I don't think that's a bad thing. I mean,
a forklift can lift much more than the strongest weightlifter, and we view that as a positive thing.
It's always going to be us and them together. I think the interesting
question will be, once AI can do these things that we can't do, well, what do we do
together with that? Yeah, and what can we do? I mean, one of the most amazing things, I think,
is that DeepMind, for the most part, has given these models, or the results of them, to the
community. And so researchers have their hands on them. And so maybe we could talk about that.
How are researchers leveraging these new breakthroughs? There are all kinds of stats about how we don't
have enough cancer drugs, or they're in shortage. And those are very real things we want to fix.
So Pushmeet, maybe we'll start with you.
What are you and your team seeing in terms of this technology being deployed,
and how are researchers using it?
Yeah, so this was another fascinating journey of growth.
As I told you, I was not from the natural sciences,
so working on AlphaFold was a learning experience.
But then actually releasing AlphaFold to the community was an even bigger learning experience.
With the AlphaFold database, when we were building it up, we wanted it to be available
everywhere on the planet, to all the scientists.
But the scale of science was unprecedented.
I was not aware of it.
The AlphaFold database today has been accessed from 190 countries.
And there have been 1.6 or 1.7 million users of the AlphaFold database.
Now, if that is not a positive statement about the planet, then I don't know what is.
There are 1.7 million people interested in protein structure prediction.
I'm really happy about that, given all the things that are happening in the world.
And in terms of the impact, it's again an amazing spectrum. We saw AlphaFold being used in pathbreaking, fundamental biological discoveries.
My personal favorite in that domain is the nuclear pore complex: the structure of the pore complex, the way a nucleus controls how material gets into the nucleus and out.
I mean, the fundamental structure of that complex was not known.
And researchers used AlphaFold structures to be able to piece together the whole complex.
A recent paper from the Feng lab showed how you could develop a molecular syringe.
And again, they used AlphaFold 2 in designing that.
And there are so many other areas where people have been using it: for developing new vaccines,
in working on new antibiotics against antimicrobial resistance, and in synthetic biology.
One of the key partners at the early stages was a university here in the UK, which was using
AlphaFold to develop and think about enzymes that could decompose plastics.
So you have this whole spectrum, from fundamental biology to drug discovery to even synthetic biology
and enzyme development, that has been impacted by AlphaFold.
And so it was very difficult to even predict what the uses of the tool would be.
I think there's also, just within biology, been a shift where people are
wrapping their heads around prediction a bit better. Before, experiment was the
gold standard and that was all people wanted to hear about. Part of it is also just the zeitgeist:
at a time when you deal with large language models, you're basically dealing with predictions
of what comes next. And I think people have understood the pros and cons of predictions, but also that
there's massive value in having them. And you know, it's funny
that we talk so much about the technology, but I think it's the human shifts and the cultural
shifts that are the things we're going to really need to push. And what gets me most
excited about what Pushmeet's just been talking about is that it's a
sign that we're seeing this cultural shift as well. Maybe something else you could speak to,
Vijay, that's just coming to mind as both of you are sharing more about these researchers.
How does this change the economics of a lab, right? If you think about what we talked about
before, like uncovering a structure, it could have taken a whole PhD. Now we have
new tools. And we're seeing these economics change in some of the more consumer fields, where those changes
are very obvious. How does this change the economics of research overall? One of the fantasies
that one of my former colleagues talked about was what we call a beach biotech, where you have,
let's say, one person at a laptop, presumably on the beach, wherever you want to be. And you've got
CROs, these contract research organizations, to do the experiments. You have some AWS cloud or
whatever, some GCP cloud somewhere, to run your calculations. And that one person with AI:
I think we're not quite there yet, but I think that's an intriguing fantasy to think about.
And I think on the way to that one-person aspiration is smaller teams doing way more
with much smaller capital outlays, building startups, I think, much more efficiently,
and getting to results much more rapidly.
The challenge is going to be what I mentioned before: getting to the clinical
trials. Speeding that up will be nice, but I think the big financial cost will be on the
clinical trial side. But I think the expectation is that AI for biology, and understanding
targets and so on based on human data, would also help on the trial side in addition to
everything else. So put together, I think we can get to these therapeutics
faster, cheaper, and hopefully better. Yeah. And maybe, Pushmeet, we could tackle that directly.
If you could give a sense, for folks who aren't researchers already leveraging these
tools: how much does it really cost if someone does want to get a protein structure prediction,
or use some of the other models that we've talked about, again, GraphCast or materials, etc.?
Like, what cost are we really looking at? Yeah. So the AlphaFold database is literally free.
You just go to the AlphaFold database and find the protein that you're interested in, out of the
250 million proteins, and it's there. It's free for everyone on the planet to use.
So really, it has democratized things, in a way that scientists in Latin America or India
who were working on neglected tropical diseases, for instance,
and who had no way to get the structure of a protein that they were interested in,
can now get access to these structures at the click of a button.
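That "click of a button" access also works programmatically: the public AlphaFold Database exposes a simple REST endpoint keyed by UniProt accession. A minimal sketch (the accession P69905, human hemoglobin alpha, is just an illustrative choice; field names in the commented fetch are from the public API and should be verified against its docs):

```python
def alphafold_prediction_url(uniprot_accession: str) -> str:
    """Build the AlphaFold DB REST endpoint URL for a UniProt accession."""
    return f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_accession}"

url = alphafold_prediction_url("P69905")  # human hemoglobin alpha subunit
print(url)

# To actually fetch the entry metadata (includes links to the
# predicted structure files):
# import json, urllib.request
# with urllib.request.urlopen(url) as resp:
#     entry = json.load(resp)[0]
#     print(entry["pdbUrl"])  # direct link to the predicted PDB file
```

No API key or payment is involved, which is what makes the "structure at a click" story above possible for any lab with an internet connection.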
Of course, a lot of research needs to be done to take that work
towards a more focused outcome, and a lot more investment is needed.
If you are trying to finish and accomplish the vision that Vijay outlined,
the AlphaFold structures are a start,
but you really need to think about how a protein binds to ligands,
how you do the ligand design, how you solve the co-folding problem.
So there is a lot of investment that is needed to make these models,
and make these predictions, and refine them for specific applications.
And we have a spinoff from DeepMind, Isomorphic Labs, which is now investing in this area as well.
At the same time, we are continuing to work on the foundational side of things,
and have now released an announcement, an update, on the next generation of AlphaFold,
which goes beyond proteins to other biomolecules: nucleic acids like DNA and RNA, post-translational modifications, small ligands, and so on.
I think it's amazing that you've opened this up to the community.
And I think something I'd love to hear both of your takes on is really the relationship of
these models to being open sourced.
I mean, it's a big debate with AI at large.
But I think especially when it comes to science, there are, I think, both ends of the spectrum
in a way, right?
I think there's nothing that people get more excited about than this idea of curing cancer,
solving poverty and the agriculture crisis.
But at the same time, people also get very
scared, right? I think that's where people's sci-fi nightmares come to be, right, where they're
like, oh, someone can engineer a molecule that can kill us all. And I guess starting with
you, Vijay: what's your take on this relationship of AI and science, and why it should be open source?
I think the beauty of open source, and we see this with open source for AI in biology, but AI more broadly,
is that people can build on top of each other. And I think what's really remarkable about the
field, I would say over the last five, maybe possibly 10 years, is that it feels like an
amazing result comes out like once a week. And the key part of that is that it comes out
with code, a GitHub repo, that you can check out immediately. You don't even have to just
believe the results; you can run it yourself. People have even open sourced the tests of things.
So essentially we're building like a skyscraper, where each person builds a new floor, and we're
going up really fast. And that's what open source can do. In the past, if it wasn't open source,
I'd have to read the paper.
I'd have to code it myself.
And sometimes the paper may be a little vague on some detail.
So I might not bother, right?
And I'll just go do my thing.
And so I think what open source allows us to do is to build on top of each other and build rapidly.
Now, certain parts won't be open source.
I think you unfortunately can't open source a drug compound, because then no one's going to pay for the trial.
And for certain things like that, the economics just doesn't make sense, given these hundreds of millions or billions of dollars and so on.
So certain parts will be closed source.
And there are hundreds of startups in AI in biology and AI drug design that will maybe take advantage
of what's been done, develop their own methods, and build on top. And that's where I think
the drugs will come from.
the drugs will come from. You talked about also the concern for how, because this is so powerful,
we could maybe do a sort of dangerous things with it. And that's where I think there's a bit of a
misconception because actually there's a huge asymmetry between the complexity of drug design
for treating disease. And that's a really hard problem to do.
But it actually turns out to be really easy to come out with chemicals that actually are dangerous and toxic.
In fact, that's why we have phase one trials, because even the things that you thought would really hopefully not be toxic at all turns out to be toxic.
So it's actually very easy to make toxic things.
And Google will teach you actually how to get ricin and how to get all this other stuff for better or worse.
So I think there the asymmetry is that if we get rid of AI for drug design, you lose all the good and you don't prevent any of the bad, which is all.
already here. I think that's a good point that a lot of people don't think about.
Pushmeet, maybe you could just speak to why DeepMind has chosen to open source these
models, which isn't necessarily the norm across different AI companies.
There was a lot of deliberation within the team and within the company on this.
I think there were a few different things that went into that final decision.
One was, AlphaFold was so foundational that
if we had kept it closed source, fully leveraging its impact
for society would have been difficult.
And because it's so fundamentally foundational,
it's very hard to even predict what the potential applications of it are.
Just to give you an example: when we launched AlphaFold, a couple of days later, somebody
did an analysis of the uncertainty associated with AlphaFold predictions and figured out
that in fact AlphaFold, even though it was not trained for that, was the best predictor for
predicting disorder in proteins. So that was something that we would not have come up with, right,
if we had kept it closed source; someone in the community, interacting with the models,
figured that out. So when we were thinking about it, there was, of course, how to maximize
the social impact and scientific impact of the model. The second one was responsibility.
We consulted a number of experts from structural biology, from chemistry, from drug discovery,
to figure out what the right, responsible, and safe approach was here, even considering
the malicious use cases. And after we had done all the due diligence, and felt
that this was safe to release, and that the impact of releasing it and open sourcing it in a wider
way would outweigh any costs that we would need to model, it was decided that
we should open source. And I think the decision has been validated by the impact that
AlphaFold 2 has had in the community. Now, of course, that's not true for all the different
models. In fact, subsequently, we have had models which we have not open sourced. But I think in
the case of AlphaFold 2, the decision was very, very clear in favor of sharing it with the world
as freely as possible.
And for the ones that you haven't chosen to open source, if you're willing to share, how do you make that
decision? There are a number of different factors: what would be the social impact and
the scientific impact of releasing things, versus the commercial cost of releasing
something rather than leveraging it for commercial purposes, or even the safety argument.
Just to give you an example, one of our recent models that we announced last year was
AlphaMissense. This is a model for predicting the effect of missense variants. What the model
does is produce state-of-the-art accuracy in making predictions about whether missense variants
are benign or could be pathogenic. And in this particular case, we felt that if we released the predictions of the
model for the human genome, for the human missense variants, like the 71 million of them,
that would serve most of the purposes that a clinician or a biologist would be
interested in. So we just released the predictions rather than the model, because the model had many
other uses. You could run it on different organisms. There were other commercial
considerations. So it was felt that we could release the predictions, and we could share the methodology,
but we would not open source the approach. That makes sense. And I think at the very outset,
you shared so many different projects or areas of scientific study that your team is working on.
I'm just so curious, because it sounds like there's been success across many:
are there any areas of science or mathematics that you've tried to address with this approach of using machine learning and AI where it's not quite working,
whether because we don't have the prior dataset, as Vijay has spoken to, that sets the foundation?
I'm just so curious if there are limitations emerging in any of these fields that your team is running into.
One specific area that I would love to have impact on,
and I think we eventually will have impact on, is systems biology.
It's an incredibly important problem to really understand, at the system level,
how biological systems behave.
It's just that the data and the evaluation are not at the place where they are for, maybe, genomics,
or functional genomics, or structural biology.
Before we actually start an initiative in any of these areas, there is a huge due diligence process that we need to undergo, because essentially you're making a very long-term commitment, and the careers and the impact of some of the best scientists and engineers that we have are being committed to that area.
So we take that responsibility very seriously.
And only when we are confident of the impact of the problem, confident that we have a good
evaluation metric to track progress, and we have the raw material, the data or a simulator
to get good data,
only then do we make that long-term commitment towards a specific topic.
To highlight the data issue: I think one of the biggest differences between AI for, let's say,
language models, or AI for video, and AI for biology or for healthcare, is that I think
most of the interesting data in biology and healthcare is either dark, like all
these medical records and so on that you just can't get access to on the internet, which would be very
useful for understanding the healthcare side, the trial side, and so on. It's either dark or it's
never been measured, and we need to do the experiments. I think having the data
could be paramount, and I think that's going to be different from other places. In other
places, maybe the algorithms can really drive things, because everyone has the same data, more or less.
I think here people will be differentiated by their data.
And so the innovations will be innovations in AI combined with innovations in data collection.
And there are obviously things at the interface, like active learning, and how you can use the data
more efficiently and so on.
But the data game, I think, is going to be huge.
Absolutely.
And Vijay, I'd love to just get your take.
You've spoken to a few examples already.
But what different areas do you wish more attention was being allocated to?
Or do you just think there's a set of grand challenges that can and will
eventually be solved with some of this technology?
The fun thing about CASP, this critical assessment of structure prediction, is that I think
it also inspired all these other prospective trials and prospective studies.
So there's a ton of that stuff to do, and I think there's a test for predicting binding
of small molecules.
And I think we'll see, in time, these types of methods do extremely well in those assessments.
But the Holy Grail, in my mind, is being able to predict clinical trials.
It's about understanding how a drug
works in human biology. And to Pushmeet's point, that's a systems biology problem at the
largest scale. So that is the Holy Grail. And I think we'll probably do it in parts. You could imagine
even models for specific organs, or models for specific parts of the body, and then we put them
together. Mixture of experts is pretty common these days, and maybe that would be one approach. But
however it gets done, once that gets done to the point where these models are better than the animal
models, I think that's where there's really going to be a tipping point, a point where we can
just move much, much more rapidly, where we don't get stymied by having to run these
animal models, which take a long time and are very expensive. And there are even crazy things:
like right now there's a monkey shortage, because monkeys are in such high demand for these
experiments. So I think there's a long road to get to where these models of humans are more
predictive than the alternatives. But I think once we get there, that will be a major inflection point.
Wow. I did not know there was a monkey shortage, but I mean, it really is important to know, right? As in, to your point, hopefully we get to a future where some of the things that we're doing in research today seem just so incredibly outdated, because we just have better options.
Pushmeet, what's next up for DeepMind in terms of areas of interest? I mean, you're already working on so many things, but we'd love to just get a pulse on what's exciting for you.
I think what is fascinating about science, in any of these fields, is that there's so much more to work on.
I mean, even on structure prediction, I just mentioned that for the latest version of AlphaFold, the work there is on extending it to general biomolecules: DNA, understanding RNA, understanding the interactions between small molecules, ligands, and proteins, like bigger complexes, antibodies.
There are so many things that we can extend in genomics.
We have worked on both gene expression and the coding part of the genome, like with the
missense variants, and the non-coding part of the genome, right?
Like predicting gene expression: we have made progress, but we are not completely
at the end of it, right?
So there's a lot that we are doing in all these areas, and in materials science.
You've mentioned this model, GNoME, which was able to predict 400,000 novel stable compounds,
which expands the number of known stable compounds by more than an order of magnitude, right?
But how do you now take those compounds and then reason about their specific properties
that would be useful in a particular application, right?
So in any of these disciplines, we are not targeting one specific milestone.
We are just saying: here is a topic, and the long-term roadmap is to think about a paradigm shift
in how science is done in that area,
and to move towards a more rational,
modeling-based approach
to tackling some of the problems
that are encountered there.
So there's a lot that needs to be done,
and we are just trying to focus on
specific areas. And new areas come up
if the raw materials are there in terms of data,
and if we are clear on the evaluation metric;
we are constantly reviewing them as well.
That's amazing.
I haven't done as much research as you, Vijay,
but I did do a summer of battery research and materials research, where we were trying to discover
new sodium-ion transition metal materials. And my summer was literally, I mean, this was when I was
in college, so I wasn't very advanced, but it was literally: finding a paper that documented
how to synthesize this material in the kiln, mixing it up, creating a little battery, doing it in
the glove box, and running it and just seeing how effective it was. And obviously, in many cases,
it was very ineffective, but every so often we found a material. It was truly just trial and error,
trial and error, trial and error. And when I see papers like this, that do things in a completely
new way, at scale, way cheaper, where you don't have all of these university students just in a glove box
day and night, it's so exciting. The end point for me is, as we talked about, we're kind of
in the middle of this journey, this technological journey, this cultural journey,
these cultural shifts, and it's going to feel like the big goals that I've
laid out, let's say the clinical trial things and systems biology, are so far off, right?
And it's going to take a while. But we can get a lot done in 10 years, collectively, 15 years.
So think about where we were five years ago, 10 years ago, 15 years ago.
Now, 15 years ago, people weren't really even talking that much about deep learning, or it was just
beginning. So the goals that we have are lofty, but I think we're right in the thick of it,
and all of it, I think, is very doable. It's just now building that tower one step at a time.
It'll be fun to have this chat again in five years.
Hopefully sooner.
I think one thing that has been very exciting in the last few years is the rise of LLMs.
Of course, there's a lot of excitement about LLMs and foundation models and so forth.
And if you look at the impact that's going to have on science: in most of the projects that I was talking to you about,
we were working with structured data, data which was either collected or, in the case of some of our fusion work, data that was simulated.
But with the rise of foundation models and their algorithms, that opens up the possibility of now using unstructured data to feed these models.
And so that really opens the door for a large-scale ingestion of scientific knowledge into the models.
And that is a very exciting direction that will, I think, bring a number of other problems into the feasibility zone which previously were not there.
Of course, there are challenges with understanding uncertainty and hallucination,
and all these technical problems need to be addressed.
But once that is done, I think the impact that's going to have on models for scientific
discovery will be amazing.
So that's another reason to be excited for the future.
Absolutely.
And all of the problems you just mentioned are also opportunities for people to go and
fix and be a part of that whole ecosystem.
So this has been really wonderful, Pushmeet, Vijay.
Thank you for, as you said, getting people excited about what's to come, because with these two fields intersecting, what a time to be alive here in 2024, to be a part of it.
Like you said, Vijay, we're in our equivalent of the 1920s.
So hopefully people in the 2120s will look back at this fondly.
Absolutely.
Yeah.
If you liked this episode, if you made it this far, help us grow the show.
Share it with a friend, or if you're feeling really ambitious, you can leave us a review
at ratethispodcast.com/a16z.
You know, candidly, producing a podcast can sometimes feel like you're just talking into a void.
And so if you did like this episode, if you liked any of our episodes, please let us know.
I'll see you next time.