The a16z Show - Can AI Advance Science? DeepMind's VP of Science Weighs In

Starting point is 00:00:00 AI is not sort of nice to have. It's basically almost a necessity for us to make sense and reason about any problem that we are now looking at. I think there's going to be this really fun cultural shift where 10 years ago, people would say, oh, it's ridiculous that a computer could try to do these things. I think 10 years from now, people will be like, oh, it's ridiculous to have a human being do that. You can't load all these numbers in your head. Essentially, what we have entered is basically an age where a single human mind cannot comprehend the data that we are gathering about the universe.

Starting point is 00:00:34 One of these structures may have taken the length of a PhD, right, to solve a single structure. And now we're talking about true scale. There have been 1.6 or 7 million users of the whole database. Now, if that is not a positive statement about the planet, then I don't know what it is. There are 1.7 million people interested in protein structure prediction. I'm really happy about that. The last few years have been peppered with AI announcements. Let's recap a few.

Starting point is 00:01:04 April 2022, DALL-E 2 is released. Midjourney and Stable Diffusion fast-followed that summer. Then in November, ChatGPT arrives. Then, 2023 features the release of Claude, Llama, and Mistral 7B, just to name a few models. And we're only a quarter or so into 2024, and we're already seeing the expansion into AI music and video models faster than almost anyone could have imagined. And while much of the attention circles around creative tools, there was an AI unlock in biology that caught much attention in 2021.

Starting point is 00:01:40 That was AlphaFold 2. A breakthrough in prediction around the 3D models of protein structures was released and open sourced by the DeepMind team in July of that year. Since then, over 1.7 million scientists across 190 countries have been leveraging the tool. In the meantime, the DeepMind team has been hard at work, seeing how else machine learning can expand the frontier of science across many areas of biology, from structural biology to genomics, to protein design, to cell genomics, to quantum chemistry, to meteorology, to fusion, to pure mathematics, to computer science.

Starting point is 00:02:20 They've released papers like the high-accuracy weather model GraphCast in November, AlphaGeometry in January, which approached the level of a human Olympiad gold medalist, and other papers across materials, mathematical functions, and more, including, of course, continuing to push forward AlphaFold. And today, we have the pleasure of hearing directly from DeepMind's VP of Research

Starting point is 00:02:43 focused on science, Pushmeet Kohli. Pushmeet sits down with myself and a16z general partner Vijay, who has long been part of this intersection himself as a longtime professor at Stanford spanning several departments from computer science to structural biology to biophysics, and was also the founder of the Folding@home project, released in the year 2000.

Starting point is 00:03:07 Together, we reflect on the journey to AlphaFold. But more importantly, where are we in the trajectory of AI meaningfully impacting the way we perform and unlock new science, from new lab economics to clinical trials to drug discovery and more? So the question becomes, can artificial intelligence help us uncover fundamentally new science? And has it already done that? Let's find out. As a reminder, the content here is for informational purposes only,

Starting point is 00:03:37 should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures. So AI has been the talk of the town.

Starting point is 00:04:06 A lot of people are familiar with the consumer LLMs, think ChatGPT, maybe Midjourney. But AI has been around for quite some time, and it's also impacting the scientific sphere, which I think is so exciting, and I think both of you do too. So Pushmeet, maybe we could just start there and talk a little bit about your background,

Starting point is 00:04:23 how you kind of got into this intersection of science and AI. And also, you work for DeepMind, which I feel is one of the flagship AI companies. Why have you chosen to focus more there than perhaps some of the others? Yeah, so I took a very roundabout journey into what I do today at DeepMind. I'm a computer scientist by background and was hired at Microsoft Research and worked there for a decade, mostly working on applied mathematics, solving difficult maths problems, and most of them were encountered in machine learning. So I started with computer

Starting point is 00:05:01 vision, computer graphics, information retrieval. And after having gone through many of these applications, I was very excited about deep learning when it finally sort of emerged. I really thought that this was a game changer in terms of how machine learning is going to impact applications. Demis Hassabis is the CEO and founder of DeepMind. At that time, DeepMind was a young startup. And he reached out and said, well, we know you from some

Starting point is 00:05:30 acquaintances, why don't you join us? And I said, no, everybody was working on games at that time and I wanted to work on products and applications. And he said, well, the whole games thing is just phase one. The idea is to eventually impact science and impact applications, which are the biggest challenges in the world. And the level of conviction with which he basically made his case, I was like, convinced, this guy gets it. And so I moved to DeepMind in 2017, and I told him, if you're very serious about real world applications, we need to make sure that machine learning systems are reliable. So in fact, when I joined DeepMind, I founded the reliability and safety sort of team at DeepMind.

Starting point is 00:06:17 And around a year into it, Demis sort of said to me once, we are really interested in multidisciplinary research, where you want to apply machine learning to impactful problems, and I think the most impactful area that you could work on is science. And that was a complete left-field sort of suggestion. My last science class was in school. So I was quite skeptical, to be honest. I told him, like, you've got the wrong guy. I have no background in biology or physics or chemistry.

Starting point is 00:06:52 But he said, no, I mean, the way you are approaching these things, it's good. Let's sort of give it a try and see where it goes. And so we started the science program with six or seven people working on two projects. And now it's almost a 140-person team. And we have 10 different initiatives spanning many areas of biology, from structural biology to genomics, to protein design, to cell genomics, to quantum chemistry, to meteorology, to fusion, to pure mathematics, to computer science. So it's a long journey, but it started with sort of an accident. Yeah. And also a very scientific, iterative approach.

Starting point is 00:07:33 I love that. Vijay, before we jump into more of those projects that Pushmeet kind of alluded to there, I'd love to hear your background and how you got into this intersection of science and AI, because you also have quite the storied history there. Sure, yeah. So from 1999 to 2015, I was a professor at Stanford, and actually in a variety of departments. My home department was chemistry, but I also had appointments in computer science, structural biology, and was also chair of biophysics.

Starting point is 00:08:01 And at that intersection, it was clear that machine learning was a very exciting tool to use. I think what really was happening early with genomics in the 90s and then just plowed all the way through was the rise of data in biology and biology becoming very quantitative. And once it starts becoming quantitative, machine learning is very natural. As Pushmeet talked about, I think where a lot of us, myself included, got particularly excited was maybe 2013, 2014, 2015, as deep learning was emerging. And I think machine learning before deep learning was human beings having to come up with their features. And it was like a little tool. With deep learning, it could be something that replaces more and more

Starting point is 00:08:43 of the human part of the thinking. And actually, a lot of the interesting results are emergent after that. And those emergent properties got very exciting. It was clear at the time that we need a lot of compute. And so actually early on in 2000, I founded the Folding@home distributed computing project. And actually, we were some of the first to program GPUs. And so all of that comes together, the data, the compute. And then finally, the algorithms. Once those three pieces were together, I think many of us could see that this was taking off and it was time to dive in. Absolutely. I think that brings us to this question of the why now. So you kind of already addressed it. But Vijay, what gets you so excited about this intersection? We're recording this in 2024.

Starting point is 00:09:23 AI has really been around since maybe the 50s. Is it just that we have the right amount of compute? Is it that we have these unlocks when it comes to the modeling? Give us a little bit of a picture of what gets you so excited about what's to come before we dive into some of the specific examples. Yeah, if you step back, I think what we're really seeing in biology is this industrial revolution. If you look at a biology lab, maybe even today to some extent, versus 10 years ago versus 50 years ago,

Starting point is 00:09:51 there'll be benches and people in white coats and pipetting and so on. And maybe the boxes on the benches are a little different, but it's very, very similar. It's very bespoke and artisanal. What is shifting is that's becoming industrialized. We're seeing the rise of robotics. And we're seeing with that industrialization this immense amount of data. And so AI needs data and data needs AI.

Starting point is 00:10:14 And so as biology gets all that data, we can sort of lean into this. And what's most intriguing is that life sciences and healthcare largely have not been permeated by technology, not by IT to any great degree. And health care and life sciences collectively, it's almost like becoming 25% of US GDP. It's trillions and trillions of dollars going through this. And none of it or very little of it is being sort of revolutionized by tech. So this revolution, I think, is happening because of AI. AI is allowing this industrialization to happen, and especially turning these bespoke artisanal processes into something that is engineered and industrialized.

Starting point is 00:10:52 AI is one aspect of it, and there's many others. I talked about robotics. And that's the arc that's, I think, exciting. And it's something where I think we saw hints of it in 2015. It's probably a 25-year arc, maybe 30-year arc that we're 10 years into. And industrial revolutions don't happen overnight. But when you look back, the whole world's going to be changed. And so we're living in the middle of it.

Starting point is 00:11:13 And I was actually always jealous about people living in the 1920s and people going from nothing to steam trains and all this stuff. And actually now we're the ones that I think are in the center of it. It's such an exciting time. Right. You see that picture of, I think it's somewhere in New York where you have all of these horses lined up, right? And back then that just felt like the norm. And then you see what, like a decade later, it's all replaced by the equivalent of cars. And so, Pushmeet, maybe we could use AlphaFold as an example here because a lot of people listening to the podcast are maybe most familiar with

Starting point is 00:11:43 that paper and that breakthrough, but maybe also another great example of how that didn't happen overnight. I think most people noticed it in 2020, but it didn't start in 2020. And so maybe you could talk about that arc. What is AlphaFold? How did it come to be? And then also, where are we today in terms of its impact? Yeah. So AlphaFold, I was telling how I started my journey with the science program at DeepMind. And at that time, we had these two small-scale sort of projects. One was protein structure prediction and the other one was for chemistry, and AlphaFold sort of rose from that protein structure prediction project. In its simplest form, it's a very simple problem where, given an amino acid sequence which constitutes a protein, you want to understand the 3D coordinates of those amino

Starting point is 00:12:28 acids. And that's really important because if you understand the 3D structure of the protein, that informs and gives you an idea about what the function would be of that protein, and that has implications for drug discovery, for understanding basic cellular biology, and so forth. So we started working on this problem because we thought it sort of satisfies one of our key requirements when we look into problems. That is, it's a real foundational root node problem. Once you solve it, it has so many different implications in disease understanding and biology and synthetic biology as well.

Starting point is 00:13:06 And not only that, it is a classic sort of machine learning problem. You require reasoning in this problem because you are working with an expansive solution space, as well as you have access to raw material, which is data. And the structural biology community had done an amazing job in sort of curating a very good dataset in the form of the PDB. So scientists all across the world had, whenever they found the structure of a protein, which sometimes took almost five years or even a decade in some cases,

Starting point is 00:13:41 would diligently deposit that 3D structure in this database. And so at that time, when we started, that had built up to 150,000-odd structures, both from X-ray crystallography and cryo-EM. And that was like an amazing sort of data set to start with. And not only that, the other big problem in machine learning

Starting point is 00:14:03 is how do you evaluate the machine learning model? Because in machine learning, one of the easiest things that you can do is basically fool yourself. These models are extremely good at sort of cheating. And if you give them any sort of way to cheat, they will cheat. So the protein folding community and the protein structure prediction community had this annual, biennial sort of competition called CASP, the Critical Assessment of Structure Prediction. And they would run this blind assessment, like an Olympics of protein structure prediction, where people would be given protein sequences whose structure was not known by anyone,

Starting point is 00:14:41 only like one experimentalist who has deposited it, and then they would be tested. And the true generalization ability of the model would be exhibited. So we thought this problem really checked a number of key criteria, which we used for taking up a problem for the very long term. So we started with a team which investigated how much progress we can make on this. We were hopeful, optimistic that machine learning can play an important role, but we didn't know. This was a new problem for us, and we were approaching it with a lot of respect. And Pushmeet, what year was this when it started?

Starting point is 00:15:16 So we started around 2017, and we took part in the critical assessment at the end of 2018. And when we entered AlphaFold 1 in 2018, we were not really sure, like, where would it be, like maybe in the top three. But it actually performed really well. It not only was the state of the art, but outperformed the state of the art by a margin. And that validated our sort of hypothesis. The basic research philosophy at DeepMind has been the multidisciplinary nature of the teams. So we had brought in some really good structural biologists and biophysicists.

Starting point is 00:15:51 John Jumper, being the lead of AlphaFold, was part of the team at that time. And that gave us a lot of confidence. Now, we were the best in the world, but the model was still not useful, right? It was producing good results, but it was nowhere close to solving the problem. And then we had to sort of make a bet. Can we really go after it and solve it once and for all, or this is it? And so the first thing we had to do was start from scratch. We had to throw AlphaFold 1 off the table and say, this approach that we had started,

Starting point is 00:16:27 it is not going to work. What gave you the indication that AlphaFold 1 couldn't take you to the next level? Because I think even in the AI space outside of science, there are a lot of questions around, can we just depend on the scaling laws? Do we need some sort of new unlock to get to insert problem here? Could be AGI, could be something else. What gave you the indication that this is great? We're so happy with our results, but we actually need to throw this out and start anew.

Starting point is 00:16:54 AlphaFold 1 had adopted a classical approach, a classical two-stage approach, where the machine learning model's job was this: given a sequence, it does not predict the 3D coordinates of the amino acids directly. What it predicts is basically the distance between amino acids. And then there's a second stage which was supposed to take that distance matrix and recover the 3D coordinates. So the machine learning neural network's job was restricted to finding the distances between amino acid residues. And this two-stage sort of model was very effective, but it was not very elegant in the

Starting point is 00:17:34 sense that if you made certain errors, you would not be able to backpropagate to the neural network, because you found the results after the second stage and the neural network would not get that supervision. So we believed that in order to be able to properly train the model, we needed end-to-end. We needed a model which could go directly from the sequence to the structure. And that was one critical sort of element and a change that needed to be made, but it was a difficult change to make because you are starting from a much lower baseline when you are sort of building up that second end-to-end network. So let's fast forward.
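
As a rough illustration of the second stage described above, here is a minimal sketch of turning a matrix of pairwise distances back into 3D coordinates. It is not AlphaFold 1's actual method, which the conversation does not detail. It assumes exact distances and a first four points that are not coplanar, and the helper names `pairwise_distances` and `coords_from_distances` are hypothetical.

```python
import math


def pairwise_distances(coords):
    """Return the full matrix of Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def coords_from_distances(d):
    """Recover 3D coordinates, up to rotation and reflection, from distances.

    Assumptions (illustrative only): the distances are exact, and the first
    four points are not coplanar, so the pivots below are strictly positive.
    """
    n = len(d)
    # Gram matrix of points 1..n-1 relative to point 0 (law of cosines).
    g = [[(d[0][i] ** 2 + d[0][j] ** 2 - d[i][j] ** 2) / 2.0
          for j in range(1, n)] for i in range(1, n)]
    m = n - 1
    rows = [[0.0, 0.0, 0.0] for _ in range(m)]
    # Three-column Cholesky factorisation: for points in 3D the Gram matrix
    # has rank 3, so three columns are enough to reproduce it.
    for k in range(min(3, m)):
        pivot = g[k][k] - sum(rows[k][c] ** 2 for c in range(k))
        rows[k][k] = math.sqrt(pivot)
        for i in range(k + 1, m):
            dot = sum(rows[i][c] * rows[k][c] for c in range(k))
            rows[i][k] = (g[i][k] - dot) / rows[k][k]
    # Point 0 sits at the origin; each row is the position of the next point.
    return [[0.0, 0.0, 0.0]] + rows


# Toy example with made-up coordinates standing in for residue positions.
true_coords = [(1.0, 2.0, 3.0), (2.5, 2.0, 3.5), (0.0, 4.0, 1.0),
               (3.0, 0.5, 2.0), (1.0, 1.0, 0.0)]
distance_matrix = pairwise_distances(true_coords)
recovered = coords_from_distances(distance_matrix)
```

Because distances carry no information about orientation, the meaningful check is that the recovered points reproduce the input distance matrix, not that they equal the original coordinates. In this sketch the second stage is a fixed calculation applied after the distances are produced, which is the kind of separation between prediction and final structure that the speaker describes as the obstacle to training the network on the final result.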

Starting point is 00:18:13 So you did throw out AlphaFold 1, and then what happens after that? So AlphaFold 2, we start this long journey where we start making progress on AlphaFold 2 with a much lower sort of performance than AlphaFold 1 even. We have this internal leaderboard where everyone in the team can propose ideas and try out their ideas on the central leaderboard to see how much of a delta each idea or each change sort of makes. And we were making steady sort of progress. And then there were times where progress would stagnate, and sometimes even for months it would stagnate, and people would ask the question, well, have we reached the limit? But over time, and I think around when the pandemic started, we got some really, really big deltas where we thought we are making real progress.

Starting point is 00:19:05 And if you look at the metrics as to how do you quantify protein structure prediction accuracy, it's called GDT. And we had crossed that 80 GDT sort of threshold. And that was like unprecedented. And of course, that also motivated us to push it even further, and later on to 90 GDT and beyond, right? Which we thought is what we needed to do. And so the pandemic happened, and it really sort of brought home to the whole team

Starting point is 00:19:36 the actual importance of the problem. Because we were all sort of sitting in our homes, sort of shielding. And there were scientists out there who said, if you have the structure of the different SARS-CoV-2 proteins, it would be really helpful. Now, the community very quickly found the structure of the spike protein because it was also very sort of similar to SARS-CoV-1, but the accessory proteins of the virus, the structure for those was not known. And so the fact that we could

Starting point is 00:20:10 compute these predictions, share them with experts who are trying to deal with the pandemic and think about designing inhibitors and so on, it really brought home to the team the real impact and relevance that this fundamental problem has. And around September 2020, when the second CASP competition ended, we got this email from the organizers who wanted to chat. And that was unprecedented. We were sort of surprised. Like, why did the organizers want to sort of chat so early on? And they were super surprised at how good the predictions were. In fact, some of them speculated, maybe this team has cheated in some way. How could it be so good? But apparently there was one particular sort of scientist who had

Starting point is 00:21:01 submitted a protein, but did not know the structure. They had hoped that the structure would be obtained by the time the competition ended. But this structure was not known to anyone, literally anyone. And AlphaFold could give them an initial starting point from which they could solve the structure for that particular protein. So they were totally amazed that such a system now existed in the CASP competition. And we later on sort of released AlphaFold. And not only was it very accurate, it was also very efficient. So we decided to, in fact, find the structures for almost all the proteins that are known to scientists, around 250 million of them,

Starting point is 00:21:39 and put them in a database with our partners, the European Molecular Biology Laboratory and the EMBL-EBI, and then made that a resource that anyone can access. Yeah, that's amazing. And I'd love to turn it to you, Vijay. I mean, you obviously have run a lab for a long time. And you've been on the other side of this, right? All these researchers who now have access to this database, which, by the way, for the audience, one of these structures may have taken the length of a PhD, right, to solve a single structure.

Starting point is 00:22:08 And now we're talking about true scale. And also, again, this being deployed to all the researchers that can access it. So Vijay, maybe you can just speak to what that really means. And also, if we can apply this to other areas of science as well. The impact of this is manifold, and I can speak to it both from looking at it from the academic lens, but also from the last 10 years of investing in startups. Now, startups use this as well. First off, I think maybe it's worth really emphasizing the significance of structure itself. So the reason why universities like Stanford have whole departments for structural biology is that the structure is typically pretty evocative of function and other biological aspects.

Starting point is 00:22:47 Perhaps the most notable example is the DNA structure, and that Watson and Crick came up with this structure, and by looking at just the structure, you can infer how DNA is replicated and essentially how genetics works to some degree, the very basics of it. And so maybe that's one of the most sort of dramatic examples, but there's numerous examples where if you have the structure, you can understand the function. And so structural biology is a fundamental part of how we understand biology from the molecular scale up. And also for drug design, often if we understand the structure and its dynamics,

Starting point is 00:23:19 we can understand how to drug proteins and come up with therapeutics much more in an engineered fashion. So the significance of structural biology is huge. It's also at a time where structural biology is in a renaissance because, as you mentioned, it used to take many years to come up with experimental structures, but also new methods like cryo-EM can come up with structures in much shorter times, even days. And so there's a renaissance going on there. And I think for structural biology as a field, I think we'll see this combination of new experimental methods

Starting point is 00:23:48 and computational methods. And I think what was most striking to me is how experimentalists were going to these databases and looking at them and using them almost like you would use the human genome database. The human genome database takes genomics and turns it into a database lookup. You basically don't have to do the experiment yourself.

Starting point is 00:24:07 You can just do the computational query. To some degree, I think what AlphaFold did is it took the structural biology of proteins and made it a database lookup. It's not exactly a true database lookup in the sense that this is a prediction, but as the quality of predictions gets higher and higher, it becomes kind of the same thing. So that's huge. I think the final thing that was, I think, most striking is that there's always going to be a shift from academia to industry. And maybe 30 years

Starting point is 00:24:34 ago, academics would design computer chips and new types of microprocessors and so on, new architectures. We don't do that now in academia. I think that's not something that makes sense to do. That's much better done in companies, especially given the scale of what's going on. And I think what was most striking about this is that I think for multiple reasons, this is something that DeepMind was perfectly suited to do in a way that academic groups, I think, really weren't. And that shift now suggests that now I think it's a really interesting time for this to sort of leave academia and now be in the industrialized world of startups and companies. That's really interesting, the relationship you're talking about of academia and industry.

Starting point is 00:25:10 Something that people talk a lot about these days is whether these different AI models can really fundamentally advance science the way that you typically think of academics as the parties that are facilitating that. And so I'd love to hear from both of you, maybe starting with you, Vijay, what indications, whether it's through AlphaFold or other projects that you're seeing emerge, actually indicate that, yes, these models, these scientific discoveries, in a sense, are able to help us actually push the frontier instead of actually maybe just help us be a little more efficient within the zone that we're already in? I think Pushmeet said it well, that structure prediction is a foundational problem, but if you take, for instance, just the sort of arc of drug design,

Starting point is 00:25:51 where first you have to come up with understanding the biology, AI for biology is a very interesting area where we can maybe start to understand the nature of pathways and do this on human biology in ways that don't require experiments on human beings, which has always been one of the biggest limitations. I think we understand mouse biology really well because of all the experiments we can do, but we could never do that on human beings directly. But AI models for humans, as they become more predictive, and especially just more predictive than a mouse is predictive of a human, the mouse is a model in a sense.

Starting point is 00:26:21 That gets super interesting for unraveling biology. And so AI for biology is a thing. We could talk about AI for chemistry, and I think AlphaFold is in that category, where now we're trying to understand biophysical chemistry. You want to try and understand how can we quickly drug undruggable proteins, how can we come up with new antibodies and design proteins? That's a whole area.

Starting point is 00:26:40 And then finally, I think AI for clinical trials is going to be really where maybe the biggest impact financially will be. Clinical trials could cost hundreds of millions to billions of dollars. Even a 10% improvement on a billion dollar enterprise is huge. And that's where maybe some of the toughest problems to work on are. But I think as we make impact there, I think clinical trials will be better, will be probably more easily powered, will be hopefully more successful because we'll be picking the right ones to do. And then that turns into eventually AI for personalized medicine, which is in a sense an extension of that trial. And so we're now, I don't want you to do an experiment on me as a mouse or rat,

Starting point is 00:27:18 but I would love to make sure I get the best drugs for me. And you and I are different and will respond differently to drugs. To be able to have that predicted would be huge. So I think there's the arc of that. And I think we're just at the very beginning. Definitely. We talked about AlphaFold, which is very exciting and maybe the most familiar to folks. But Pushmeet, your team has also created a bunch of other papers that touch this intersection of AI and science, or you could say AI in math or AI in physics. And those are things like materials, GraphCast, which has to do with weather forecasting, FunSearch, AlphaGeometry.

Starting point is 00:27:52 And so I'd love to hear from you again on this probing of, are we moving the frontier forward with these different models? What are you seeing from some of these other projects that your team is working on in terms of AI helping us actually uncover new science? Essentially, what we have entered is basically an age where a single human mind cannot comprehend the data that we are gathering about the universe. And this is true in any field you now encounter. It is true in biology. No biologist can reason and analyze all the biological data that has been gathered. No physicist can look at and analyze all the high energy physics data that is being gathered. And even mathematicians cannot sort of look and analyze all the large-scale mathematical simulation data that we can now compute and simulate and find out. And I think what's happened is AI is not sort of nice to have.

Starting point is 00:28:44 It's basically almost a necessity for us to make sense and reason about any problem that we are now looking at. I have examples in pure mathematics, where, in work on topology, you describe a knot in two different sort of definitions. There is an algebraic definition and there is a geometric definition. And mathematicians understood these characteristics, but never understood the connections between them. And what we showed in one of our sort of works is basically we generated a lot of data

Starting point is 00:29:14 for knots in these two characterizations. And we asked the neural network, can you make predictions about one characterization from the other? And the idea was, well, the answer should be no. But in fact, it could make predictions. And when we drilled down, we found a very nice conjecture that nobody had encountered. And we worked with mathematicians, who then not only verified that conjecture, but actually proved that there was a very elegant, nice relationship between those two characterizations. So these are like completely fundamental discoveries in mathematics that were completely unknown to mathematicians now being uncovered by a machine learning and AI model. And we are seeing this across the board in any of the scientific areas that we are looking at.

Starting point is 00:30:02 We are discovering new insights, new sort of patterns that were not expected, just because the techniques to analyze the raw scale of data did not exist. I think amongst biologists, especially maybe 10 years ago and further back, I think there was often a belief that biology is just so complex that it's just incomprehensible, that there's no way to even understand it. The only thing you can do is run the experiment and see what happens. And I think we're seeing the beginning of a shift where people are starting to think, well, there are complexities and there's a lot we don't know, a lot to learn, but that AI actually can gather all that together and start to decipher this and to be a natural language for biology. And I think there's going to be this really fun cultural shift where 10 years ago, people would say, oh, it's ridiculous that a computer could try to do these things.

Starting point is 00:30:52 I think 10 years from now, people will be like, oh, it's ridiculous to have a human being do that. Like, you can't load all these numbers in your head. That's just ridiculous to even say that. And we've seen this in other places like chess. It seemed impossible that a computer could beat a grandmaster. And now it's not even worth trying. Table stakes. Yeah.

Starting point is 00:31:10 And we saw it with Go. We saw it with all these other things. So I think that's just the cultural shift. But I don't think that's a bad thing. I mean, a forklift can lift much more than the strongest weightlifter. And we view that as a positive thing. It's always going to be us and them. I think the interesting question will be, once it can do these things that we can't do,

Starting point is 00:31:28 well, what do we do together with that? Yeah, and what can we do? I mean, one of the most amazing things, I think, is that DeepMind, for the most part, has given these models or the results of them to the community. And so researchers have their hands on them. And so maybe we could talk about that. How are researchers leveraging these new breakthroughs? There's all kinds of stats around how we don't have enough cancer drugs or they're in shortages,

Starting point is 00:31:51 and those are very real things we want to fix. So, Pushmeet, maybe we'll start with you. What are you seeing and your team seeing in terms of this technology being deployed, and how are researchers using it? Yeah, so this was another sort of fascinating journey of growth. As I told you, I was not from the natural sciences. So working on AlphaFold was a learning experience. But then actually releasing AlphaFold to the community was an even bigger sort of learning

Starting point is 00:32:17 experience. So the AlphaFold database, when we were sort of building it up, we wanted it to be available everywhere on the planet to all the sort of scientists, but the scale of science was unprecedented. I was not aware of it. The AlphaFold database today has been accessed in 190 countries, and there have been 1.6 or 1.7 million users of the AlphaFold database. Now, if that is not a positive statement about the planet,

Starting point is 00:32:47 then I don't know what it is. There are 1.7 million people interested in protein structure prediction. I'm really happy about that. I mean, with all the things that are happening in the world. And in terms of the impact, it's again like an amazing sort of spectrum. We saw AlphaFold being used in pathbreaking, fundamental biological discoveries. Like my personal favorite in that domain is the nuclear pore complex, the structure of basically the pore complex, like the way the nucleus controls

Starting point is 00:33:15 how material gets into the nucleus and out. I mean, the fundamental structure of that complex was not known. And researchers used AlphaFold structures to be able to piece together the whole complex. A recent paper from the Feng lab showed how you could develop a molecular syringe. And again, they used AlphaFold 2 in designing that. And there are so many other sort of areas where people have been using it: for developing new vaccines, in working on new antibiotics against antimicrobial resistance, and in synthetic biology.

Starting point is 00:33:47 Like one of the key partners at the early stages was a university here in the UK, which was using sort of AlphaFold to develop and think about enzymes that could decompose plastics. So you have this whole spectrum of fundamental biology, drug discovery, to even synthetic biology and enzyme development that has been impacted by AlphaFold. And so it was very difficult to even predict what would be the uses of the tool. I think there's also, just within biology, been a shift in that I think people are sort of wrapping their heads around prediction a bit better. I think before, experiment was the gold standard, and that was all people wanted to hear about. I mean, part of it's also just the zeitgeist

Starting point is 00:34:32 at a time when you deal with large language models, you're basically dealing with predictions of what comes next. And I think people have understood the pros and cons of predictions, but that there's massive value in having them. And I think it's, you know, it's funny that we would talk so much about the technology, but I think the human shifts and the cultural shifts are the things that we're going to really need to push. And I think what gets me most excited about what Pushmeet's just been talking about is the fact that I think that's the sign that we're seeing this cultural shift as well. Maybe something else you could speak to, Vijay, that's just coming to mind as both of you

Starting point is 00:35:02 are sharing more about these researchers. How does this change the economics of a lab, right? If you think about what we talked about before, like uncovering a structure, it could have taken a whole PhD. Now we have new tools. And we're seeing these economics change in some of the more consumer fields, and those are very obvious. How does this change the economics of research overall? One of the sort of fantasies that one of my former colleagues talked about was what we call

Starting point is 00:35:28 Beach Biotech, where you have, let's say, one person at a laptop, presumably on the beach, wherever you want to be. And you've got CROs, these contract research organizations, to do the experiments. You have some AWS cloud or whatever, some GCP cloud somewhere to run your calculations. And that one person with AI, I think we're not quite there yet, but I think that's an intriguing fantasy to think about. And I think on the way to the one-person sort of aspiration is smaller teams doing way more with much less capital outlay and building startups, I think, much more efficiently, where they get to results much more rapidly. The challenge is going to be, as I mentioned before, that getting to the clinical

Starting point is 00:36:11 trials, speeding that up will be nice. But I think the big financial return will be on the clinical trial side. But I think the expectation is that AI for biology and understanding targets and so on based on human data, that would also help on the trial side in addition to anything else there. So I think put together, I think we can get to these therapeutics faster, cheaper, and hopefully better. Yeah. And maybe Pushmeet, we could tackle that directly. If you could give a sense for folks who aren't these researchers, who aren't already leveraging these tools, how much does it really cost if someone does want to get a protein structure prediction or use some of the other models that we've talked about, again GraphCast or materials, etc.? Like, what cost are we really looking at?

Starting point is 00:36:54 Yeah, so for the AlphaFold database, it's literally free. You just go to the AlphaFold database and sort of find the protein that you're interested in out of the 250 million proteins and click it and it's there. It's free for everyone on the planet to use. So really it has democratized things in a way that scientists in Latin America or India who were working on sort of neglected tropical diseases, for instance, who had no way they could get a structure of a protein that they were interested in, can now get access to these structures at the sort of click of a button. Of course, a lot of research needs to be done to take that work towards a more focused outcome, and a lot more investment is needed if you are trying to finish and accomplish

Starting point is 00:37:47 the vision that Vijay outlined. The AlphaFold structures are a start, but you really need to think about how does it bind to the ligands, how do you do the ligand design, how do you solve the co-folding problem. So there's a lot of investment that is needed to make these models and make these predictions and refine them for specific applications. And we have a spin-off from DeepMind, Isomorphic Labs, which is now investing in this area as well. At the same time, we are continuing to work on the foundational sort of side of things and have now released an announcement and update on the next generation of AlphaFold, which goes beyond proteins to other biomolecules, to nucleic acids like DNA, RNA, PTMs, small ligands, and so on.

Starting point is 00:38:34 I think it's amazing that you've opened this up to the community. And I think something I'd love to hear both of your takes on is really the relationship of these models and them being open sourced. I mean, it's a big debate within AI at large. But I think especially when it comes to science, there's, I think, both ends of the spectrum in a way, right? I think there's nothing that people get more excited about than this idea of curing cancer, like solving poverty and an agriculture crisis. But at the same time, people also get very scared, right? I think that's where people's sci-fi nightmares come to be, right, where they're like, oh, someone can engineer a molecule that can kill us all.

Starting point is 00:39:15 And I guess starting with you, Vijay, what's your take on this relationship of AI and science and why it should be open source? I think the beauty of open source, and we see this with open source for AI and biology, but AI more broadly, is that people can build on top of each other. And I think what's really remarkable about the AI field, I would say, over the last five, maybe possibly 10 years is that it feels like an amazing result comes out like once a week. And the key part of that is that it comes out with code or a GitHub repo that you can check out immediately. You don't even have to just believe the results, you can run it

Starting point is 00:39:50 yourself. People have even open sourced for tests of things. So essentially we're building like a skyscraper where each person builds a new floor and we're going up really fast. And that's what open source can do. In the past, if it wasn't open source, I'd have to read the paper. I'd have to code it myself, and sometimes the paper may be a little vague for some detail. So I might not bother, right? And I'll just go do my thing. And so I think what open source allows us to do is to build on top of each other and build rapidly. Now, certain parts won't be open source. I think you unfortunately can't open source a drug compound because then no one's going to pay for the trial and certain things like that. Just economics doesn't make sense given these hundreds of millions

Starting point is 00:40:31 or billions of dollars and so on. So certain parts will be closed source, and there's hundreds of startups in AI in biology and AI drug design that will maybe take advantage of what's been done, develop their own methods and build on top. And then that's where I think the drugs will come from. You talked about also the concern for how, because this is so powerful, we could maybe do sort of dangerous things with it. And that's where I think there's a bit of a misconception, because actually there's a huge asymmetry. There's the complexity of drug design for treating disease, and that's a really hard problem to do. But actually it turns out to be really easy to come up with chemicals that actually are dangerous and toxic. In fact, that's why we have

Starting point is 00:41:11 phase one trials, because even the things that you thought would really hopefully not be toxic at all turn out to be toxic. So it's actually very easy to make toxic things. And Google will teach you actually how to get ricin and how to get all this other stuff, for better or worse. So I think there the asymmetry is that if we get rid of AI for drug design, you lose all the good and you don't prevent any of the bad, which is already here. I think that's a good point that a lot of people don't think about. Pushmeet, maybe you could just speak to why DeepMind has chosen to open source these models, which isn't necessarily the norm across different AI companies.

Starting point is 00:41:48 There was a lot of deliberation within the team and within the company on this. I think there were a few different things that went into that final decision. One was that AlphaFold was so foundational that if we had kept it closed source, fully leveraging its impact for society would have been difficult. Because it's so fundamentally sort of foundational, it's very hard to even predict what the potential sort of applications of it are. Just to give you an example, when we launched AlphaFold, a couple of days later, somebody

Starting point is 00:42:27 did an analysis on the uncertainty associated with the AlphaFold predictions and figured out that in fact, AlphaFold, even though it was not trained for that, was the best predictor for predicting disorder in proteins. So that was something that we would not have come up with, right, if we had kept it closed source. Someone basically interacting with the models in the community figured that out. So when we were thinking about it, there was, of course, how to maximize social impact and scientific impact of the model. The second one was responsibility. We consulted a number of experts from structural biology, from chemistry, from drug discovery, to figure out what is the right and responsible and safe approach here, even considering

Starting point is 00:43:17 the malicious sort of use cases. And after we had done all the due diligence, we felt that this was safe to release and that the impact of releasing it and open sourcing it in a wider sort of way would outweigh any costs that we would need to sort of model, so it was decided that we should open source. And I think the decision has been validated by the impact that AlphaFold 2 has had in the community. Now, of course, that's not true for all the different models. In fact, subsequently, we have had models which we have not open sourced. But I think in the case of AlphaFold 2, the decision was very, very clear in favor of sharing it with the world in the freest way possible.

Starting point is 00:43:58 For the ones that you haven't chosen to open source, if you're willing to share, how do you make that decision? There are a number of different factors: what would be the social impact and the scientific impact of releasing things, versus what is the commercial cost of releasing something rather than leveraging it for commercial purposes, or even the safety sort of argument. So just to give you an example, one of our recent models that we announced last year was AlphaMissense, and this is a model for predicting the effect of missense variants. And what the model does is it produces state-of-the-art accuracy in making predictions about whether missense variants are benign or could be pathogenic. And in this particular case, we felt that the predictions of the model for the human genome, for the human missense variants, like the 71 million of them,

Starting point is 00:44:53 If we release that, that would serve most of the purposes that a clinician or a biologist would be interested in. So we just released the predictions rather than the model because the model had many other sort of uses. You could run it on different organisms. There were other sort of commercial considerations. So it was felt that we could release the predictions. We could share the methodology, but we will not sort of open source the approach. That makes sense. And I think at the very outset, you shared so many different projects or areas of scientific study that your team is working on. I'm just so curious because it sounds like there's been success across many. Are there any areas of science or mathematics that you've tried to address with this approach of using machine learning and AI that's not quite working,

Starting point is 00:45:40 whether it be because we don't have the prior dataset, as Vijay has spoken to, that sets the foundation? I'm just so curious if there are limitations emerging in any of these fields that your team is running into. One specific area that I would love to have impact on, right, and I think we would eventually have impact on, is systems biology. It's an incredibly important sort of problem to really understand at the system level how biological systems behave. It's just that the data and the evaluation are not at the place where they are for maybe genomics or functional genomics or for structural biology.

Starting point is 00:46:19 Before we actually start an initiative in any of these areas, there is a huge due diligence process that we need to undergo. Because essentially, you're making a very long-term commitment, and the careers and the impact of some of the best scientists and engineers that we have are being committed to that area. So we take that responsibility very seriously. And only when we are confident of the impact of the problem, we are confident that we have a good evaluation metric to track progress, and we have the raw material, the data or a simulator to get good data, only then do we make that long-term commitment towards a specific topic. To highlight the data issue, I think one of the biggest differences between AI for, let's say, language models or AI for video and AI for biology or for healthcare is that I think most of the

Starting point is 00:47:18 interesting data in biology and healthcare is either dark, in that there's all these medical records and so on that you can't just get access to on the internet, which would be very useful for understanding the healthcare side, trial side, and so on. It's either dark or it's never been measured, and then, oh, we need to do the experiments. I think having the data could be paramount. And I think that's going to be different than other places. In the other places, maybe the algorithms can really drive things because everyone has the same data more or less. I think here people will be differentiated by their data. And so the innovations will be innovations in AI combined with innovations in data collection. And there are obviously things at the interface, like active learning and how you can use

Starting point is 00:47:58 the data more efficiently and so on. But the data game, I think, is going to be huge. Absolutely. And Vijay, I'd love to just get your take. You've spoken to a few examples already. But to what different areas do you wish that more attention was being allocated? Or do you just think there's a set of grand challenges that can and will eventually be solved with some of this technology? The fun thing about CASP, this Critical Assessment of Structure Prediction, is that I think it also inspired all these other prospective trials and prospective studies. So there's a ton of that stuff to do. And I think there's a test for predicting binding of small molecules.

Starting point is 00:48:33 I think we'll see in time these types of methods do extremely well in those assessments. But the holy grail is, in my mind, being able to predict clinical trials. It's something where, you know, you have to understand how a drug works in human biology. And to Pushmeet's point, that's a systems biology problem at the largest scale. And so that is the holy grail. And I think we'll probably do it in parts. You could imagine even like models for specific organs or models for specific parts of the body. And then we put them together.

Starting point is 00:49:02 Mixtures of experts are pretty common these days. And maybe that would be one approach. But however it gets done, once that gets done to the point where these models are better than the animal models, I think that's where there's really going to be a tipping point and a point where we can just move much, much more rapidly, where we can sort of not get stymied with having to run these animal models, which takes a long time, and it's very expensive. And there's even crazy things. Like right now, there's a monkey shortage because monkeys are in such high demand to run these experiments. So I think there's probably a long road to get there, where these models of humans are more predictive than the alternatives, but I think once we get there, that will be a major inflection point. Wow. I did not know there was a monkey shortage, but, I mean, it really is important to know, right?

Starting point is 00:49:47 As in, to your point, hopefully we get to a future where some of the things that we're doing in research today seem just so incredibly outdated because we just have better options. Pushmeet, what's next up for DeepMind in terms of areas of interest? I mean, you're already working on so many things, but we'd love to just get a pulse on what's exciting for you, too. I think what is fascinating about science, in any of these fields, is that there's so much more to work on. I mean, even on structure prediction, I just mentioned the latest version of AlphaFold. The work there is on extending it to general biomolecules like DNA, understanding RNA, understanding the interactions between small molecules,

Starting point is 00:50:29 ligands and proteins, like bigger complexes, antibodies. There's so many things that we can extend in genomics. We have worked on both gene expression, the coding part of the genome, like with the missense variants, and the non-coding part of the genome, right? Or like predicting gene expression, we have made progress, but we are not completely at the end of it, right? So there's a lot that we are doing in all these areas. In materials science, you mentioned this model GNoME, which was able to predict 400,000 novel stable compounds, which expands the number of stable compounds known by more than an order of magnitude, right?

Starting point is 00:51:08 But how do you now take those sort of compounds and then reason about their specific properties that would be useful in a particular application? So in any of these disciplines, we are not targeting one specific milestone. We're just saying here is a topic, and the long-term sort of roadmap is to think about a paradigm shift in how science is done in that area and move towards a more rational, modeling-based approach, and tackling some of the problems that are encountered here. So there's a lot that needs to be done. And we are just trying to focus on specific areas, and then new areas come up if the raw materials are there in terms of data and if we are clear on the evaluation metric. We are constantly reviewing them as well. That's amazing.

Starting point is 00:51:56 I haven't done as much research as you, Vijay, but I did do a summer of battery research and materials research where we were trying to discover new sodium ion transition metal materials. And my summer was literally, I mean, this was when I was in college, so I wasn't very advanced, but it was literally like finding a paper that documented how to synthesize this material in the kiln, mixing it up, creating a little battery, doing it in the glove box and running it and just seeing how effective it was. And obviously, in many cases, it was very ineffective. But every so often we found a material. It was truly just trial and error, trial and error, trial and error. And when I see papers like this that do things in a completely new way at scale, way cheaper, you don't have all of these university students just in a glove box day and night. It's so exciting.

Starting point is 00:52:44 The end point for me is, as we talked about, we're kind of in the middle of this journey, this technological journey, this cultural journey, these cultural shifts, and it's going to feel like the big goals that I've laid out, let's say, clinical trial things and systems biology, are so far off, right? And it's going to take a while. But we can get a lot done in 10 years, collectively, 15 years. Think about where we were five years ago, 10 years ago, 15 years ago. Now, 15 years ago, people weren't really even talking that much about deep learning, or were just beginning.

Starting point is 00:53:14 So the goals that we have are lofty, but I think we're right in the thick of it. And all of it, I think, is very doable. It's just now building that tower one step at a time. It'll be fun to have this chat again in five years. Hopefully sooner. I think one sort of thing that has been very exciting in the last few years is, of course, there's a lot of excitement about LLMs and foundation models and so forth. And if you look at the impact that's going to have on science, now in most of the projects

Starting point is 00:53:44 that I was talking to you about, we were working with structured data, data either which was collected or, in the case of some of our fusion work, data that was simulated. But the rise of foundation models and LLMs opens up the possibility of now using unstructured data to feed these models. And so that really opens the door for a large-scale ingestion of scientific knowledge into the models. And that is a very exciting direction that will, I think, bring a number of other problems now into the feasibility zone, which previously were not there. Of course, there are challenges with understanding uncertainty and sort of hallucination, and all these sort of technical problems need to

Starting point is 00:54:31 be sort of addressed. But once that is done, I think the impact that's going to have on models for scientific discovery would be amazing. So that's another reason to be excited for the future. Absolutely. And all of the problems you just mentioned are also opportunities for people to go and fix and be a part of that whole ecosystem. So this has been really wonderful, Pushmeet, Vijay. Thank you for, as you said, getting people excited about what's to come. Because I think these two fields intersecting, what a time to be alive here in 2024 to kind of be a part of it. Like you said, Vijay, we're in our equivalent 1920s, so hopefully people in the 2120s

Starting point is 00:55:08 will look back at this fondly. Absolutely. Yeah. If you liked this episode, if you made it this far, help us grow the show. Share with a friend, or if you're feeling really ambitious, you can leave us a review at ratethispodcast.com/a16z. You know, candidly, producing a podcast can sometimes feel like you're just talking into a void. And so if you did like this episode, if you liked any of our episodes, please let us know.

Starting point is 00:55:37 We'll see you next time.
