No Priors: Artificial Intelligence | Technology | Startups - Virtual Cell Models, Tahoe-100 and Data for AI-in-Bio with Vevo Therapeutics and the Arc Institute

Episode Date: February 25, 2025

On this week’s episode of No Priors, Sarah Guo is joined by leading members of the teams at Vevo Therapeutics and the Arc Institute: Nima Alidoust, CEO/Co-Founder at Vevo Therapeutics; Johnny Yu, CSO/Co-Founder at Vevo Therapeutics; Patrick Hsu, CEO/Co-Founder at Arc Institute; Dave Burke, CTO at Arc Institute; and Hani Goodarzi, Core Investigator at Arc Institute. Predicting protein structure (AlphaFold 3, Chai-1, Evo 2) was a big AI/biology breakthrough. The next big leap is modeling entire human cells: how they behave in disease, or how they respond to new therapeutics. The same way LLMs needed enormous text corpora to become truly powerful, virtual cell models need massive, high-quality cellular datasets to train on. In this episode, the teams discuss the groundbreaking release of the Tahoe-100M single cell dataset, Arc Atlas, and how these advancements could transform drug discovery.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @Nalidoust | @IAmJohnnyYu | @PDHsh | @Davey_Burke | @Genophoria

Download the Tahoe Dataset

Show Notes:
0:00 Introduction
1:40 Significance of Tahoe-100M dataset
4:22 Where we are with virtual cell models and protein language models
10:26 Significance of perturbational data
17:39 Challenges and innovations in data collection
24:42 Open sourcing and community collaboration
33:51 Predictive ability and importance of virtual cell models
35:27 Drug discovery and virtual cell models
44:27 Platform vs. single hypothesis companies
46:05 Rise of Chinese biotechs
51:36 AI in drug discovery

Transcript
Starting point is 00:00:00 Hi, listeners, welcome back to No Priors. Today we're here with the CEO, CTO, and a core investigator of the Arc Institute, as well as the co-founders of Vevo, to talk about their release of Tahoe-100M, the largest single-cell drug-perturbation data set ever created, as well as where we are in AI for biology, why we need a virtual cell model and not just protein structure prediction models, and when we should finally expect to see treatments from this growing use of machine learning in bio. Hi, I'm Johnny, and I work on single cell RNA sequencing at Vevo. I'm Nima. I'm one of the founders together with Johnny. I'm a quantum chemist by background,
Starting point is 00:00:46 but I've converted to being a computational chemist that loves playing with biological data, and we're building Vevo to really do that: to predict how chemicals interact with cells in different biological contexts. Some people call it the virtual cell. That's basically what we're working on. I'm Patrick Hsu, one of the founders at the Arc Institute, which is working at the interface of biology and machine learning
Starting point is 00:01:10 to try to understand and one day treat complex human diseases, which are most of the major killers. I'm Dave, CTO at Arc Institute, focused on computation and biology and building novel AI models for biology. I'm Hani. I'm a core investigator at Arc. I work very closely with Dave and Patrick to push our virtual cell initiative. Congratulations, everyone. It's a big day. Let's jump right into it.
Starting point is 00:01:36 What is Tahoe-100M and what is the significance of it? So Tahoe-100M is the world's biggest single cell RNA sequencing data set. It enables basically a ton of machine learning applications, including things like the virtual cell, but it also enables a lot of drug discovery applications. And broadly, in the context of where I think we are as a field, it's kind of the beginning of a different way of doing drug discovery, of basically understanding how to build medicines, and basically bringing AI and machine learning people into the mix.
Starting point is 00:02:08 Maybe something I would add there as well. Over the last 20 years or so, people have accumulated a massive amount of data points when it comes to protein structures, protein function, how drug molecules interact with proteins. But one thing that we haven't had as much is how different cells behave in different contexts and how different genes within each of those cells
Starting point is 00:02:30 actually function in the presence of the other genes, you know, in these different biological contexts. We believe this is the era for that right now. You have seen the emergence of protein language models built on the data sets that have been accumulated over the last two decades. But now is the era for actually having data on cells, how they function, how they interact with drugs,
Starting point is 00:02:49 drug molecules. And exactly as Johnny was saying, Tahoe is really a landmark data set there that allows us to really measure how drugs interact with different cells from different patient models. And that gives us the ability to build models similar to the ones we built as protein language models, but in the cellular context. If you think about it, the history of AI is punctuated by the data sets that come about, right? Think about ImageNet in 2009, which Fei-Fei Li put together, and look at what that did to drive a sort of non-linear jump in machine vision.
Starting point is 00:03:23 I think the hope here is that by producing data sets, particularly perturbational data sets that allow us to elicit, you know, cellular responses, we'll be able to actually drive forward the ability to model at the cellular level, not just at the protein level. And so I think this is one of those moments, hopefully. Yeah, so lots of people have been talking about what those foundational data sets look like for biology, right? And this has been really useful for training protein structure prediction models like AlphaFold, built on CASP, the competition built on top of PDB data. But how do you
Starting point is 00:03:58 do this for cells and cellular dynamics, which is really what tells us about biology and how it responds in health and disease? So I think those are the core steps forward, where we want to bring up our ability to study higher levels of abstraction in biology, not just the individual molecular machines, but how they operate in the context of an entire cell. Congrats also to the entire Arc team. Given you are working on both virtual cell models and protein structure prediction and protein language models, can you contextualize a little bit why we need both and where we are in the progress of each?
Starting point is 00:04:32 I think we're learning that, right? We're looking at these emergent properties of biology by training these, you know, large-scale foundation models on nucleic acids and these virtual cell models that we'll talk more about today. And, you know, we have this debate often internally. I have sort of an engineering and computing background, so the way I think about it is, if you think about the cell, the DNA lives in the ROM, the read-only memory, right? It's the code for the cell. But then the RNA lives in the RAM. It's like the working memory. And the RNA is constantly changing its expression level. It's almost like one of those
Starting point is 00:05:05 1980s graphic equalizers, with like 20,000 bars, one for each gene. And it's constantly adjusting its expression level depending on what the cell is experiencing, whether that's the environment, whether it's stress, whether it's aging, whether it's a disease state or a healthy state. And what we're trying to do with this data, I think, as a field, is create these virtual cell models, which in a way is kind of inferring a notional CPU for the cell. So, like, how does the cell respond to an input?
Starting point is 00:05:32 That input could be, you know, an edited gene. It could be an application of a drug. And then how does that reflect in the transcriptomic profile? And so that CPU is sort of an analogy to the AI model that you want to build. And then once you have an AI model, what's really interesting is you can start posing the inverse question, which is, you know, given a cell in a certain disease state that's exhibiting a certain transcriptomic profile, how do I perturb that cell, whether that's a gene edit or that's a drug, to perturb it back into that healthy state?
Starting point is 00:06:01 And I think that's what's really exciting about this data, which then enables these models, which then enables these tools and hopefully can accelerate drug discovery. And one thing I will quickly add to that: when we think about different domains in biology and building AI models of those domains, there are parts where we are data poor and parts where we are compute limited. When it comes to, for example, DNA language models, again, thanks to the field and decades of having sequenced a ton of genomes, we are not as much data limited; but compute, specifically context length, how much DNA we can actually consume, what size of inputs, and all of that, is actually a big limitation that we have tried to solve.
Starting point is 00:06:52 But when it comes to cell-state models, that is an area where we are absolutely very much data limited, because being able to profile cells at single-cell resolution is basically a new technology. It has, you know, emerged over the past decade, with the real explosion over the past five, six years, and we are just getting to the point of generating that kind of data at scale. And it's not just the scale. I think the idea here is that,
Starting point is 00:07:27 before scBaseCamp, which is the data set that's being released together with Tahoe on the Virtual Cell Atlas, created by the Arc folks by basically collating all of the publicly available data, the number of human cells that had been collated together was on the order of 45 to 50 million if you're generous, really more like 16 million single-cell data points.
Starting point is 00:07:49 But the scale is one thing. The question is also, you know, how much information content there is in this data. Quality, yeah. And are they coming from very different biological contexts? We actually built, you know, early versions of some of those virtual cell models. We call them single-cell foundation models,
Starting point is 00:08:06 or whatever name you actually want to use for them. And what we saw is that if you take those 16 million and downsample by even 99%, so you use just 1% of that data to train your models, the models' performance doesn't actually drop that much. So it means that the information content of the data you're using for training those models is not amazing. So having data that comes from very different biological contexts,
Starting point is 00:08:33 that's very key in providing the information content for the models so that they can learn. And that goes back to what Dave was saying about the perturbational data sets. Perturbation allows you to create new contexts, allows you to create new cell states that the model can then learn from and therefore be used for different types of applications. And then I'll let Johnny talk later about the challenge of creating these perturbational data sets. Before we go there, actually, can we zoom out for a second and just have you describe, in layman's terms, what the data actually tells you and where the prior data came from,
Starting point is 00:09:08 even if it was information poor. If you look at the data that's been generated over the past decade, it's basically all kinds of academic groups like us, or some people in industry, generating all these little data sets. And there's a ton of problems with this. First, there's batch effects. So even for one person running an experiment on two different days, their data doesn't look the same, even if it's the same cells. So when you think about trying
Starting point is 00:09:30 to build the internet of biology, which is what you need for this ChatGPT moment. In terms of scale. In terms of scale, right, because you need big data. Machine learning is not going to do anything for us if we don't have big data. You have a data set that's poorly labeled, that's super batchy; it's maybe moderately useful for AI, but it's not there. And so this data set is basically doubling the size of all the data that's out there
Starting point is 00:09:56 cumulatively over the past decade. It covers 50 different cancer models from different patients, so cells from 50 different patients, and 1,200 drug treatments. So it's a really deep and rich data set that effectively has no batch effects. And so we think this is actually not only an additional data set for machine learning; we actually think it's the first data set that's going to enable machine learning in this space. One thing that might be worth touching on is, you know, why perturbational data, right? And I think the key is that we're going from correlation, which is what a lot of biological research is. It's descriptive, right? You kind of stare at things,
Starting point is 00:10:32 you try to see, when you poke this way, what else is changing, and go from associative changes to causation, right? And that's where genetic or chemical perturbations allow you to have a very clear before and after, where you have the set, you know, of causal changes that can actually drive a particular cell state. The key is to be able to do this in a generalizable way, so you can look across many different cell types, many different tissue types. For an ML model to learn a general sense of cell-state possibilities, we need to train on that diversity of data as well.
Starting point is 00:11:12 I mean, in a topological sense, what the model is trying to do is create a manifold in a high-dimensional latent space. And to actually explore that manifold, the model needs to see lots of different perturbations and responses. And once you do that, you have this generalized manifold that allows the model to make predictions for data it hadn't seen in its sample but that still fits the manifold. To make it even more tangible, the data that was available publicly before this,
Starting point is 00:11:37 almost the entirety of it comes from healthy tissue. Very little comes from actually diseased cells. And almost all of it, if not the entirety, is observational, in the sense that you take, you know, cells from a liver sample and you do single cell RNA sequencing on that. And that basically has the limitation Patrick was talking about: does it capture the causality of the gene-to-gene interactions you're trying to model? And the second piece is, does it allow you to model how a new perturbation will actually impact the cells, whether it's a genetic perturbation or a drug perturbation? That, really, is the focus for Tahoe: perturbational data sets.
Starting point is 00:12:14 So in that sense, with Tahoe, I think, when you put all of the perturbational data sets in the world together, if you're generous, it's like one to two million single cell data points. I mean, this is publicly available data. We don't know as much about, you know, what's inside different organizations. The public level is 2 million; Tahoe is 100 million. So we have basically increased that massively. Now, when you couple that with this huge amount of, you know,
Starting point is 00:12:38 observational data sets from different species that are in the world, which is basically what the Arc folks did. They put together the entirety of that data, and it turns out to be 200 or 230 million single-cell data points already out there, and they have tried to reduce as much as possible the variations between these data sets, so they're consistent with each other
Starting point is 00:12:54 so you can train machine learning models on them. That's the significance of this data set. I want to make a finer point on this, right? I think the key is, if you want a model that can learn about changes going on in the heart or in the brain or in the liver or in the bones, you need to be able to train across all those different cell types. But if you just look at normal, healthy cells, you wouldn't necessarily learn about how the manifold in latent space changes in disease, right?
Starting point is 00:13:19 And so being able to look at many different tissue types across different cancers is one way to get at those really, you know, critical disease states that both basic science and drug discovery really care about. How should we think about 100 million data points, or 230 million data points, and the scale of this release in terms of where we are? Is that enough to be useful? What do we know about scaling laws now? Short answer: it's a very, very hard question. We won't know until we get there. What we can draw inspiration from is basically large language models, you know, in human language,
Starting point is 00:13:57 and also things like DNA language models, where we do have enough data to do scaling laws. And where we are around there, you know, around one trillion training tokens is where you want to hit, by and large. GPT-3 was, I think, half a trillion tokens. ESM3 was 700 billion tokens, so close to a trillion. Yeah, so a trillion sounds like a comfortable mark to hit. So then the question
Starting point is 00:14:25 becomes, how do you count tokens? Because, you know, cells in the end are not exactly sentences. But if you count genes and their expression as tokens, I think this collection that we have put together gets us close to where we want to be to start asking and answering those questions, actually. So I think, you know, it puts us at a few hundred billion training tokens for the kinds of model architectures that we have now.
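The token accounting being described here can be sketched with simple arithmetic. This is only an illustration, assuming the one-token-per-gene-and-expression scheme the speakers use and the 2,000 to 5,000 genes-per-cell range they cite:

```python
# Back-of-envelope token accounting for single-cell training data,
# assuming each gene and its measured expression level in a cell
# counts as one token (the accounting described in the conversation).

def total_tokens(n_cells: int, genes_per_cell: int) -> int:
    """Tokens contributed by n_cells cells at a fixed genes-per-cell count."""
    return n_cells * genes_per_cell

TAHOE_CELLS = 100_000_000  # Tahoe-100M: ~100 million perturbed single cells

low = total_tokens(TAHOE_CELLS, 2_000)   # 2,000 genes captured per cell
high = total_tokens(TAHOE_CELLS, 5_000)  # 5,000 genes captured per cell

print(f"{low / 1e9:.0f}B to {high / 1e9:.0f}B tokens")  # 200B to 500B tokens

# For scale, the figures quoted in the conversation: GPT-3 ~500B tokens,
# ESM3 ~700B, with ~1 trillion as the comfortable mark to hit.
```

At the low end this matches the few-hundred-billion-token estimate above; a substantially larger cell count would be needed to push the corpus to the trillion-token mark.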
Starting point is 00:14:54 Think of a cell, for this data set, as a collection of 2,000 to 5,000 genes, where each gene and its expression is basically a token in what we're doing. So 100 million single cell data points is akin to 200 to 300 billion tokens. Now, there is a finer point there, which is how many of these tokens are actually informative to the model. I'm not asking this question the correct way, but you will understand the gist of it: how do you decide where in the genetic landscape to start? How do you choose perturbations?
Starting point is 00:15:27 I think you want to match, and this goes kind of the same with drugs, your perturbation toolkit, which is like the kind of arrows you throw at the biology, against the biology you have. So for cancer, that means going after cancer-relevant genes, genes that impact the growth of cells, genes that impact DNA regulation, and also drugs that target key cancer pathways. So this data set works for cancer-relevant questions, but even though it's heavily based around these kinds of chemical perturbations
Starting point is 00:15:56 and cancer, these pathways are so conserved and fundamental that they broadly apply to the neuroscience space, or just to immune cell development in general. So I think it's really the foundation model that's going to be able to take this data, ingest it, build a model, and then understand basically how to translate that data to a different context entirely. Yeah. So this is the key, I think, one of the really special things we have at Vevo: its mosaic platform.
Starting point is 00:16:28 So it allows us to take cells from many different patients. And in cancer, this means all kinds of cancers, lung cancer, pancreatic cancer, et cetera, from different patients, each with their own particular genetics, and pool them together into a single mosaic tumor, which we can then reproducibly screen hundreds or thousands of drugs against. And so this key innovation basically allows us, instead of testing one cancer model at a time, to test tens or hundreds. And it makes this a really scalable data generation platform.
Starting point is 00:16:57 This is what we used to generate Tahoe-100M. When we think about how we actually build these pools in terms of information content, we want to maximize it by covering a lot of cancer patients. So for this data set, we covered the biggest cancer types by how frequently they occur annually. But then as we continue to grow this data set, we want to think about rare disease, bring in maybe more coverage of different parts of the cancer space, informed by the machine learning that will basically help us fill in the gaps in the foundation models. Another direction is chemical space. So I think if the
Starting point is 00:17:32 question is about, you know, how do we prioritize: frankly, when you can generate 50 times more perturbational data, available in five weeks, than the data sets that have been generated over 10 years, you don't have to prioritize as much. And that's the beauty of it, in my opinion. You know, you can go large on the chemical space. You can go large on the patient sample space. And that way, you don't have to really come up a priori with a hypothesis about, you know,
Starting point is 00:18:03 what is it that I have to feed the models. You can just generate as much as you want. You can be more unbiased as scale increases. Exactly. Hypothesis-free, unbiased kinds of, you know, data generation. That's really, I think, the beauty here. Yeah, let the data surprise you.
Starting point is 00:18:17 Let the data surprise you, exactly. And this is one of those things that I'd like to talk about as well. The people we have here are the representatives of the new generation of biologists. But I think one thing that has been slowing progress in bio is the fact that we have always been super hypothesis-driven. And the reason is that a lot of these experiments are expensive, you know; they take a lot of time, a lot of resources.
Starting point is 00:18:45 But I think now is really the time: sequencing costs have gone down, single-cell sample prep costs have gone down, compute costs have gone down. I think it's also time to change that kind of mentality in bio and be a little more courageous, you know, a little more freewheeling in terms of your data generation and the kinds of samples you put together. So, yeah, this is a view from an outsider there. I want to talk about being more ambitious in bio and the open-sourcing of this in a second. But I think we should just zoom out and talk
Starting point is 00:19:15 about in layman's terms what the platform does, and you can correct me if any of this is wrong. So you have these tumors that are a mosaic of cells from different patients representing a huge amount of patient genetic variation. And each mouse then can actually be treated with different drugs where the signal you extract after is the interaction of drugs against each.
Starting point is 00:19:45 of these different patient types. That's right. Okay. Nobody else thinks this is crazy? It's not crazy, because it's happening every day, but it's really science fiction, honestly.
Starting point is 00:20:00 Great, I'm just trying to boil it down to a very, you know, simple non-biologist's understanding. When you say it's a platform with this super tumor where you can pull all of this data out, it is wild to think about how efficient that is in comparison to, well, we will observe, you know, one patient type at a time. I think this is actually a super
Starting point is 00:20:20 interesting point. If you map the number of tokens per experiment across the last 50 years of biomedical research, it'll look like the hockey stick that, you know, all investors and founders really know and love, right? Just going up and to the right. And I think the way that we think about doing science is changing based on this. And there's, I think, a roiling discussion today about hypothesis-driven versus hypothesis-free research. Should we be doing mechanism versus large-scale profiling? But honestly, I think this stuff is going to wash out with scale. Exactly. You don't have to choose between those.
Starting point is 00:20:50 Yeah. Maybe that's my hot take with this era of machine learning and biology is the vast majority of mechanistic data that's been generated to date is really made to ask very specific, very well-scoped questions and just, you know, way more tokens per experiment
Starting point is 00:21:08 is just going to be the way to do it. I mean, maybe I can say it another way. I think in biology, what we have done is we have treated humans as the foundation models that ingest information and come up with hypotheses, right? But now we actually want to go beyond that, because humans, of course, come with their own, you know, intuitions and biases and all of that.
Starting point is 00:21:33 At UCSF, for example, we often say that we use some of our medicinal chemist folks, like Kevan Shokat, as kind of the last layer of a neural network, right? They have built this intuition of, you know: this chemical that I generated via this AI model, does it actually look like something that is real? Yeah, and they can't even, like, verbalize why they think it might be a good drug or not. People criticize these models for hallucinating.
Starting point is 00:22:03 But if you think about it, the process of scientific research just involves hallucination. Right, that's what creativity is. Yeah. So you're all adherents of Sutton's bitter lesson in this field as well: the intuition being baked into the models or the process is not the right thing; we just need to scale data. I mean, at least we hope that you don't have to make that choice here, you know. We're seeing evidence of scaling laws in biology across proteins, right?
Starting point is 00:22:31 That's been shown in the protein language models, and across DNA, which is what we've shown in our Evo series of models from Arc. We're also seeing inference-time scaling laws in our most recent study. So there are sort of early signs of promise, although, you know, we'll need good benchmarks and we'll have to look at this across different data types over time. And the funny thing for me is that if you've been in this field for long enough, and I come from the quantum and computational chemistry side of things, every time you take a certain success from field A and you want to translate it to field B,
Starting point is 00:23:05 a lot of people, including in our own organizations, come up with a list of 100 different reasons why the learnings from field A are not applicable in field B. But then you get surprised every time, and then the next time, when you're trying to do the same thing from B to C, you know, the same kind of list starts emerging. In a way, I think something that's underappreciated is that those same models that learned human language are learning the language of structural biology. And then with the Evo work, they are learning the language of DNA, you know. And this is incredible.
Starting point is 00:23:42 I think it's not trivial. And by the way, again, if you have been in the field long enough, you know that there were a lot of people saying, no, no, these protein language models are never going to work; you need to have domain-specific kinds of models to model these kinds of phenomena. So in this way, I think this is really the ethos that we have to bring here. When Hani was saying that we should use the learnings from those models to actually translate here,
Starting point is 00:24:07 that's exactly what we should be doing. We should be thinking about what worked and at least try it in these new domains. The domain we are talking about, the domain Vevo is excited about, and also the virtual cell part of Arc is excited about, is the language of systems biology. The first thing you should be doing is to try out the things that worked in the other domains in this domain. Maybe it works, maybe it doesn't, but if you don't try, you'll never know. Music to my ears, given this is one of the only things we have really strong conviction about, at least at the fund we invest out
Starting point is 00:24:35 of: that, you know, a bunch of these techniques work and they scale in domains where people are not sure yet, quite generally, where we wouldn't have the expertise in the traditional types of discovery and company building. But actually, they seem to apply very generally, right?
Starting point is 00:24:58 I think this is a great segue to a question: you're open-sourcing the data. Why do that? Yeah, so we generated the data at Vevo, and Vevo is a private venture-backed company, a startup. And so, Johnny and I, when originally the idea of Tahoe came up and Johnny told me, yeah, there is this opportunity, we can generate 100 million single-cell data points, I said, like, can we?
Starting point is 00:25:28 And he said, yeah, yeah, we can. And I said, okay, let's go and do it. And I think it was within hours, when we were chatting, so for transparency, Johnny, Hani, and I are co-founders of Vevo, when we were talking about it, we said, okay, let's do it and let's open-source it. And why do we want to do that? Number one, we want to put a new stake in the ground.
Starting point is 00:25:48 We want to show that there's a new game in town, and that it's really possible to up our game as a community, as a field. And we wanted to show that so that people actually move on from these one-million or 100,000-single-cell-data-point observational studies, up their game, and actually go to a much more massive scale. So that's number one. Number two, the DNA of our company is to be very, very small: a small team of superstars rather than, you know, hiring 100 people.
Starting point is 00:26:20 Paradoxically, open-sourcing actually allows us to do that. I think we talked to Dave about Tahoe the night before New Year's, sometime between Christmas and New Year's. And then Dave got really excited about it. And then the whole Arc team got excited about it.
Starting point is 00:26:47 If there wasn't the open source aspect to it, it wouldn't have been as exciting, you know. The whole community is getting excited about, you know, playing with this data, telling us what's good about this data, what's not good about the data. And that basically allows the team of three, four people that we have in-house to stay that size and bring in the entire community of like-minded people who have the same mission of building virtual cells to help us, you know,
Starting point is 00:27:04 in this quest. And for us, the idea was: we will remove the main bottleneck in doing that. And that, as I think everybody has been saying, is data. And, yeah. I think the serendipity for this was, you know, Arc is all about mission-driven science and pushing science forward. And we were conceiving of creating what we're launching this week, the Arc Virtual Cell Atlas. And so the idea there is really, can we, you know, find high-quality curated data sets
Starting point is 00:27:33 and put them out there in the world to accelerate virtual cell modeling. And then we started chatting, and it was like, you've got what? And it was kind of incredible. And so what we're actually assembling, you know, this week is this new Atlas. And so, you know, sort of the star of the show in some ways is the Vevo Tahoe-100M data set. We're also augmenting that with observational data. So we've created something called scBaseCamp. And you can almost think of it like the Google crawl or an index.
Starting point is 00:27:58 So we've built this agent that goes onto the Internet and basically mines public single-cell RNA sequencing data and then curates it in a very sort of uniform way, and it results in a very nice observational data set. It's about 230 million cells. You add that to the 100 million cells from the Tahoe-100M, and you now have 330 million cells. And so this is a really exciting resource for scientists around the world who are interested in modeling at the cell level.
Starting point is 00:28:25 And it's just very complementary, you know, to have this observational data set that you could possibly, you know, pre-train a model on, and then the perturbational data set from the Tahoe-100M, which allows you to then bring in those dynamics and make the model richer and more predictive. We're super excited about AI agents for science at Arc, and I think across the community. I think the capabilities are still very early today, but I think we wanted to show an example of how it can do something really useful. I think it's very clear now that basically all dry lab workflows are going to get automated with agents or with co-pilots, and, you know, this would ordinarily be the type of thing that a team of computational biologists would be slaving over, right?
Starting point is 00:29:07 And our core insight was, well, you know, the Sequence Read Archive is the largest sort of repository of all biological data from next-generation sequencing, right? You get an NIH grant, for example, right? You know, you sort of post all of this data online, or you publish in a journal and you put all this data as part of the journal publication. But this is extremely fragmented, poorly annotated, really sprawling.
Starting point is 00:29:34 There's no requirement that your submitted data be uniform. Exactly. It's very messy. And so, you know, we built this agent to basically crawl all of this data, collect it, organize it, process it. And in doing so, it basically isolates and, you know, kind of removes a lot of the batch effects or data biases of previous methods. Yeah, I mean, one thing I would add is that the reality is that these data sets have been generated over time, you know, going back a decade.
Starting point is 00:30:07 So tools have changed, you know, versions of tools have changed. Genome builds have changed. So by just taking processed datasets and collating them together, you are kind of infecting and contaminating your data with these analytical effects, batch effects. So our idea was to... These are like foundational data sets for the entire field, right? People work with and interpret and, you know, write papers on top of all of this data. Yeah, so exactly. I mean, our idea was to at least remove that.
Starting point is 00:30:39 I mean, there are a lot of kind of technical, experimental batch effects, but of course, over a span of this time, like, you know, chemistries of reagents have changed and all of that. But at least we can do our part and remove the analytical component. And we were actually surprised at, you know, to what extent that was actually observable in the data, and removing it was actually quite helpful. Maybe on the Vevo side, this whole idea
Starting point is 00:31:05 of this infection of the data sets, because of these massive batch effects, like, this is the phrase. Maybe Johnny, you want to talk about how many people actually did the experiment? This is, like, the Tahoe experiment. Well, it ended up being actually four people from Vevo. And we did it, I think, over like three days.
Starting point is 00:31:26 In the end? Think about the leverage. That's kind of nuts. Yeah. You know why that's super important? It's because sometimes I ask, like, Hani and Johnny, like, I don't know, what does drug A do to cell line X? And there's this phrase that biologists use: in our hands.
Starting point is 00:31:41 It does so-and-so. And this is what, like, I mean, Dave, you tell me, we come from a different background. As computer scientists, we don't say, like, in my hands, this model actually works. On my computer. In my environment. It's kind of a thing. There is actually a parallel, but it's not great either. Exactly.
Starting point is 00:32:01 So I think that's the genius of, I think, what Johnny has built there as well. That, you know, this is actually done by very few hands. Automation is going to scale it to a certain level. You haven't even done much automation. And so in that sense, the beauty of what Johnny designed in building Tahoe is exactly this: a few people, few hands, doing exactly consistent work, doing 60,000 experiments. It's 100 million single-cell data points, but it's actually 60,000 drug-patient interactions, drug-cell-line interactions.
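The "analytical batch effect" point from a few turns back, that the same raw counts processed by different pipeline versions can look like different biology, can be sketched in a few lines of Python. Everything here is invented for illustration (toy counts, made-up pseudocount defaults); it is not Arc's or Vevo's actual pipeline:

```python
import math

# Hypothetical illustration: the same raw gene counts, log-normalized by two
# different "pipeline versions" (different pseudocounts), look like two
# different datasets even though the underlying biology is identical.
raw_counts = [0, 5, 50, 500]

def log_norm(counts, pseudocount):
    """Toy log-normalization: scale to 10k and take the log."""
    total = sum(counts) + len(counts) * pseudocount
    return [math.log((c + pseudocount) / total * 1e4) for c in counts]

old_pipeline = log_norm(raw_counts, pseudocount=1.0)   # e.g. an older default
new_pipeline = log_norm(raw_counts, pseudocount=0.5)   # e.g. a later default

# The per-gene differences are nonzero and gene-dependent: an "analytical
# batch effect" that has nothing to do with the cells themselves.
diffs = [abs(a - b) for a, b in zip(old_pipeline, new_pipeline)]
assert max(diffs) > 0.1   # low-count genes shift a lot
assert min(diffs) < 0.01  # high-count genes barely move

# Re-running every dataset through one uniform pipeline removes it exactly.
reprocessed_a = log_norm(raw_counts, pseudocount=1.0)
reprocessed_b = log_norm(raw_counts, pseudocount=1.0)
assert reprocessed_a == reprocessed_b
```

The gene-dependent shift is the key point: it cannot be fixed by a global offset, which is why reprocessing everything from raw data through one uniform pipeline matters.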
Starting point is 00:32:31 And so, having been done with four people, I think that just reduces the infectious aspect of data set infection that Johnny was talking about, massively. So it's a first-in-history opportunity for scientists and entrepreneurs to go work on this data set and create these virtual cell models. How do you tell the quality of one of these models? I mean, the core idea is its predictive ability, right? And so, you know, you take a cell, you perturb it. You can do that either, from a genetic perspective, by suppressing or upregulating genes, or by applying drugs, and then you look at the response.
Starting point is 00:33:12 And so the measure of the model is how well it predicts what we call the differentially expressed genes. The reality is, today the best models are very poor at this. The predictive ability of the DEGs, as we call them, is on the order of 10%. And one of the... Is there an accepted benchmark for this today? No, but actually I think that's something else that the industry would benefit from. It's a good point. But, you know, if you think about where we want to go, one of our conjectures is that one of the reasons the models aren't doing well is not simply model structure. We have a lot of rich structures that we
Starting point is 00:33:49 understand in the ML space, the issue is the data quality. And so the hope is with this New York virtual cell atlas with the Tahoe 100 that we now finally have a starting point where we can build rich models and get high predictive value of these virtual cell models. So that's why this is really kind of an exciting moment in time. It might be worth just also speaking plainly. Why do we even care about virtual cell models? We have real cells, right?
Starting point is 00:34:16 Why not just do experiments on those, right? And I think ultimately, biology is very slow, right? You know, all of us in this room, and many of you watching, have probably tried to pick up pipettes and move clear liquids from one tube to another and grow cells and make animals and deal with biology, which happens in real time, right? So, you know, and this is a funny story. In the last year of my PhD, my advisor tried to convince me to start an aging project, right, which would have involved, you know, aging animals for, you know, like two years.
Starting point is 00:34:49 You know, and that's sort of one experimental round. As you can imagine, I declined. I was like, you know, may I please, sir, graduate. But that's actually what happens, right? It's actually just our labor retention. Right, right. You're constrained by biological time, which is like completely crazy to me
Starting point is 00:35:07 coming from an engineering background. And really important to tons of fields like neurodegeneration or anything else that takes time to progress, right? Yeah, so, you know, the sort of massively parallelized in silico simulation sounds great, but it needs to be accurate. If it's 10% accurate, you're just simulating noise, right? And so, you know, how do we go from, you know, a discipline
Starting point is 00:35:30 that primarily respects experiments today to something more like physics where theory drives a lot of progress? And I think these virtual cell models are a core wedge in making that. Well, can you actually make that more concrete, then? Like, if these virtual cell models work, and, you know, we don't even know how to measure them yet
Starting point is 00:35:46 Maybe I can talk from a drug discovery perspective, and then the Arc folks can speak from the more scientific viewpoint. So what we are focused on at Vevo is to predict how a new chemical entity interacts with cells from different patients or patient models. That really is the core of it. So Patrick was talking about in silico simulation of this. Can I predict in a computer whether this new chemical structure, drugs are chemical structures, by the way,
Starting point is 00:36:26 I hope you're not surprised by that. Whether this chemical structure is going to take the diseased cell, like a cancer cell, from a diseased state to a healthy state, or, in the case of cancer, actually kill it, literally. If I can predict that, then my ability to design new chemicals that do that effectively, that kill the cancer cell but don't kill the healthy cells, et cetera, increases massively. And that's what we want to do. And literally, that's the kind of data we are generating to train those kinds of models.
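Tying back to the "predictive ability of the DEGs is on the order of 10%" point from a few minutes earlier: a toy version of that kind of score, purely illustrative, with made-up gene lists and no real benchmark behind it, can be as simple as an overlap between the predicted and the measured differentially expressed genes for one perturbation:

```python
# Illustrative sketch (invented gene names, not any published benchmark):
# score a virtual cell model by how many of the measured differentially
# expressed genes (DEGs) for one perturbation it also predicted.

def deg_overlap(predicted, measured):
    """Fraction of measured DEGs that the model also called (a recall)."""
    predicted, measured = set(predicted), set(measured)
    return len(predicted & measured) / len(measured)

# Hypothetical perturbation: drug A applied to cell line X.
measured_degs  = ["MYC", "JUN", "FOS", "EGR1", "TP53", "CDKN1A",
                  "GADD45A", "BAX", "MDM2", "BBC3"]
predicted_degs = ["MYC", "HSPA5", "XBP1", "ATF4", "DDIT3"]

score = deg_overlap(predicted_degs, measured_degs)
print(f"DEG recall: {score:.0%}")  # prints "DEG recall: 10%"
```

Real evaluations would also weigh effect sizes and directions, but even this toy form shows why a number like "10%" is sobering: only one of ten measured response genes was recovered.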
Starting point is 00:37:00 Anything to add? Yeah, I completely agree. I mean, a big part of our future vision and roadmap is that we think there will be a moment where, from a virtual cell model, a drug is spit out. And basically the drug will actually cause a diseased cell to become a healthy cell again. I think that's kind of the goal, and that will reshape how we do any kind of drug discovery. One thing I will add there is that there are two dimensions of generalizability to think about. One is basically a cell kind of dimension, and then the chemical dimension. On the cell side, you know, every disease is unique. There are similarities.
Starting point is 00:37:40 There are truncal cancer mutations and all of that that drives the disease, but there are also very much individual variations. And you can observe cells from patients, but you cannot do, for every patient, for every tumor that arises, what, you know, these folks do in Mosaic. So the idea is that, using a virtual cell model, you can take those learnings and then apply them to all of these new observations that you can make in patients. So that's one dimension. The other dimension is chemicals, you know, kind of in silico libraries,
Starting point is 00:38:16 you have like tens of millions of compounds, and biologics, you know, infinite biologics, if you really put your mind to it. But most of these have never existed and will never exist, because there's no use for them. So a model that can traverse that really massive space of, you know, chemistry, to find which part of it you actually need to pay attention to and go and synthesize and check, will be massively enabling. Because everyone else, you know, has well-behaved libraries, you know, a couple of hundred thousand compounds, and they use fragments and try to put them together. So the process of how, you know, folks design drugs
Starting point is 00:38:57 is this slow screening process. And this will allow us to really leapfrog that entire pipeline. 90% of drugs fail in clinical trials. So, you know, we're pretty bad at making drugs, right? And I think that implies two things. The first is maybe our drug matter is not very good, in the sense that its potency, its ability to bind the target, its, you know, toxicity, its pharmacokinetic profiles, all of those things, right? You know, sort of ADMET, you know, these types of things are not optimal. The other is we're probably drugging the wrong target, right? And I think, you know, the sort of
Starting point is 00:39:45 idea of these virtual cell models is that you'll be able to significantly cut down the search space of what the right target is, and then you can actually, you know, really focus your time on making the right chemical, or, you know, the chemical matter, the drug composition, to actually make the right types of changes in the right types of cells, right? That's why mechanism and drug discovery are so, like, tightly interwoven, and, you know, that's really what we need these models to help accelerate. And this is super important, because this is the gist of why we need virtual cells in addition to the protein language models that everybody has been talking about.
Starting point is 00:40:22 I think I said it before, that protein language models speak the language of structural biology. What does a protein structure look like, and how does it fold? How does it interact with the... How do you dock a ligand? Exactly. A small molecule drug. Exactly. Or how does an antibody bind to another protein?
Starting point is 00:40:39 This is a binding question. It's binding in the sense that, you know, you are trying to see whether one chemical binds to another chemical. But biology is more complex. And again, I'm a computational chemist, I'm a quantum chemist. I wish, and I actually bet my PhD on building quantum mechanical models that, from, you know, a physics-based perspective,
Starting point is 00:40:59 go and simulate these kinds of bindings. But again, it turns out biology is a lot more complex, and there is a context to that protein target that we are trying to hit. You know, it's part of a cell; the cell, for cancer, is part of a tumor; the tumor is part of a broader biological system. So virtual cells, in my opinion, are going to allow us to go beyond the language of structural biology and venture into the language of systems biology, and understand how the drug is interacting with the broader biological system, rather than simply just the one target that we are basically already cracking the code on with protein language models.
Starting point is 00:41:37 Well, then I have a higher-level systems question. We're at single cell. Like, what about multi-cell and aggregates and organoids? And, you know, is all that going to be possible in the future? Yes. I mean, I think, like, the first thing on the virtual cell, you know, direction, or any modeling, is, like, what's the right level of abstraction? And so I think our belief around the room is that the right level of abstraction is at the sort of
Starting point is 00:42:02 transcriptomic level, because you have these very complex gene pathways, and so whenever a cell is changing or reacting to its environment, it will be reflected, and is reflected, in the transcriptome. So I think that's the first question: even within a cell, what's the right abstraction? Because, like, if you think about a cell, it's like this very exquisite piece of machinery, and, you know, you could make it an arbitrarily complex model, but we believe this sort of genetic level is the right level to model. I think going beyond that, yeah, you can create
Starting point is 00:42:32 very advanced models. I think you see people doing spheroids and organoids. So you take mixtures of cells and run them together, and you try to simulate, say, you know, cardiac tissue or brain tissue. What's really interesting is, you know, maybe you have an organoid with, you know, 20,000 cells. You can then still apply these techniques that we're talking about, like take these drug perturbations and apply them to these cells
Starting point is 00:42:57 or these genetic perturbations, and look at the responses. And so what's happening now is you're going beyond a single cell, but you're sort of getting the intercellular dynamics captured as well in the models. But I think it just naturally ladders up from single cell through to these sort of more multicellular models. One small comment on that one is that it is a single cell that we are modeling, but that context dependency
Starting point is 00:43:21 also captures a lot of the effects that arise from the environment. So the models that we have are actually in vitro models, in this specific experiment for Tahoe. We also have in vivo models, humanized mice, that, you know, capture some of the immune system of the mouse. So in a way, yes, you are simulating, you're building an in silico model of a cell, but if a model is any good, it can simulate it in different biological contexts: in the presence of this kind of immune environment, in the presence of a tumor versus this other kind of tumor, in the presence of this mutation
Starting point is 00:43:51 versus other mutation so it's we call it single cell but the whole idea of having so many single cell data points is that you have it in different contexts and yeah yeah yeah that seems really important nuance there. Yeah, the information of the environment is filtered through the cell. So if you're observing the cell with enough resolution, you can even predict. It should be represented in the model. You can also add spatial data. Oh, yeah, definitely. Okay, I have a few hot take questions to end with.
Starting point is 00:44:19 Nima, I will start with you, because we were having a passionate discussion about why it was really important to you that Vevo be a platform company versus a single-hypothesis company, like, you know, 99.9% of biotechs out there. What is the difference? I think the difference is the kind of team you build and the ambition that you have, you know.
Starting point is 00:44:49 A single-hypothesis company is basically the idea, tying into the human-being-the-foundation-model point that Hani was talking about, that we come up with a hypothesis and then we go test the hell out of it in different kinds of experiments. And we basically are very heavily incentivized to, I mean, a company that's built on that hypothesis, they're very heavily incentivized to make that hypothesis work. What you see actually in biotech a lot of times is that you take a drug to the clinic
Starting point is 00:45:12 after you have tested it on three different patient samples, you know? If you actually are a platform company, what that means is that what you're trying to do is to have enough hypotheses, and to have such a hypothesis-free way of generating new hypotheses, that it doesn't make you wedded to one hypothesis, and therefore it allows you to be actually
Starting point is 00:45:32 a lot more scientific in your quest for new drugs or new targets to treat disease. I think that's the core of it. We had a lot of hypotheses initially to go after, and we could have just built, you know, a one-asset, two-asset kind of company. But we decided to be a platform company, because it allows us to be a lot more rigorous in terms of what we actually decide to take to the clinic.
Starting point is 00:45:50 There has been a lot of news recently on a different question, which is the rise of Chinese biotechs. For the core members of the research community here, is that a threat? How do you think of it? Well, their cost basis is definitely more competitive, right? I think a lot of the discussion around the water cooler in the biotech and pharma industry is, you know, how are they able to do it at this pace? How are they able to do it at this cost? Why do their data packages look so good, right? They have safety, they have tox, they have all these IND-enabling studies. You know, it's really competitive. And
Starting point is 00:46:34 I think folks got really surprised at the efficiency of the pipelining and the ability to manufacture all these different antibodies, primarily. And I think that's great for the industry, right? I think everybody, including patients, investors, you know, the biotech companies themselves, wants a lower cost basis, right? We want the ability to actually make molecules that work, faster. And I think all these things will, you know, kind of compete, right, in the system, to be able to reduce the right-now pretty high cost basis of doing these things, you know, stateside, right?
Starting point is 00:47:14 I think one of the core challenges right now is we have a wide array of services and, you know, CROs and contract research collaborators that you can try to chain together. Previously, the virtual biotech was a concept that was very much in fashion, right? Folks found out in reality, when you try to do this, even though it looks really good on paper, it's incredibly slow, right? So then folks tried the other way, which is, let's just fully vertically integrate and
Starting point is 00:47:41 just own everything. Well, that was incredibly expensive, right? And obviously the answer is maybe more Goldilocks, in the middle. We need really competent vendors and CROs that understand the drug discovery and development process, and we need the individual companies to be able to run in a really capital-efficient and lean way. I think the industry is trying to reshape around these changes right now, to figure out the right way to build startups, the right way to build drugs. Yeah, I totally agree. I think it's an important moment. I think one thing that I haven't
Starting point is 00:48:17 seen is that we actually acknowledge it. Like, it just kind of hit us in the face. And I think it's because the U.S. is the innovation hub, but I think we need to basically be more intentional about that in biotech. I think you see innovation in tech; you see that as kind of the mantra. I think innovation in biotech has actually been viewed as kind of the thing that the Chinese CROs and companies are good at. I think what we're finding out is that that's not actually innovation.
Starting point is 00:48:44 And so my hypothesis is that the kinds of things that we're working on, really putting big data and AI into kind of the first layer of how we do biology, that's what innovation should look like in our space. And if we don't, as a community, push that forward, we're not going to have that innovation in the industry. And Johnny is saying it slapped us in the face, like it caught us by surprise. But actually, one of the first conversations that Johnny and I had three years ago, when
Starting point is 00:49:10 we were thinking about starting Vevo, was actually, he was telling me about this thing that's happening in China as well, and this whole thesis around commoditization of a lot of the things that we think are so massively important, you know, like molecular design, et cetera, et cetera. So I think in that sense, I do agree. And there are two ways to do it. One is regulatory capture: try to lobby the government and everything to, you know, put limits on how much we can interact with the Chinese companies. Here's the other way: make it part of our ecosystem, and change our thinking about business models, the way we build
Starting point is 00:49:43 our teams, to Patrick's point. You know, do we build a fully integrated team with $100 million in the bank, or a small 14-person team like we are at Vevo? I think these are the kinds of things we should be thinking about. And I actually want to make this into a bigger statement that's a little more Reagan-esque: I think it's morning in bio, in a way. You know, we should be playing a different kind of game here. And if you want to stick to the same old-school way of doing things,
Starting point is 00:50:11 it's not going to work. The old-school way is what? It's a lot of planning, you know. If I had a cent, and we were texting about this with Dave a couple of years ago, if I had, I don't know, a penny for every time some massive organization announces this extraordinarily impressive thing and they say, oh, we are going to do it in three to five years, honestly, I would be super rich right now. This is the ethos in bio.
Starting point is 00:50:34 You announce this massive thing and you say you're going to do it in three to five years. No, I think it's the time. We have the tools. It's the time to build, and it's the time to do it right now. That's the way Evo 2 actually gets, you know, created in a matter of months, you know, from the first Evo paper to what happened. That's the way Tahoe gets created. The second piece is small, super-focused teams of superstars. You know, massive organizations, the vertically integrated ones, it's not just the capital intensity.
Starting point is 00:51:00 They're actually very inefficient, too. They go very slowly. You actually get bogged down in a lot of bureaucracy. And I think the third piece is associated with this naysaying thing. Again, in everything you want to do in bio, there are a lot of these very strong biologists who will tell you why this is not going to work. I think that has to change. We have to change it.
Starting point is 00:51:20 We have to think very differently about this. We have to try things out. And now we have the tools to do it. On this last point, when I talk to pharma cos, you know, they'll say, oh, AI and drug discovery, very interesting.
Starting point is 00:51:30 You know But you know what I actually don't spend that much Of my top line budget On drug discovery Most of it is wrapped up In clinical development And so a lot of them actually
Starting point is 00:51:40 Are much more excited about Things like, you know Natural Language Workflows To summarize clinical trial documents right, which are, you know, these massive regulatory filings and summarize them and make it easier to write these things and read them and, you know, just more normal AI stuff. Spotify and cohorts.
Starting point is 00:51:58 Yeah. Yeah. Reducing costs in that part of the cycle. And I think the thing that they're going to see as these models get better, right, virtual cell models actually helping you find the right target, where you can actually point the cannon in the right direction and measure twice and cut once,
Starting point is 00:52:21 I'm really glad both of you actually just brought up the naysayers because if you weren't going to, I was going to. I think I have now been pitched AI for biotech companies for at least a decade, right? And we haven't seen lots of – and there's also just the natural life cycle of bringing treatments to market. So let's say, like, you actually need 11 years plus generally. But, like, what would you, if you were going to leave, like, a broader audience with, like, a single claim about why that is true, obviously there were different approaches from, like, let's say, you know, a decade ago it might have been computer vision and consumer scale sequencing data, right? But, you know, why should this work now, or when should we actually begin to see treatments from these approaches in machine learning? I mean, I'd go back to, like, analogies this in the machine learning space. we had, you know, we called them artificial neural networks for a long, long time,
Starting point is 00:53:16 and then people would get all wrapped up around, oh, this perceptron can't model an exclusive-OR gate or whatever. Perceptron, what is this, 1990? Exactly. And it sort of, you know, bounced around for a while. And it wasn't until, you know, we had increases in compute, increases in data, and then, you know, more sophisticated models, that you sort of hit these nonlinear inflection points, right? And I mentioned earlier, you know, the ImageNet moment in 2009, and what happened
Starting point is 00:53:40 there was that it sort of drove the development of convolutional neural networks. I think AlexNet was the model that really showed the way. And before that, you know, we would think, oh, only humans can recognize images at high quality, a computer will never do it. Of course, now we know computers can do that better than humans. And so I think it's the same thing in AI and biology. And when I look, you know, sort of coming into this relatively new, like when I see the capability
Starting point is 00:54:03 on single-cell sequencing, it's kind of mind-blowing if you're not a biologist, but like this idea that we can take, you know, at a single-cell resolution, we can look. and how its expression is changing over time. Like, it's incredible. You take that, you then take the ability to generate lots of data around that, and then you take these much more sophisticated models and model training, and suddenly things are happening. Like, if you look at the Evo2 model, we trained it on 9.3 trillion nucleotides,
Starting point is 00:54:31 but we didn't tell it anything about DNA. We were just like, here's a lot of the DNA on the planet, across, you know, every single piece of DNA we could get a hold of. And then what did the model learn? It started learning all sorts of things. Like, it knows where ribosome binding sites are. It knows what codon degeneracy is. And then one of the things we showed is it can actually predict, you know,
Starting point is 00:54:52 the pathogenicity of BRCA1 variants, right, which are known to drive, you know, breast and ovarian cancer. And it does that with an area under the ROC curve of like 0.94, if I recall, looking at Hani. And, I mean, this is incredible. And we never taught it anything. It just learned this stuff, zero-shot. And so I think we're at that point of inflection now.
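For listeners unfamiliar with the metric behind that "0.94": the area under the ROC curve is the probability that a randomly chosen positive example (here, a pathogenic variant) gets a higher model score than a randomly chosen negative one (a benign variant). A minimal sketch with invented scores, not Evo 2's actual outputs:

```python
# Toy sketch of the metric behind a claim like "AUROC of 0.94": the
# probability that a randomly chosen pathogenic variant scores higher
# than a randomly chosen benign one. All scores below are invented.

def auroc(pos_scores, neg_scores):
    """Area under the ROC curve via the pairwise-comparison definition."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical zero-shot "disruptiveness" scores from a sequence model.
pathogenic = [0.91, 0.85, 0.78, 0.88, 0.60]  # known disease variants
benign     = [0.30, 0.42, 0.55, 0.25, 0.81]  # known benign variants

print(f"AUROC: {auroc(pathogenic, benign):.2f}")  # prints "AUROC: 0.92"
```

An AUROC of 0.5 means the scores are no better than coin flips, and 1.0 means perfect ranking, which is why a zero-shot 0.94 on variant pathogenicity is a striking result.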
Starting point is 00:55:10 I think all of us are kind of, you know, on the same page, all agreed on this, that I think we're at that point in time now where we're going to see that inflection. And it's going to be about the data, right? That's going to be the difference between where we were yesterday and where we are starting this week. It's going to be the data. So we're somewhere between GPT-1 and GPT-4, right, in biology? But where do you guys think we are? I'm like, I'm more like two.
Starting point is 00:55:38 Yeah. Yeah. And we're, like, developing GPT-2, but we're like, we don't have enough data, guys, we need more data. I think,
Starting point is 00:55:50 if you actually go a little deeper and you talk about different domains, I think in the protein models we are past GPT-3. When it comes to single-cell models and virtual cell models,
Starting point is 00:56:03 yeah, I think GPT-1 to 2 right now. I think we're closer to GPT-1 than 2. That's a pretty exciting timeline though, if you just take the progress and the pace of progress in other domains and apply it here. But I think the difficulty is exactly what you said, that with GPT-4, you immediately knew what you had. But if we hit the GPT-4 of, you know, cell state models, for example, for drug discovery, as you said, it will take some time to actually prove that point. And I think the law of small numbers always takes hold in drug discovery, right? You know, a platform that takes your success rate from, you know, 10% to like 30% is amazing,
Starting point is 00:56:48 but still it's like 30%. You need to get lucky. Right. And you still have the drug development cycle, which is on the order of 10 years. So you still have to wait for that to prove itself. To slowly go up in a 10-year rolling window, right? There's a counter to this. If we're systems optimists here, then I will say, like, we're just going to treat it, as systems people, we're just going to
Starting point is 00:57:05 treat it as a system. And if this was a terribly debilitating bottleneck at the beginning, then hopefully fixing it is a breakthrough. I think that's a great note to end on. Hani, Dave, Patrick, Nima, and Johnny, thank you so much for doing this, and congratulations. It's the data. Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen.
Starting point is 00:57:31 That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
