Latent Space: The AI Engineer Podcast - 🔬 Training Transformers to solve 95% failure rate of Cancer Trials — Ron Alfa & Daniel Bear, Noetik

Starting point is 00:00:00 So we basically opened the lab. We hired a team. We got all the instruments. We started sourcing tumor samples. There was no prior here that any of this would work like zero. We just started generating data. And like sourcing human tumors, processing. We built this whole processing pipeline to get the tumors into like these arrays and the formats.

Starting point is 00:00:19 So you've got like these two-week runs where you're processing two slides. And we're just churning data for months. And we couldn't even train a model. So we sort of just built all this. And then, let's say, 18 months later, hey, I wonder, can we train a model off this? And then it was not, you know, like it wasn't obvious. Yeah, there wasn't really, like, anything major to go off of. I mean, there were, like, transformers developed for single-cell data.

Starting point is 00:00:45 There just, like, weren't really data sets out there that people had been able to develop on. We do a lot of, like, custom model building. Hi, there. I'm R.J. Hanicki, and this is Brandon Anderson. We're the co-hosts of the Latent Space Science Podcast, and today we're really happy to be in the studio with some of the people from Noetik. I'm Ron Alfa, co-founder and CEO of Noetik, physician-scientist by training. My hobbies are making hot takes about AI curing cancer.

Starting point is 00:01:16 Hi, I'm Dan Bear. I'm VP of AI at Noetik. I'm a biologist by training, did Ph.D. work in neuroscience, and then moved into comp neuro, computer vision, self-supervised learning, and have been doing AI research at Noetik for the past few years. Maybe we should start with what is Noetik. Why did you found it? What is the difference between Noetik and the other virtual cell, yeah, companies?

Starting point is 00:01:41 Maybe just start with a little bit of the contrarian thesis, which is really the reason for founding Noetik. We all know the numbers, that 90%, 95% of cancer drugs fail in the clinic. Why do they fail? So our thesis is they fail not because we're bad at pharmacology, not because we're bad at target selection or making the drug. We're actually better at that process than we have ever been in the history of drug development. Most of those drugs fail, we'd argue, because we're bad at selecting which patients those drugs will ever work in. And

Starting point is 00:02:14 oftentimes you see trials where there is no placebo effect in cancer. Some patients respond to these drugs. And if you have a patient that responds, that tells you something, that there's some biology that's active there, but you have a problem in patient selection. And so really, that's the thesis: can we build models that can fundamentally understand patient biology from the very beginning and help you position molecules in the right patient population? So you're actually using the models partly, at least, to select the patient cohort, not just, so you can imagine working either way. You could design, oh, I think that this molecule will do well because I know something about the patient population, but you could also say, I think that this patient

Starting point is 00:02:56 population is the match for this molecule. And that's sort of the power of the models: once you've trained these models on patient data, you can use them on both sides of the equation. So you can use them for discovering new targets directly from the patient data, which people often refer to as reverse translation. So starting from humans and then trying to understand which targets to go after, and then you can use that to develop molecules. But you can also use them directly on patient data. If you have, you know, let's say you have a phase two or phase three trial, you can use these models to understand which patients or what underlying biology of the patients in the trial is a predictor of response. And we've been doing a ton of that recently.

Starting point is 00:03:42 Are you doing a lot of like rescuing trials that had a bad effect? We are doing a lot of looking at like data from phase two, phase three trials, and then using the models essentially to run inference on patient biopsies and understand whether there's underlying biology that would help us design the next trial. We haven't shared any of that yet, but you'll see this soon. So cancer is kind of infamous in that there are many, many different types of cancers. Whenever someone says, like, cure cancer, that is almost a meaningless statement. So your point is even amongst cancers, you pick a specific type of cancer, then a subtype and a subtype and a subtype.

Starting point is 00:04:23 There's a bunch of different patient populations, and each one of them will respond differently to drugs. And your point is you can figure this out right now, that like some subpopulation will do well and respond to this drug when you think, generally speaking, the rest of the population would not, even though we have historically classified this as like what type of cancer or what indication or so on. Yeah, that's exactly right. And I would maybe even go further and say like nobody actually knows what the subtypes are. There are cancers that originate in a certain tissue like the lung that, you know, have been classified into subtypes based on pathologists looking at them for,

Starting point is 00:05:01 you know, more than a century. And, you know, those subtypes certainly have some connection to the real, like, carving nature at its joints, like what are the actual functional subtypes of disease there? But our thesis is kind of that if you look at the data, a much richer kind of data, so the multimodal data that we're generating in our lab, we're going to see that actually, you know, what people thought was one subtype of lung cancer is really three distinct subtypes of cancer, and that is going to be critical for figuring out which patients should get which drugs. Yeah, maybe I'll just go back to, like, one of your first questions.

Starting point is 00:05:41 And, you know, I was saying, like, drugs don't, you know, many drugs fail in patients because we don't understand which patients they will work in in oncology. Why do we end up in that situation? So whenever you make a new drug, you do a set of experiments in cell culture, cells in a dish. Those cells are often cell lines. These cell lines have existed for 40, 50 years, and they're immortalized. So they have genomes that allow them to persist, that have abnormal numbers of chromosomes. They have gene expression patterns that don't represent any known cell in, like, the human body, really.

Starting point is 00:06:22 These are sort of Frankensteinian cells. It's a cancer and dry, but it ruinlessly can. They're mostly cancer. And so you can do your experiments in these cell lines in a dish, or then you can move these into animal models. And in oncology, you often have, you know, sort of a panel of different animal models with, you know, different cancer types that you'll test these in. And in doing these experiments, we sort of convince ourselves that some of these cell lines are, let's say, lung cancer cell lines or colon cancer cell lines. And then even that some of them in the mouse context are colon cancer cell lines and lung cancer. And then, in the mouse, we implant them under the skin in like weird places and we treat the mice with drugs and we see how they respond.

Starting point is 00:07:09 But ultimately, there is a big gap because they don't translate to patient biology most of the time. So these cancer cell lines, most of them don't even, you know, even if they are derived from a colon cancer, they don't even have the mutations that human colon cancers have in many cases. And so pharma has done this for, you know, 20, 30 years, where you develop a drug, you test it against, you know, hundreds of these. It's not a hard experiment. You can send this out to any CRO. They'll test your drug against hundreds of different cancer cell lines, and you can sit back and say, okay, well, which of the 50 colon lines responded to my drug and which of the 50 ovarian cancer lines? And you could try and map that

Starting point is 00:07:53 to human biology. But the problem is these cell lines as an abstraction do not relate in any way to human patients. And so what happens is ultimately, no matter what you do preclinically, the molecule gets into the clinic, and the clinical team says, look, we don't really know how to design this trial because none of the data that you've produced gives us any insight on which patients to run. So we're going to basically enroll an open label study. So we're going to enroll all tumors, all patients that, you know, enroll in this trial, and we're going to see where we get signal. Imagine doing that in an early phase trial where, let's say, you have 50 patients and you're trying

Starting point is 00:08:32 to do, you know, test different doses, and you don't really know the dose of the drug, and you don't know what the safety margins are, and you're also trying to figure out, where is my signal? And then what if I told you that, let's say, in just lung cancer, hypothetically, let's say there's only 10 different subtypes of lung cancer. And you don't even know if it's lung. It could be any. So, you know, this is what happens. And oftentimes you get to the end of these early-stage trials, and you don't see very many responders, as you would expect, statistically, and then these molecules get canceled. So you're imagining that with your Noetik system, you help the pharmaceutical company to characterize, we expect that people with a certain genetic profile or even transcriptomic profile will respond to this drug.

Starting point is 00:09:23 And then you go and you actually sequence from the patient and you say, yes, this is a match. You know, is that the sort of grand vision? Yeah, I mean, I would say we are even less biased than that. We are saying, okay, well, we want the model to learn, let's say, from lung cancers. We want the model to learn, like, how many different therapeutically relevant subtypes of lung cancer there are, just from self-supervised learning from the data. And those subtypes could be driven by large genetic changes. They could be driven by, you know, immune changes. It could be really driven by any biology that the model is, like,

Starting point is 00:10:01 learning in the process of training. And we do see, you know, different types. I mean, feel free to contradict this, like, as the actual doctor here. But, like, you know, the biomarkers that, you know, people have been using are, you know, biased towards simplicity: you know, does the patient have this particular mutation? Sometimes, like, stain for this single protein or, you know, do transcriptomics, like, to look for a particular gene signal. But, like, there's no reason to think that biology or like biology of cancer is that simple

Starting point is 00:10:39 that you're going to capture, you know, most of the meaningful variation with such simple biomarkers. And, you know, most of them, they have like weak correlations with, you know, clinical success. But the hypothesis really is here, like, again, if you were to carve nature at its joints and figure out what's really going on, there are, you know, these five subtypes, and the correlation there between which patients you give a particular drug and whether you have success is much, much stronger than if you're forcing yourself to go with these like very simple biomarkers. You mentioned the lab. You do a lot of data generation in the lab. So why do you think that that versus using existing public repositories or whatever is appropriate? Yeah, we generate all our data in the lab. Everything from sourcing tumor samples themselves to processing them and generating the data.

Starting point is 00:11:36 Maybe another hot take I have just in AI and bio is you're sort of not at the order of magnitude of data that you are in other spaces of building training models. And so it becomes really hard to brute force these problems just by collecting data. We have a couple pretty good examples of where someone has designed a data set. So PDB was designed and has been built over the past 50 years or so. And so it's not an accident that that data set exists. Someone decided that we are going to design this data set. We're going to collect this data over decades and decades. And then with the intuition that potentially this would help solve protein folding down the road.

Starting point is 00:12:21 And it did. So it's not just that PDB is a bunch of random data that, you know, people have organized from the web. I think that in bio, you really need to be intentional about the data that you generate and how you generate it and have some foresight around, well, what are the models we're going to want to train and what are the models going to need to learn from, from the very beginning? So that's why we've taken this approach. Yeah. And I mean, like a good comparison is to the ImageNet dataset, which kicked off the deep learning revolution in computer vision,

Starting point is 00:12:54 with convolutional neural networks, like actually demonstrating that neural networks can do better than other methods on object categorization. ImageNet is, at least the part of it, that people were developing models on, is 1.2 million images very carefully curated. These are high quality images, not like random images from the internet or like multiple data sets

Starting point is 00:13:20 cobbled together. And labeled? Yeah, and labeled. Yeah, and labeled. And I think with the data that we're generating, we're around that scale right now. But, you know, of course, people have gone much, much larger in image datasets and language data sets, text data sets,

Starting point is 00:13:38 obviously for LLMs. So we think that we need to get the data up to that scale before we can really see the meaningful progress on the algorithm side. The scale of language data. Yeah. Language is really the only modality where people are seeing these very impressive scaling results,

Starting point is 00:13:59 and part of that has to be just the scale of data that's there and that the models are trained on. That can't be the only thing because, you know, there's a lot of, like, video data as well. People are training on, like, thousands of hours of video data and, you know, haven't seen kind of the scaling results that you have in language modeling, but having the right scale of data is necessary,

Starting point is 00:14:23 if not sufficient, to like really make progress here. Can I offer a contrarian take to that? Sure. So I mean, there's this whole concept about the jagged frontier of LLMs and sort of AI: how, like, in certain regions they can be really good at solving some problems and then remarkably stupid at solving other problems. And maybe the argument for what's happening is that a lot of these frontier models are just becoming massive, like everything is becoming in distribution.

Starting point is 00:14:49 Like everything starts out OOD, and if you just get more data, it becomes in distribution. Is it possible that for biological systems, because there are underlying physical processes here, that you can basically make things more in distribution earlier and that you can actually cover the space? I kind of have some follow-ups with PDB, but I'm just curious at this point. Yeah, I mean, I think it's a good question, like, sort of how much data and what kind of diversity do you need, like, in biology, to solve, say, like the drug translation problem, like figuring out which drugs are going to work in which patients. My intuition from working in biology, like for a while, is that we're still pretty

Starting point is 00:15:36 far from that, like, because, you know, we're building data sets that are focused on right now cancer and, you know, have generated data from thousands of patients in a few major cancer subtypes, but there's like every other disease, there's healthy tissue, there's even other species. You know, there's a lot of biology to learn, especially if you think about it as we have to learn kind of the spatial and functional patterns of tens of thousands of genes, tens of thousands of proteins, how their spatial arrangement contributes to the function of organs and so forth. My hunch is that biology is like pretty complex and that we still need to generate a lot more data. But yeah, I don't know.

Starting point is 00:16:23 But as a cancer company, do you think you could actually do this hypothetically for cancer? I mean, for at least some sort of class of the mainstream? Definitely. Yeah. I think that we've done experiments that suggest that, you know, if we can generate data from several hundred patients in all of the major cancer indications and some of the less major indications, that will result in a model that can generalize pretty well to kind of any type of cancer we would throw at it. Backing up, what is the data you're collecting? Because my understanding

Starting point is 00:16:57 is you use some pretty specialized instruments and gather very specific data sets. So how did you come to that decision about how much data, how much to spend on it, and what types of data? I'll give a hat tip to my previous employer, Recursion, so we spent six years at Recursion from the very beginning. And a lot of what we were doing in the early days was figuring out the things we didn't understand about the data sets and figuring out what the problems would be in the data sets.

Starting point is 00:17:27 So batch effects, controls, how to orient samples on plates, things like that. Flash forward to the founding of Noetik: we started the company already with some principles around how we should think about building the data set. What are some things that we know mattered? So, for example, over many years, we learned that images are actually a really powerful data set for machine learning for many reasons.

Starting point is 00:17:51 One, their scale. So we can put patient samples on slides and on a single slide, we can capture many patients' worth of biology. The images themselves are very rich sources of biological information beyond. Now we have a very information-dense modality, and we can decrease the cost of data generation, so then we can increase the amount of data generation over the whole dataset. And that's always been a really big benefit to image-based modalities over, let's say, sequencing, where every time you run a sequencing run, you're basically handling, you know, one patient sample.

Starting point is 00:18:27 That was one way to think about it. The other was how do we design these datasets so we can control for things that we know are going to be important, such as batch effects. So, for example, if I have a slide, we do, let's say, a spatial transcriptomics run on that slide. You stain the slide, do a bunch of wet lab processing, you put it into a machine, you get data out. If you do that on two different days,

Starting point is 00:18:55 there are going to be different variables that impact the data. That's going to be a large source of variation in datasets. So you want to be able to control for things like batch effects. So really you want more patients represented on multiple different slides, so you can process them in different batches. So you want to be able to control for things like this so you can go downstream and look at the data and say, okay, well, once we have, let's say, patient level embeddings, we can ask, well, do the patient level embeddings represent, let's say, patient response to immune therapy, or do they represent staining batches? So you're actually taking one patient and you're spreading them across multiple slides so that you can get a, like a, it's sort of a calibration across the slides. Yes, our data looks very different than anyone in the space of generating

Starting point is 00:19:44 data on histology or digital pathology types of specimens. So we receive a sample. We sample those samples dozens of times to build these arrays. And each array has hundreds of different patient samples randomized. And every patient is represented on multiple different arrays. And so we're getting a lot of different representations of each patient that we're sending through the data processing pipeline. And then that lets you downstream be able to answer some of these questions and control for some of these variables. You mentioned some terms I just want to define for people: spatial transcriptomics, what is that? Yeah. So what would be? I mean, this was your first question. So what are the data types? So if you just sit back and this is not my background in terms of spatial,

Starting point is 00:20:27 again, everything we did previously was cell biology in a dish. If you just sat back and you said, okay, I want to train a foundation model that understands human biology. What does that mean? How would you go after that problem? And that was really the starting point for the company: okay, from first principles, how would we do this? So you probably want tissue-level biology. You want to understand tissue; cells are organized into tissues. You probably want some modality that is relevant in clinical use. So you can relate clinical data to what your models are learning.
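The array design described in the conversation (each patient sampled several times, randomized, and represented on multiple arrays so that staining or run batches can later be separated from patient biology) can be sketched roughly as follows. This is only an illustrative sketch, not Noetik's actual pipeline: the function name `build_array_layouts`, the patient IDs, and the counts of cores and arrays are all hypothetical.

```python
import random
from collections import defaultdict


def build_array_layouts(patient_ids, cores_per_patient, n_arrays, seed=0):
    """Spread each patient's cores over distinct arrays, in random order.

    Hypothetical helper: every patient contributes `cores_per_patient`
    cores, each placed on a different array, so no patient is confined
    to a single processing batch.
    """
    if cores_per_patient > n_arrays:
        raise ValueError("need at least as many arrays as cores per patient")
    rng = random.Random(seed)
    arrays = defaultdict(list)
    for pid in patient_ids:
        # Pick distinct arrays for this patient's cores.
        for array_id in rng.sample(range(n_arrays), cores_per_patient):
            arrays[array_id].append(pid)
    # Randomize the position of cores within each array.
    for array_id in arrays:
        rng.shuffle(arrays[array_id])
    return dict(arrays)


# Hypothetical cohort: 200 patients, 3 cores each, spread over 6 arrays.
patients = [f"P{i:03d}" for i in range(200)]
layouts = build_array_layouts(patients, cores_per_patient=3, n_arrays=6)

# Check the design property: every patient appears on 3 different arrays.
arrays_per_patient = defaultdict(set)
for array_id, cores in layouts.items():
    for pid in cores:
        arrays_per_patient[pid].add(array_id)
```

With a layout like this, a downstream check on patient-level embeddings can compare cores of the same patient across arrays against cores of different patients on the same array.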

Starting point is 00:20:59 That's why we generate pathology, H&E. So that's, you know, what every patient gets: a tumor is removed, and then it gets this stain, H&E. And that's what the pathologist... Can you explain what H&E is? It's basically two different dyes, hematoxylin and eosin, and it really just creates a contrast over the tissue. So you've probably seen these like purple-ish pathology specimens. So pathologists can look at those and they can identify different cellular structures

Starting point is 00:21:29 and they've used those to classify tumors based on, you know, the classical classifications of, you know, adenocarcinoma, small cell carcinomas, things like that, based on basically cellular structures. Okay, so there's like specific patterns which show up when you add these two stains and it is well established that like you classify tumors based on. Based on, yeah, pathology classifications.

Starting point is 00:21:54 And this is what every, basically every tumor, you know, that gets processed in the hospital will get this H&E. And it's how the pathologist typically classifies a tumor from the first level. So, okay, so you want that. You probably also want to understand cell types. It's really hard to understand cell types from just that stain because it doesn't reveal that much that a human can use to classify cell types at least.

Starting point is 00:22:20 So you can say, well, I want to know whether there are immune cells and different subtypes of immune cells. We want to have some layer of cell biology. Okay. And you want to know about immune cells because, like, you have these cancer cells and oftentimes the immune response dictates whether or not, like, it will be an effective treatment or... It's like the immune environment of the tumor, we know, is a core constituent of whether a patient's going to respond or not. So you want to know, okay, you want to give the model this. So the model is going to get this tissue level information.

Starting point is 00:22:51 There's not enough cell level information in there for the model to learn enough cell biology of all different subtypes. So we also want to present it with some cell level information. So we use protein stains, so standard immunofluorescence. So you basically use antibodies against a small set of cell markers to label different T cells, B cells, your standard subtypes of cells in the tumor microenvironment. So in this stain, just for those who aren't familiar,

Starting point is 00:23:19 the stain on the antibody has a fluorescing protein; when you hit it with a certain frequency of light, then it fluoresces, so you can tell the antibody bound to a certain protein and now it has a fluorescent protein attached to it. Yep. And in terms of the data, so from the tissue layer, you have an RGB image.

Starting point is 00:23:41 From the next layer, you have a multi-channel image with each channel representing, you know, let's say one color. And so, for example, certain immune cells are each in a different channel. So you have this multi-channel image. Now, okay, so that's great. So we've got tissue and we've got cells. But if we actually want to make drugs, we need some type of molecular information. We need to tie all of this down to what's happening in the genome.

Starting point is 00:24:06 what is the cell doing? What are the mechanistic principles of the biology? So then we get spatial transcriptomics. So that's spatially resolved RNA. So DNA is transcribed into RNA, which is translated into proteins. So we get basically the RNA in a spatially resolved pattern for the same cells that we're seeing in all of these other layers. So now you have between 1,000 and 19,000 different genes. And again, these are all image layers that are spots of where those RNA are in which cells. And this works a little bit similar to how we talk about protein, where you have a segment of RNA and then you have a fluorescent protein. And usually there's some sort of combinatorial things.

Starting point is 00:24:51 So you have, if you see these four colors in this amplitude, then that means this gene because they're right next to each other or something like that. So for the detection method, you're basically binding a probe to each one of those RNAs and then you're cycling it, and it takes weeks to run one of those assays. So the machine will cycle across each species, and it'll amplify, and you'll get a signal for each RNA species. Now, at this point, you now have basically this very rich data layer where you have the tissue, you have the cells, and you have the molecular information,

Starting point is 00:25:22 and you can use all of that to train the model, and so we think of it as, it's essentially the central dogma, if you will. And we also have DNA; we genotype just so we understand the genomic alterations in these tumors. All right. So you get the stack of images, basically, that you can train models on, understanding the expression of genes and the proteins that are being expressed at the time that the sample is taken, all in the image information. And then you can train your models with that. Yeah, I mean, the spatial transcriptomics is like particularly dense because if you think,

Starting point is 00:25:57 let's say, there are 20,000 genes in the genome. Now, you know, we're running assays that are detecting nearly all of them in a single sample. So you can think of one of those data points as an image, except instead of being an RGB image that has three color channels, now all of a sudden it has like 20,000 colors. So it's like a very meaty computer vision problem to try to look at those data and figure out what makes patient A different from patient B and then go from that to which drug is going to work in which way.
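To make the "image with roughly 20,000 color channels" picture concrete, here is a back-of-the-envelope sketch of the channel count and dense storage size when the three layers are stacked for one tile. All specifics are assumptions for illustration: the 16-marker immunofluorescence panel, the 1024 x 1024 tile, and the bytes per value are hypothetical, and the 19,000-gene figure is simply the upper number mentioned in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modality:
    name: str
    channels: int          # number of image channels in this layer
    bytes_per_value: int   # storage per pixel per channel (assumed)


def stack_summary(modalities, height, width):
    """Return (total channels, dense bytes) for one co-registered tile."""
    total_channels = sum(m.channels for m in modalities)
    bytes_per_pixel = sum(m.channels * m.bytes_per_value for m in modalities)
    return total_channels, bytes_per_pixel * height * width


modalities = [
    Modality("H&E (RGB)", channels=3, bytes_per_value=1),
    Modality("immunofluorescence", channels=16, bytes_per_value=2),  # hypothetical panel size
    Modality("spatial transcriptomics", channels=19_000, bytes_per_value=2),
]

# Hypothetical 1024 x 1024 pixel tile.
channels, dense_bytes = stack_summary(modalities, height=1024, width=1024)
print(f"{channels} channels, {dense_bytes / 2**30:.1f} GiB if stored densely")
```

Under these assumptions a single dense tile runs to tens of gibibytes, which suggests why transcript detections are more naturally kept as sparse spots (coordinates plus gene identity) than as a dense channel stack.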

Starting point is 00:26:33 And so you have a hot take about virtual cell? Like, I want to understand how, okay, so you, you know, you have this big pile of data, that every single sample has a massive data set with it and then you have many, many samples. So how do you turn that into useful knowledge? Maybe just what is a virtual cell? Everyone's always, you know, asking that question. I think there are really two ways to think about it. You know, one is we want to be able to simulate all the biochemical processes in a, you know, cell. So we want to have this sort of comprehensive foundation model where we understand, you know, if some signal from outside the cell interacts with the cell, then here are the millions of intracellular chemical reactions that are going to happen, and you could sort of predict them, you know, from the model. So that's one view. I think that's interesting, it's sort of an interesting intellectual pursuit. I don't think we have all the modalities of data that you would need to solve that problem.

Starting point is 00:27:35 I tend to see the virtual cell problem as something more practical. We're trying to make drugs that work in patients. So from a virtual cell perspective, really what we want to do is understand cell biology in some heuristic that's useful for making drugs. And the heuristic could be a way to understand drug targets or a way to map your cell level biology up to patient level biology. And so the way we've designed these first virtual cell models is maybe just to simulate the biology of a cell in some context.

Starting point is 00:28:09 And the biology of that cell being, you know, let's say the cell being in some context and the output being, you know, the transcript in that context or, you know, the protein in that context. And these types of, you know, input-output relationships allow us to essentially design experiments. And so really the very simplistic thing that we're doing is really just the model can simulate the biology of cells,

Starting point is 00:28:31 or many cells in different contexts, and you can run some simulations in that regime. Yeah, I mean, I think most of the things that people are calling like virtual cell models right now are focused on single cell gene expression, so transcriptomics data, RNA data,

Starting point is 00:28:52 and they're largely geared toward the problem of predicting what's going to happen to the transcriptome, so the set of genes expressed, when you hit cells with either a small molecule, a drug, or a genetic perturbation. And typically, this is cells grown in vitro, like either cell culture or primary cells, something like that. I think that...

Starting point is 00:29:16 Genetic perturbation being where I, like, knock out a gene or add a gene and see how that impacts the expression of the various RNAs. Exactly. And I think my view, and I think Ron shares it, too, is that, like, that may be of interest in some cases, but the problem we're really trying to solve is predicting what's going to happen in a patient. And just modeling data that comes from a patient is, in my mind, much more likely to translate to what happens when you give a patient a drug than something that's happening in cell culture. Is there other clinical data that you're pulling

Starting point is 00:29:56 into the model besides the actual, so you're calling the context of the cell, just the surrounding cells, but is there other this drug caused a bad reaction kind of stuff? Yeah, I mean, we're pulling in data from the entire patient, so not just, you know, the very local neighborhood of the patient. So far, we haven't done much integration of, you know, like electronic health records or, you know, other information that one could get about the patient. And that's pretty intentional. Like, we really want these models to learn basic biology, again, like the central dogma, not just the central dogma, but, you know, the basic biology of genes, protein, cells, tissue in a self-supervised way. So purely from the data that we're generating and not

Starting point is 00:30:49 be biased by, you know, what the doctor wrote about that patient. Because, you know, our thesis is kind of like most of the therapeutically predictive and important information is not contained in those very small number of patients who have been treated with a given drug and whatever the doctors thought was important to write down, given the state of knowledge at that time. So it's much more about trying to discover what's really there in patient biology than go based on the text that people have written about it. So you have this self-supervised model. You feed it a lot of data. You have essentially some clusters of patients now. How do you translate those clusters of patients to making decisions? Like you go to a pharma company and you say we can repurpose or we can suggest this subtype

Starting point is 00:31:44 should be the focus of your phase two trials. Like what is the process for that? What data do they need to provide you and how do you translate your models? So it depends on what the problem is. I think it's important. One of the more interesting aspects of these models is they are useful for a broad array of use cases, as we were talking about from the very beginning. So you as the pharma company could say, okay, well, I have this molecule and the target of the molecule's X, and I want to design my clinical trial, the molecule has seen zero patients. All I know is the target and some biology around the target. So we can run simulations using the models and our cohorts of patients. And let's say if we were to look at, you know, lung cancer, we can run simulations around the target and ask, okay, in which sets of patients here would this target be important? And across a cohort of, you know, lung cancers and colon cancers and, you know, across all of oncology. And you might see, and we see this sometimes, you might see that, you know, with your target, you probably don't want to put it in lung cancer. Maybe you want to put it in ovarian cancer because it's not really important in lung cancer.

Starting point is 00:32:53 Yeah. What are you simulating here? So, like, are you, you say that this drug is expected to knock down this gene, and therefore you want to look for clusters where knocking down this gene inhibits tumor growth rather than enhancing tumor growth? I mean, that's certainly one way we could do it. There are other types of simulation where you might just want to ask, like, if there were an immune cell here, like a T-cell, which is responsible for

Starting point is 00:33:23 actually killing tumor cells, what would happen to it or what genes would it express or what proteins would it express in this particular patient's tumor microenvironment. And that's what we've called, like, these virtual cell simulations, like we have a model called Octo Virtual Cell that does this. And that can give quite powerful answers to the question of, are these drugs going to work in these patients? Because you might find, like, actually, as Ron was saying, the thing that this drug targets is just not important in this particular patient's tumor in that it's not going to have any effect on the T cells or the macrophages or some other cell type there. Then, you know, there's the type of simulation you alluded to where you can ask the model

Starting point is 00:34:11 what would happen to this patient's tumor if you were to knock down this particular target gene or its protein product. And you might be looking for cases where the model predicts that removing that gene or that protein is going to have a large effect, like either increase the immune system function, its ability to fight that tumor, or decrease the tumor's ability to grow, or some other readout that you think is correlated with clinical success.

Starting point is 00:34:44 I just want to call out maybe the simplest use case is the one where there's like a company that has a drug and they've given it to some patients and we know some of those patients responded and then it just becomes like a question of, like, does the space of patients that the model has learned via self-supervision tell us that all of the responsive patients are in one of these clusters and not the other nine clusters or something. So if we know that, then there's a pretty straightforward hypothesis that this is the right cluster. So that's the scenario where you would sequence something.
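The "are the responders all in one cluster" question above can be illustrated with a small sketch. It assumes each patient already has a cluster label derived from the model's embeddings, plus a known response outcome. The toy cohort, the function names, and the choice of a one-sided hypergeometric test are illustrative assumptions, not a description of Noetik's actual analysis.

```python
from collections import Counter
from math import comb

def responders_by_cluster(cluster_ids, responded):
    """Return {cluster: (n_responders, n_patients)} for each cluster."""
    totals = Counter(cluster_ids)
    hits = Counter(c for c, r in zip(cluster_ids, responded) if r)
    return {c: (hits.get(c, 0), n) for c, n in totals.items()}

def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric tail P(X >= k): the chance of seeing at
    least k responders in a cluster of n patients, when K of the N
    patients overall responded and cluster membership is random."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Hypothetical toy cohort: 12 patients, 3 clusters, 3 responders.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
responded = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

counts = responders_by_cluster(cluster_ids, responded)
N, K = len(responded), sum(responded)
for cluster, (k, n) in sorted(counts.items()):
    p = enrichment_p_value(k, n, K, N)
    print(f"cluster {cluster}: {k}/{n} responders, enrichment p = {p:.4f}")
```

In this toy cohort all three responders fall in cluster 0, so that cluster gets a small p-value while the others get 1.0. A real analysis would also need to correct for testing many clusters.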

Starting point is 00:35:22 What would you collect about those? So you have a cohort that responded and one that didn't. Yeah, so this is getting back to something Ron mentioned earlier, which is this type of data called H&E. It's a stain, the standard histology stain that makes these, you know, pinkish and purplish-looking images. Right now, what we do is we've built models that are trained on kind of all of the multimodal data we generate, but then once they're trained, at inference time,

Starting point is 00:35:53 all they need is an image of H&E, and that could be something that we generate in our lab, or it could just be, you know, a digital image that they have from a trial that was run years ago. And the reason that that is so powerful and flexible is, again, because H&E is kind of like the lingua franca of pathology and especially oncology. Almost every patient who's been given a clinical stage drug is going to have that. You can look at the two cohorts, the responders and the non-responders, and say these H&Es live in this part of the latent space and these H&Es do not. Yeah, exactly. And I think, you know, one way we've gone further than that even is given the H&E, they can say, I predict that these genes are expressed at this location in

Starting point is 00:36:47 this patient. So not only do we have these clusters, these embeddings that say, you know, all of the responders to this drug are over here, all of the non-responders are over there, but we can actually see, okay, for the responders, these are the genes that are expressed much more highly, are predicted to be expressed much more highly in the responder cluster versus the non-responder cluster. And so that adds a major, like, level of interpretability there, because, you know, we can see things like, okay, like good. The responders are actually expressing the protein target of this drug. So we would be worried if that weren't the case, but, you know, we can see it is.
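One simple way to picture "genes predicted to be expressed much more highly in the responder cluster" is a log fold-change ranking. The sketch below assumes predicted expression is available as a per-patient dictionary. The gene names, patient IDs, values, and the pseudocount are all hypothetical, and a log2 fold change is a generic stand-in, not necessarily the comparison used in practice.

```python
from math import log2
from statistics import mean

def rank_genes_by_log2_fold_change(expr_by_patient, responder_ids, pseudocount=1.0):
    """Rank genes by log2(mean responder expression / mean non-responder
    expression), highest first. A pseudocount avoids division by zero."""
    responders = [e for p, e in expr_by_patient.items() if p in responder_ids]
    others = [e for p, e in expr_by_patient.items() if p not in responder_ids]
    genes = sorted(next(iter(expr_by_patient.values())))
    scores = {}
    for gene in genes:
        resp_mean = mean(e[gene] for e in responders) + pseudocount
        other_mean = mean(e[gene] for e in others) + pseudocount
        scores[gene] = log2(resp_mean / other_mean)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical predicted expression (arbitrary units) for four patients.
predicted = {
    "p1": {"DRUG_TARGET": 7.0, "GENE_B": 3.0},
    "p2": {"DRUG_TARGET": 7.0, "GENE_B": 3.0},
    "p3": {"DRUG_TARGET": 1.0, "GENE_B": 3.0},
    "p4": {"DRUG_TARGET": 1.0, "GENE_B": 3.0},
}

ranking = rank_genes_by_log2_fold_change(predicted, responder_ids={"p1", "p2"})
for gene, lfc in ranking:
    print(f"{gene}: log2 fold change = {lfc:.2f}")
```

Here the hypothetical drug target comes out on top because only the responders are predicted to express it, which is the kind of sanity check described above.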

Starting point is 00:37:31 On the other hand, we also see that, you know, the biology is very, very complicated. So kind of explaining why these simple biomarkers, like looking at a single gene or single protein, just really don't capture, you know, what is predictive of therapeutic response. So I have like a million directions that I want to go here. H&E, that actually gives you a pathway to a diagnostic then as well. Exactly. Yeah. Right.

Starting point is 00:37:55 Yeah. Yeah. And so you can imagine that after the drug hopefully makes it to the market, then a doctor says, oh, you have cancer. I'm very sorry. We're going to do an H&E stain of your tumor. And then we're going to put it in the model. And it says, oh, you know, this one will work for you, but this one won't. That's right.

Starting point is 00:38:16 And you can, so we're using the same approach actually today. We're looking at many different mechanisms from different collaborations that we have in place. You know, one of them we've announced with a company called Agenus. These are all different mechanisms. The input is still H&E, you know, and some of the same indications. So using H&E, we're asking whether drug A works in some sets of patients, whether drug B works in other sets of patients. And so you can take that, you know, to its natural progression and say, well, okay,

Starting point is 00:38:46 If you can use that same input, just H&E, for, you know, experimental drugs, why not use it also for drugs that are on the market already? In a sense, the same assay can be very predictive across many different cancers and many different potential therapeutics. There are lots of models that take H&Es and go to gene expression out there, open source, whatever. They do, you know, so-so. I've read on Twitter, in your Twitter feed and wherever, that you feel that you have,

Starting point is 00:39:16 a data moat, right? And so why is Noetik's model better? Sure. I mean, I think, you know, the scale of data that we've trained these models on is like, you know, pretty different from a lot of what's out there. Like, the reality is, there's just not that much of this kind of paired H&E plus other data modalities. Typically, you know, there's some data sets generated by academic labs, others where, you know, they might have maybe like a hundred or a few hundred patients' worth of data with paired spatial transcriptomics. That might even be an overestimate. In comparison, we're generating these data that are multiple patients per slide,

Starting point is 00:40:00 individual patients distributed across multiple slides. We've generated now more than 100 million cells of spatially resolved spatial transcriptomics that's all paired with H&E and protein as well. That's at least an order of magnitude larger than any of the other data sets that we've seen out there. And I think that makes a pretty enormous difference. I mean, we've seen with our own models that if you drop down to 40% or 10% of that data used in training, the models get a lot worse. And they especially get worse at kind of generalizing to other types of cancer from the ones that they've been trained on.

Starting point is 00:40:42 So I think that's a big piece of it. I also think that, you know, the algorithmic side of it is important. You know, we've developed custom architectures specifically for training on this multimodal data. And again, my background is in computer vision and specifically in self-supervised learning there. And so we've tried to develop, you know, self-supervised learning approaches for these data that are really adapted for solving this problem of, you know, figuring out what is different in one patient versus another, simulating what would happen if you were to knock down a particular gene or protein or something. So this is why we call these world models, where we're trying to build models that can simulate what's going to happen if you take a particular action. I think that's another big differentiator

Starting point is 00:41:33 for these models. And then again, the interpretability as well is probably a third one. It's probably because you were just talking about how one of the other strategies people take for this is to do perturbations on cells and then watch the response. And now your strategy is you can simulate this sort of counterfactual perturbation idea without even having to collect the data to do that. And you can see this. Well, there's, yeah, there's a big piece that we haven't talked about yet, which is actually we are running perturbation experiments, except they're in vivo perturbations using a platform based in mouse. We have another platform, it's called Perturb-map, Ron, if you want to describe

Starting point is 00:42:25 any of it, but basically it is a platform for generating highly multiplexed knockouts of individual genes, so the same kind of like CRISPR knockouts that people are doing for individual cells in vitro, except when we knock out a gene in a cancer cell, that cancer cell gets injected into a mouse. It's barcoded, so we know which gene was knocked out, and it's being injected alongside, like, roughly 100 other cell types with different genes knocked out. So you end up with mice that have tumors that are barcoded, that have 100 different genetic perturbations in them. We can actually use that to validate our models and ask, you know, is what the models are predicting in humans via simulation

Starting point is 00:43:14 actually borne out when you do these perturbations in a mouse system. Sorry, there's a lot to know. We take that. Barcode? Yeah, so sorry, barcoding. This is a technology in which an individual gene is knocked out with CRISPR, but also this introduces a set of protein tags in that cell that get expressed. It's a combinatorial code.
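A combinatorial code of this kind can be sketched in a few lines: each knocked-out gene is assigned a unique combination of protein tags, and the set of tags detected in a tumor is looked up to recover the gene. The tag names, gene names, the choice of three tags per code, and the order in which codes are assigned are all assumptions made for illustration; the transcript does not specify how many tags the platform uses.

```python
from itertools import combinations
from math import comb

def build_codebook(genes, tags, tags_per_code=3):
    """Assign each gene a distinct combination of protein tags.
    Codes are handed out in lexicographic order (an arbitrary choice)."""
    if comb(len(tags), tags_per_code) < len(genes):
        raise ValueError("not enough tag combinations for this many genes")
    codes = combinations(tags, tags_per_code)
    return {frozenset(code): gene for gene, code in zip(genes, codes)}

def decode(codebook, detected_tags):
    """Look up which gene was knocked out from the tags seen in a tumor.
    Returns None if the tag set does not match any code."""
    return codebook.get(frozenset(detected_tags))

# Hypothetical tags and genes.
tags = ["A", "B", "C", "D", "E", "F"]
genes = ["GENE_X", "GENE_Y", "GENE_Z"]
codebook = build_codebook(genes, tags)

print(decode(codebook, {"C", "A", "B"}))  # detection order does not matter
print(comb(10, 3), "distinct codes from 10 tags taken 3 at a time")
```

The arithmetic is why a combinatorial scheme is attractive: a small panel of antibody-detectable tags can label on the order of a hundred different knockouts in one experiment.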

Starting point is 00:43:39 So gene X might have proteins A, B, and C. Gene Y, when it's knocked out, has proteins D, E, and F. And we can tag those proteins or label them with antibodies so that when we go and look in the mouse, we know exactly which gene was knocked out based on which of those protein tags were expressed. So you knock out a gene, but you also added a gene that has the barcode proteins.

Starting point is 00:44:07 encoded on them. Yeah, exactly. And I mean, the system's designed, so everything that we're doing here is tissue level. It could be, you know, tumors that came from humans that are in the form of the tumor, that are, you know, the whole tissue. And then here, in this mouse system, you have hundreds of tumors in the lungs of a mouse. And if you look at these images, it's a mouse lung with like literally hundreds of tumors in it. And each tumor has a distinct biology that's driven by the biology of the knockout of the gene that's being perturbed. And we can capture basically the biology of each tumor in a spatially resolved way. So what you can see is, okay, well, we have certain tumors in humans that, let's say, don't have immune cells in them. And so those tumors are very aggressive and they don't respond to immune therapies.

Starting point is 00:44:59 You can generate those same tumors in this mouse system, and again, they don't have immune cells. And you can do it genetically, so you can start to map kind of the gene, the causative gene relationships between these different immune or just broadly tumor genotypes or biological profiles, if you will, to what you see in the human. And then you can treat those mice with drugs and you see how hundreds of tumors in a single mouse respond to treatment with one drug, or you can treat many different, you know, let's say 50 different knockouts across a panel of mice with 50 different drugs.

Starting point is 00:45:36 And you can start to build this intersectional pharmacology and, you know, genetic experiment. On Twitter and in various places, I've heard you say Noetik is no cell lines, no mouse models. Maybe you even said that, you know, a few months ago. And then you just said you have mouse models. Yes. And you're injecting cell lines, too. And then a lot. In the one, not under this thing.

Starting point is 00:45:58 So yes. So, you know, fundamentally, we think it's really important to build models that are trained on human data, and we were sourcing all these human tumors to build human-centric models. So that is true. From the very beginning, we have asked this question of, you know, let's say we want to develop a drug from the very beginning, and let's say the FDA, and I know things have changed a little bit with the FDA, but let's say the FDA wants you to have some data in an animal that says your new mechanism works in some animal system. What do you do? You're kind of stuck because you've now generated

Starting point is 00:46:38 arguably the best data that you can in the human system. And then the FDA says, well, cool, but does it work in the mouse? How does it work in the mouse? And then so you have to back into this system that doesn't translate. And so from the very beginning of the company, this has been, you know, sort of a question. And so, you know, probably at the same time we started generating the human data, we started building this mouse platform with the aim of drawing connectivity between these two systems. And so we focused on a platform. We want a platform that, one, allows you to map the diversity of human tumors

Starting point is 00:47:13 because we know that if we just run a mouse model with one tumor, that tumor has no connectivity. So in the mouse system, we want to have diversity of tumors, and we want to see a mapping of diverse tumor biology to the tumor biology that we're seeing in the human across many different indications. So we licensed this system and have been building it, so you can see many

Starting point is 00:47:33 different perturbations that produce a lot of the tumor biologies, plural, that you see in the human. And then we also want to be able to get from this mouse system to biologically relevant, let's say, targets or genes in the human as well. So one of the fundamental problems in mouse systems is we share many genes with mice, but there are a lot of genes and biological processes we don't share with mice, as is obvious. And so oftentimes you run into these when you're developing drugs. It's, okay, you have a target, you know, some biology that works really well in mice. Maybe that doesn't even exist in humans or like maybe that pathway is like useless in humans.

Starting point is 00:48:13 So one of the things we've started to develop that we'll share more about soon is a way to use one of these models to essentially infer human biology from the mouse directly. And so we're in silico humanizing the mouse. So all the outputs in terms of the transcriptome from the mouse are in the form of the human genes. And so when we read out this mouse system, we're reading out in the form of human. Oh, okay. How do you validate that? I mean, that's a pretty impressive claim,

Starting point is 00:48:43 if you can do it. But, man, it seems like a tricky validation task. In my experience, both here at Noetik and at my previous employer, I can say Recursion, you know, a lot of the approach when you're building these types of models is you're trying to ask whether the models are recognizing

Starting point is 00:49:05 biology that you know to be true. So, for example, in the human context, we know that 12% of patients with lung cancer respond to immune checkpoint inhibitors. Do the models recognize those patients? Can they recover those patients

Starting point is 00:49:23 without training? Like, cold? Yeah, yeah. And we see that. And then when you go look at those patients, we see the underlying features of those patients map to what we know about those patients in the clinic.

Starting point is 00:49:35 In the mouse system, we have control genes. So we ask, if you look at the mouse tumor embedding space, do the tumors that should be really cold look really cold from the human inference? Cold in the sense that they don't have immune cells. No, my eyes, oh, yeah, yeah. And then hot in the sense of like lots of immune cells. So we try to build systems

Starting point is 00:49:56 where you have these handholds. And then the more of these examples that you know to be true that work that you see, the more confidence you have. Obviously, when you're into the regime of something very new, it's still uncertain to some extent. So the bridge between the mouse and the human is you build a world model on a human, build the world model on the mouse, and then you say what are the parallel structures in the two latent spaces? Is that kind of the intuition here? That's one thing that we're doing, but actually this is like even simpler,

Starting point is 00:50:28 which is that we've trained models on human H&E, spatial transcriptomics, et cetera, and then are just inferencing them on mouse H&E, which is easy to generate. And apparently mouse H&E looks enough like human H&E that the models think it's perfectly valid H&E and make predictions about, is this, like, immune hot, like immune infiltrated, versus

Starting point is 00:50:56 cold versus fibrotic versus some other tumor phenotype. And those predictions are accurate. So, you know, these are like some of the controls that Ron mentioned. So, you know, we know that in mice and humans and everything, if you knock down tumor cells' ability to present antigens to immune cells, you know, those are very cold. Like immune cells are nowhere near those tumors. And, you know, that's exactly what we see in the mouse. And that's exactly what the models, the in silico humanized models, predict. And then there are other examples where, again, we're recovering the biology that we expect to see there.

Starting point is 00:51:36 And then there are findings that are novel, but also make total biological sense. For instance, we have done knockouts in the mouse of about a half a dozen genes that are all in the same pathway. So you might predict that knocking down those genes is going to produce the same phenotype because they're on the same pathway. And a pathway is what?

Starting point is 00:52:01 Yeah, so a pathway is like protein A, signals to protein B, signals to protein C, and there's like a chain of events that leads to the cell having some behavior, you know, changes in its metabolism, its growth, et cetera. So these are, I don't know if you've ever seen these crazy-looking protein signaling diagrams

Starting point is 00:52:21 that, you know, make you want to stay away from biology. But, you know, like, you know, people have, you know, worked these out a lot, and they know that these two proteins interact physically and signal to each other and so forth. And so, you know, there's some chain of those interactions where this protein binds this protein and that causes it to regulate a gene that causes another protein to be formed, blah, blah, blah, until you get to some phenotype, meaning the cell changed the way it looks. Exactly. And so, you know, based on decades of biological literature doing experiments on these,

Starting point is 00:52:58 there's a very strong biological prior that if you hit gene A, gene B, gene C, and they're all in the same pathway, you should get similar phenotypes. I mean, this is kind of how, like, old-school genetics was done. And we see that with these in silico humanized mouse models, which is amazing to me as a biologist, that you have a model that's trained on human data, then you show it some mouse histology, and it's able to say these five different tumor genotypes all look like they have the same phenotype,

Starting point is 00:53:33 and lo and behold, they're five genes that are in the same pathway. So you guys, switching gears a little bit, because we want to talk about models on the Latent Space podcast. You guys recently, there was an interesting blog post on the Tario model. It's some transformer-based model. Do you want to talk about that? Sure, yeah. So this is like a new model architecture that we developed post sort of the first virtual cell model

Starting point is 00:54:01 model, OctoVC, that we developed. So Tario, this model is just a different transformer architecture. One major difference between it and, you know, our prior models, I guess if this is a model podcast, this is getting into like the self-supervised learning objective. So, you know, for a while, including with OctoVC, we were training models on what's called the masked autoencoding loss function or objective, where you have a piece of data, you chunk it up into small chunks, you mask out some of those chunks, and the training task is the model has to predict the masked out chunks from the revealed chunks. Like BERT.
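The masked autoencoding setup described above (chunk the data, hide some chunks, predict the hidden ones from the revealed ones) can be sketched as a data-preparation step. This is only the masking part, with no model; the token layout, the 75% mask ratio, and the function name are hypothetical choices for illustration and are not taken from OctoVC.

```python
import random

def mask_tokens(tokens, mask_ratio, rng):
    """Split tokens into (visible, targets) by hiding a random subset.
    Each entry keeps its original index so that a model could be told
    which positions it is being asked to reconstruct."""
    n_mask = round(len(tokens) * mask_ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    visible = [(i, t) for i, t in enumerate(tokens) if i not in masked_idx]
    targets = [(i, t) for i, t in enumerate(tokens) if i in masked_idx]
    return visible, targets

# Hypothetical tokens: (modality, channel, x, y, value) chunks.
tokens = [("rna", f"gene_{g}", x, 0, float(g + x)) for g in range(4) for x in range(5)]

rng = random.Random(0)  # fixed seed so the split is reproducible
visible, targets = mask_tokens(tokens, mask_ratio=0.75, rng=rng)
print(f"{len(visible)} visible chunks, {len(targets)} masked chunks to predict")
```

The training loss would then compare the model's predictions at the masked positions with the held-out values in `targets`.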

Starting point is 00:54:44 Yeah, exactly, like BERT. What are the chunks? Because this is multimodal, and, like, I would imagine the different channels contain wildly different levels of information. And I remember seeing something like 99% masking in OctoVC if I'm... Yeah, yeah. So... And I was like, that was kind of surprising because when you have, you know, 19,000 channels and maybe some of the channels are fairly, like, most of the signal is fairly sparse. Yeah. Then it seems like either there's a huge redundancy here in your data

Starting point is 00:55:16 or you really risk like just throwing the baby out with the bathwater. Yeah, what are the chunks? That totally depends on which modalities we're talking about. So for spatial transcriptomics, one chunk or one token might be the level of expression for a particular gene at a particular spatial location. For protein images, multiplex protein images, again, it might be, you know, the image patch for that particular protein at a particular location and so on.

Starting point is 00:55:49 And, you know, for like histology images, again, those are usually just patches of the image. So pretty standard, like vision transformer style. On the masking, and the maybe surprising result that, like, you can and actually need to mask out large amounts of the data to get the model to learn anything interesting: if you ran the hypothetical where you only mask out, like, 10% of the image, you know, maybe more like BERT, for instance, in language modeling, what do the models learn? They learn these kind of like boring behaviors,

Starting point is 00:56:27 like how to continue an edge a little bit, you know, between two like regions of an object or something. So they can learn that task very well, but they don't end up learning anything about sort of the holistic structure of the image data. And we found pretty early on at Noetic that, the same thing was true with these multimodal like transformers, where if you mask out a lot of it,

Starting point is 00:56:53 there are actually pretty strong correlations between where protein A is expressed and where protein B is expressed, and forcing the models to learn them is really what gives them this predictive power. And so Tario, though, yeah, is an autoregressive model. Yeah, exactly. So, yeah, that was going to be the pie.

Starting point is 00:57:12 And so, you know, prior models, including OctoVC, were of this masked autoencoding style training objective. Tario is an autoregressive model, which if you think about it is kind of a particular choice of masked autoencoding, except, you know, instead of randomly masking out parts of the data, you're always asking the model to predict the next token in a sequence. We know that this is something that scales very well with LLMs,

Starting point is 00:57:42 like training on the next-token prediction task, and it's still an open question, how do you get models of other data modalities to scale the way that LLMs have scaled? Tario was not actually our first attempt, but one of our subsequent attempts to bring that autoregressive, like next-token prediction task,

Starting point is 00:58:03 into modeling spatial transcriptomics data. We found that when we used this architecture and this task, we started to see much better scaling behavior where bigger models and especially at longer context lengths were really outperforming the smaller models at shorter context lengths. Because they can see further in an image?

Starting point is 00:58:26 Yeah, that's probably a big part of it. I think, like, there's actually a pretty subtle but very interesting result in that blog post with Tario, which is that you only really see the benefits of using larger models when you're looking at longer context lengths. And here, longer context really means, again, like you're seeing more tissue at once, more area at once.
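The autoregressive objective and the role of context length can be sketched as how training examples are formed: every position is predicted from the tokens before it, truncated to a maximum context. The token strings and the way tissue is flattened into a sequence are hypothetical; the transcript does not say how Tario orders its tokens.

```python
def next_token_examples(sequence, context_length):
    """Build (context, target) pairs for next-token prediction.
    The 'mask' is fixed rather than random: everything from position t
    onward is hidden, and only the last `context_length` tokens before
    t are visible."""
    examples = []
    for t in range(1, len(sequence)):
        start = max(0, t - context_length)
        examples.append((sequence[start:t], sequence[t]))
    return examples

# Hypothetical flattened sequence of spatial transcriptomics tokens.
sequence = ["cell1:GENE_A", "cell1:GENE_B", "cell2:GENE_A", "cell2:GENE_C"]

short = next_token_examples(sequence, context_length=1)
long = next_token_examples(sequence, context_length=3)
print(short[-1])  # last target predicted from one preceding token
print(long[-1])   # same target predicted from three preceding tokens
```

With a longer context the same target is predicted from more preceding tokens, which in the spatial setting corresponds to seeing more tissue at once.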

Starting point is 00:58:52 And I'm not, like, super deep into the language modeling literature, but I don't know if there's an analogous thing with, like, language models where, like, you only see these scaling behaviors at longer context. So it could be that what we're finding here is that, like, with patient data, you really do need to incorporate sort of more

Starting point is 00:59:13 of the patient's spatial context to really get the models to learn these more complicated, non-linear patterns in the spatial transcriptomics and take advantage of it. Is it possible part of this is because you have some number of low-expression genes and that the behavior is driven entirely by some under-modeling of low-expression genes? Yeah, definitely possible that the more context you have, the more likely you are to catch kind of these low-expression but highly predictive genes, et cetera. I would guess it's a combination of that and larger area. Like we've done some experiments, just like comparing models with the same amount of context

Starting point is 00:59:57 but in smaller or larger areas. And there definitely seems to be an advantage to looking at larger regions of tissue as well. I want to hear about, you did a big deal recently. You got a lot of press and I think have the distinction of being one of the only AI-for-bio tooling companies that is making money. Accident. So could you tell whatever you can disclose about that? We'd love to hear.

Starting point is 01:00:26 Yeah. So we were really excited to announce a deal with GSK. We licensed them OctoVC, which is our virtual cell foundation model. So we announced that back in January. It's a $50 million deal that includes an upfront payment and milestones, and then separate from that, it also includes an annual license fee, a model licensing fee.

Starting point is 01:00:49 You know, I think this was, you know, an attractive deal for both parties, for us and for GSK, because, you know, really the deal focuses on models that we've trained already

Starting point is 01:00:58 on lung cancer, colon cancer, allows us to, you know, provide them with access to the models. You know, GSK is one of, you know,

Starting point is 01:01:08 the top AI teams in biopharma. So, you know, they know how to use these types of capabilities, they can use them for their internal use. They can also use them to fine-tune on their data. So that was a really big sell for GSK as well, because, you know, GSK and every pharma is sitting on mountains and mountains of so-called translational data. So the types of data that we're training the models on that come from clinical trials, you know, pathology specimens across many different therapeutics. Everyone's sitting on a lot of this data,

Starting point is 01:01:41 and it's been very hard to unlock. And so, all of a sudden, you know, GSK can use our models, both to do simulations and to do therapeutic discovery, but they can also fine-tune the models on their data. And in a way, the model then becomes, you know, sort of GSK's version of a model. This was super exciting, you know, it was the first announced foundation model licensing deal in the space. And, you know, frankly, it was one, you know, we've been trying to do for a long time, even before Noetik. You know, I think a lot of companies have been trying to do these types of deals. And I think it's been historically slow for adoption on the pharma side.

Starting point is 01:02:17 And it's been slow to demonstrate like a very clear value proposition for different types of capabilities. And so what's unique about this deal is, you know, it doesn't look exactly like a software, you know, licensing framework for, let's say, a small amount of money with a number of seats that you're licensed. It looks like a real business development deal in the industry where there's a very significant multi-million dollar cash upfront and near-term payment. But then the substrate of the deal is not a molecule. It's not doing therapeutic discovery work together. The substrate is actually a model, which is what really made this pretty unique. Why do you think there's appetite for this suddenly? And it seems like almost whiplash. Yeah. It seems like only maybe a year or two ago that bio was dying and whatever.

Starting point is 01:03:10 And now suddenly there's this deal, Boltz is getting a ton of attention. There's so much attention on Isomorphic. People are AI-pilled. To some extent, increasingly more. I mean, maybe not totally, but increasingly more. People are, you know, in pharma, you know, across the industry, are seeing the value of different capabilities.

Starting point is 01:03:30 They're able to use some of the open source capabilities and they're able to demonstrate the value to themselves internally. And if you look at a pharma company, you know, these companies are working on dozens and dozens of programs. And so, you know, frankly my opinion is I think pharma increasingly want to be able to access models, not just for one collaboration where you and I are working together on this one program. They want to be able to access the technology across the whole pipeline. And so I think that's going to create sort of a driving force for not just, you know, bespoke, project-driven licensing, but actual broad licensing where a pharma can access the technology in many different, you know, therapeutic programs. Yeah. And I think also, you know, with the structure prediction models, protein structure

Starting point is 01:04:17 prediction, binding prediction models, there is like this massive public data set. There are increasing amounts of data. People can generate data to augment that. So, you know, there's enough data to the point where people can train very good models, but maybe not just on the data that any one biopharma company has. And I think that the same is true, but even more so for the types of models that we are building, which are, you know, foundation models at the patient biology level where, like, you know, no one company, I mean, these companies may have a lot of data, but it's, you know, scattered, it's siloed and pulling everything together to, like, train an actual foundation model may not be as easy as it sounds, like, within a single company, whereas we have,

Starting point is 01:05:04 just said, you know what, we're going to generate enough data ourselves to actually train a real foundation model. And that's the nice thing about being a startup here is like we can make that bet that you actually do benefit from generating all of this data in a, you know, uniformized way, like very high quality, et cetera, and then use that to develop and train the models. And my opinion is that you need to have data at that scale before you can even think about developing models that actually work. It's like you can't do the AI R&D, like, or build the algorithms until you have a good enough data set to tell you

Starting point is 01:05:47 whether your favorite algorithmic idea is actually working or not. That's a major advantage for us is like we have enough data to see, like, is my idea, or someone else's idea, about how to build a model actually leading to improvements there. Yeah, I mean, this is a good point. I mean, so like sometimes people ask me, well, why doesn't GSK just generate your data? So we just started generating data for years. There was no model.

Starting point is 01:06:16 It was like, how many years? Like two years, maybe a year and a half at least, before we had the first trained models working? Like, maybe a year and a half we had the first. So, I mean, certainly, yeah, like the OCTO-vc model, like we trained in 2024. So, yeah, that's like two years after. Yeah. When did Noetik start?

Starting point is 01:06:32 So we, too, how? Zero four years is this. So this is year four. And so we basically opened the lab. We hired a team. We got all the instruments. We started sourcing tumor samples. There was no prior here that any of this would work.

Starting point is 01:06:46 Like zero. Big crazy. Like, I was just going for it. And like, we just started generating data and like sourcing human tumors, processing. We built this whole processing pipeline to get the tumors into like these arrays and the formats. And it takes weeks to, you know, it takes literally two weeks for a machine to run a couple slides on the spatial transcriptomics. So you've got like these two-week runs where you're processing two slides.

Starting point is 01:07:10 And we're just churning data for months. And we couldn't even train up, we didn't even have enough data to train a model for like at least a year and a half. And then you're building like processing pipelines. You have to align all the data. You've got to like post-process it off the machine. So we sort of just built all this. And then like, let's say 18 months.

Starting point is 01:07:29 later, hey, I wonder if we can train a model off this stuff. And then it was not, like, it wasn't obvious. There wasn't like, oh, we're going to like off the shelf, you know, train this on some like open source architecture. You know, we've had, you know, Dan and the team have done a ton of work. Yeah, there wasn't really like anything major to go off of, I mean, there were like transformers developed for single cell data, but like incorporating spatial data into that was, you know, again, there just like weren't really data sets out there that people had been able to develop upon. So we do a lot of like custom model building and I enjoy that. I think people enjoy that. Because I have a lot of for joining. A lot of custom model building. Yeah, really unique, innovative

Starting point is 01:08:12 involved. Sorry, who are you looking for? Like what kind of people? Anybody excited about doing ML research on, again, this kind of alien landscape of data where you really have to figure out what's working from first principles and obviously the work we do should have very, very large impact. So definitely not restricted to people who have a biology background, you know, people who just like tackling very challenging machine learning problems and are, you know, open to learning the minimum amount of biology necessary to, like, make progress, I think, you know, would be great candidates. Talking to you guys reminds me a lot of the Leash Bio Labs, which I know that both of you are part of the Recursion

Starting point is 01:08:59 mafia. You know, I'm not. Yeah. Well, yeah. We're bringing them on the show in the future, too. So, yeah, yeah, we're looking forward to that. But, like, it's interesting because both of you seem to have really similar philosophies in that, like, you have deep convictions that, like, you're just going to start collecting

Starting point is 01:09:17 data before you know this is going to work. And you are going to just brute force it, go, go, go. And eventually, it will work. And, you know, you have signs. I don't know. I think that's really impressive. I wonder, is there something about Recursion, which is in the water, which has led to this sort of thinking of just like we're going to commit to doing things at scale. And it may not work at first.

Starting point is 01:09:38 You have to hit a certain point before it will. I mean, we failed a lot at the beginning. Yeah. You mean at Recursion? At Recursion. Yeah. And so, I said we had to build it from first principles. We really did.

Starting point is 01:09:50 And so we spent many years trying to figure out like, what should the data look like? Ian and myself, we're all involved in kind of platform development. how to design these datasets, how to design the experiments, iterative cycles over the years seeing things that did work, things that didn't work. And so at the end of, you know, coming out of recursion, I think what a lot of folks there had was like an understanding of, what are the things we need to think about so that even if I want to design a different

Starting point is 01:10:18 data set, you know, today, that's like totally different. What are the things that we learned and we had to learn like over mistakes, over, like, not mistakes, but like trial and error, basically, over that many months, that we would try to insert in our new approach. And so I don't know that everything that I've predicted at Noetik, in terms of, like, how to generate the dataset, has been important necessarily. I know that we could start at the very beginning and say, okay, well, let's make sure we do these 10 things.

Starting point is 01:10:46 I know every one of these 10 things was important before. Let's at least make sure we do these 10 things. I don't know that all 10 things are important for us today, but I would, you know, presume that, you know, many of them are, and it lets you sort of leapfrog that process of trial and error a little bit. Certainly we do have trial and error still, but hopefully we're not having to, you know, solve like, you know, 15 problems. Maybe we're only solving, you know, three problems, four problems, overall. So for small biotech startups, which are probably in the AI space, who are collecting their own data, their own data moat, like, do you have any advice or

Starting point is 01:11:21 any suggestions about how to be more successful there? I think you sort of need to, I mean, you think ahead to, okay, what am I trying to do on the machine learning side? And like, what is the right data for solving this problem? I think oftentimes I see like a lot of companies are like, okay, well, I want to generate X data set. I'm just going to generate X data set and I'm going to do machine learning on that. Like, that might not be the right data set. You might not have designed it the right way. You know, it doesn't follow that, like, any data set, it doesn't follow that that data set solves the problem we're trying to solve.

Starting point is 01:11:59 So, I don't know, for me, it's really, and even founding Noetik, it was, okay, what problem are we trying to solve? And then what are the data that are going to help solve that problem? And rather than, like, you know, going from the data directly to try to solve. I also, sorry, I also had a quick piece of advice, which is, like, you know, pay attention to where the technology is and where it's changing rapidly. So, you know, I finished my PhD in 2016. I did a lot of looking at spatial RNA like via this technique called in situ hybridization,

Starting point is 01:12:35 same technique that is like at the base of what we're doing. I could look at maybe two genes at a time on a single sample and that took me a full week of manual work. And, you know, I came to Noetik like five years later, six years later, and all of a sudden, you know, there are platforms where you can look at 1,000 genes or 20,000 genes at once. You know, it's a single machine that can run this assay. It's expensive, but it's just like data beyond the wildest dreams of Dan Bear in 2016. And that is only improving, like, rapidly. So I think it's important to see what the technology

Starting point is 01:13:21 of today, you know, allows and also where it's going in terms of what data to generate. And what does that pitch look like? So I'm going to generate data for a year and a half and then I spend $50 million and then... It wasn't 50, it was maybe closer to 10. But... So, yeah, I mean, it isn't. So, yeah, so you have to do that if, I mean, if you're going into a regime where there's no data, yeah, and you want to do something different, then, I mean, there's no shortcut

Starting point is 01:13:51 to it, right? You're going to have to generate the data set. And so you're not going to know the answer until it's there. And I mean, that's why a lot of companies are not going into that space where there are no data sets because, you know, I think it can be challenging to do that. I mean, I think a lot of smaller biotech AI startups will try this pattern where they first will either start with a public open source data set, or they will try a pilot where they internally collect a small amount of data and see if something works or it doesn't. And oftentimes there's almost like a critical point where below this, you're just not going to get a new signal. And you have to have conviction that you need to collect up to a certain point before you start like really driving something like fundamentally valuable.
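
The "critical point" idea above can be probed with a simple learning-curve experiment: train the same model on nested, increasing fractions of the data and look for the smallest fraction at which performance clears a baseline. The following is only a minimal sketch under stated assumptions. The function names, the fraction values, and the example scores are all hypothetical illustrations, not anything described by the speakers or taken from Noetik's pipeline.

```python
import random


def nested_subsets(sample_ids, fractions=(0.1, 0.4, 1.0), seed=0):
    """Build nested training subsets, one per fraction of the full dataset.

    Shuffling once and taking prefixes guarantees that every smaller
    subset is contained in every larger one, so score differences reflect
    added data rather than a different random draw.
    """
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    subsets = {}
    for frac in sorted(fractions):
        n = max(1, round(frac * len(shuffled)))
        subsets[frac] = shuffled[:n]
    return subsets


def first_fraction_above(scores_by_fraction, baseline, margin=0.0):
    """Return the smallest data fraction whose score beats baseline + margin.

    Returns None when no fraction clears the bar, which is the situation
    where a pilot-sized dataset shows no signal at all.
    """
    for frac in sorted(scores_by_fraction):
        if scores_by_fraction[frac] > baseline + margin:
            return frac
    return None


if __name__ == "__main__":
    ids = [f"sample_{i}" for i in range(30)]  # hypothetical sample IDs
    subsets = nested_subsets(ids)
    for frac, subset in subsets.items():
        print(frac, len(subset))
    # Hypothetical held-out scores one might record after training on each subset.
    scores = {0.1: 0.50, 0.4: 0.52, 1.0: 0.71}
    print(first_fraction_above(scores, baseline=0.5, margin=0.05))
```

In practice each subset would be passed to whatever training and evaluation routine is in use, and the held-out evaluation set would stay fixed across all fractions so the scores are comparable.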

Starting point is 01:14:39 Yeah. Yeah. I mean, imagine trying to train a foundation model on not enough data. Yeah. And then that's, it's sort of a little bit of your clinical trial. GPT-2, GPT-3, GPT-1, 2, and 3, like, there was a clear progression there. With each one of them, you could see there was something which worked with scale, and there was this insight to, oh, we're going to scale this up.

Starting point is 01:15:02 You know, sometimes with biological data, like, the process of collecting lots of data is just very expensive to begin with. You can't just take something off the shelf and expect that you're going to hit the threshold of, you know, GPT-3, like, usefulness. Yeah.

Starting point is 01:15:17 So, yeah. It definitely takes conviction. I think, you know, it also takes sort of like a scientific belief that there's a lot out there, like, that we just don't know yet and that you're not going to capture the biology you need to by having right now, like, an agent that reads all of the biological literature. Because, again, that's just like a tiny slice of what's out there. Like, this is, I don't know if it's a great analogy or if I'm going to botch the history here. But like, in astronomy, it required, like, Tycho Brahe, like,

Starting point is 01:15:51 collecting this enormous amount of astronomical data at his observatory, that then was the substrate for Kepler, you know, figuring out the first laws of motion of the planets. And then, you know, that was superseded by, like, Newton's laws and so forth. But, like, I sometimes don't know how you even get started without, like, this large repository of really high quality data to begin with. And, you know, maybe there's like a tragedy of the commons problem here of like who's going to generate that data and who's going to capture the value of it, but I'm very glad that we're

Starting point is 01:16:26 taking that bet and, you know, we're seeing it pay off. Yeah, I mean, this is not my expertise, but, you know, hypothetically speaking, yeah, how much of PDB do you need to train? I mean, there were some people who did that, yeah, and then you can get some pretty good models with, I think, one percent. Yeah, really. And there are people going back to the 1990s who argued that the PDB was already complete in the sense of, like, if you had a sufficiently smart algorithm you could have done a pretty reasonable job at protein folding even back then. Interesting. So you don't need a lot to get a pretty big boost, but the community was sort of independently collecting PDB data for quite some time. Yeah. Without necessarily

Starting point is 01:17:08 being convinced that this was going to lead to solving protein folding. Yeah, but then it was also, most of those structures were quite useful in and of themselves. So maybe that's the counterpoint. Oftentimes, just knowing a protein structure was very helpful, so it was a useful dataset. And did you see a transition from, like, early data? About how many samples did it take? I'm guessing probably on the order of a few hundred before there's like, yeah, there was definitely a moment, like, very soon after I joined where, like, the data set just kind of doubled in size overnight because there was like a huge bolus, and like the models immediately got a lot better at that point. And now we've run these more controlled experiments of seeing,

Starting point is 01:17:51 you know, what happens if you train on 10% of the data versus 40% versus 100%, what happens if you hold out all of the pancreatic cancer or all of the breast cancer. So, you know, we have a much better idea of what kind of diversity and scale we need now. I guess I would say if we were sticking to cancer, maybe we're not like that far off. I think, you know, again, if we end up generating a few hundred patients in a bunch of major and, you know, some minor indications, which we're, you know, going to do this year, like maybe that's enough to generalize to kind of all cancer, because there is a lot of shared biology in, you know, cancer and immune cells across different tissues and different, you know, mutations and so forth. But if you think,

Starting point is 01:18:41 about all of the disease biology that there is for a model to learn, you know, maybe that's like another order of magnitude. But even being able to solve all cancer biology would be pretty impressive. Yeah, to cure cancer would be great. Oh, if it's all cancer biology,

Starting point is 01:18:58 and do not say cure cancer, or so you never face. But yeah, at least if you go at least the bugle-ball, and just sort of a, like, just take one drug, if you could look at one drug mechanism across the whole of oncology, that's incredibly powerful. I mean, imagine what Merck has done with Keytruda.

Starting point is 01:19:15 Like, Merck has run hundreds of trials with Keytruda. Like, it might even be over 1,000 trials of Keytruda in different populations to find, you know, all these different indications. Okay, the subset of ovarian cancers, the subset of lung cancers, the subset of colon cancers. That's all been done, you know, by enrolling trials. If you can look at that biology from model embeddings and at least have a very well-defined starting point for, okay, if I'm going to run a trial, it doesn't have to be as broad as it would need to be

Starting point is 01:19:49 if I didn't have any answer, then that can be a really powerful tool for, you know, a diversity of mechanisms. Yeah. Maybe it's just like last point, like going back to the virtual cell hot takes. Like, you know, if your goal is to build like an actual mechanistic model

Starting point is 01:20:06 of an individual cell and then build up from one cell to an entire tissue and then, you know, tissue to patient and so forth. Like, you might need a lot more data and a lot more data modalities than, you know, just like gene expression or something like that. But, you know, we're taking much more of like a top-down approach of we're trying to first solve the problem of what is determining heterogeneity among actual patients and which of that variability is predictive of drug response. And my intuition is that you don't need to model the mechanism at the sub-cellular level necessarily to solve that problem of which patient should get which drug or which targets are important in which patients. And I saw a similar debate play out in neuroscience and

Starting point is 01:20:59 computational neuroscience where for a long time people were really trying to build these bio-physical models of individual neurons, and then they were going to stitch them together into models of, you know, the brain and so forth. And what actually ended up working in, you know, in terms of building computational models of the brain and behavior is this abstraction, you know, we're just going to treat individual neurons as, you know, linear, non-linear units and, you know, put them together in neural networks that are connected by, you know, linear weight matrices, and stack a bunch of layers together and then build neural network models of the brain

Starting point is 01:21:40 that abstract away kind of all of the details of biophysically what a neuron is doing. And those are now by far the most predictive models of how a given neuron is going to respond to real-world stimuli in a real brain. And my bet is that the same is going to be true for these models too, is that by modeling sort of,

Starting point is 01:22:04 at the level of functional tissue where you have a bunch of cells interacting in like a disease context that that's going to get you to the problem of predicting kind of the patient level behavior much faster than trying to first model a cell and then stitch a bunch of those cells together. Yeah, that makes sense to me. It's a good analogy. Do you have any call to action for the listeners? Yeah, I mean, I would say one, everyone should be excited about biology. You know, sometimes a lot of my hot takes on X recently are just that I feel like there's a huge amount of enthusiasm in sort of like the mainstream tech ecosystem and like people aren't really following a lot of like what's happening in the biology space. But at the same time,

Starting point is 01:22:51 like you're hearing, you know, frontier labs saying we're going to cure cancer. And people should actually look at the folks working on curing cancer or working on aging or working on areas of biology. These are really exciting, you know, problems. There are real, like, significant ML problems in the space. One call to action is I'd love for people to just, like, be more stoked about learning about applications of machine learning in, like, biological sciences and, like, solving some of these hard problems because I think these are the problems that are going to, like, massively impact humanity in, like, the next 10 years.

Starting point is 01:23:24 And we're just, like, really at the very beginning. Like, you know, maybe we're in the, like, first inkling of the ChatGPT moment for bio, but it's, like, very much just the very beginning. So I'd add, in line with that, to like really dig in and learn more about the details. I think, you know, a lot of the times it's presented as we have these protein folding models. We have these binding models. You know, we have AI for science agents that are, you know, like reading all of the literature and automating these computational biology workflows. And I think it's important to realize that there are a lot of problems in AI for biology, AI for biochemistry, et cetera, and some of them,

Starting point is 01:24:10 and they're very important. But like solving any one of those is not going to like solve the problem of how do we develop better therapeutics. And, you know, we're focused on, you know, a pretty particular slice of that process, which is, again, translating things that we know work well in some patients into actual, like, successful drug trials where we know exactly which patients to give them to. And that requires building foundation models at a particular level, you know, the patient level. But people should not be under the impression that, like, this is all going to be solved immediately because, you know, AI agents, like LLMs, are going to just read the literature and figure out what the right drug is. Like, there are a lot more data to generate. There's a

Starting point is 01:24:58 lot more ML problems to solve. And there's the need to translate those methods into actual successful drugs. And there's a lot of different places to contribute. It's a lot to do. Yeah, I'm good. Great. Thank you very much. Here we are.
