Microsoft Research Podcast - Abstracts: Zero-shot models in single-cell biology with Alex Lu

Episode Date: May 22, 2025

The emergence of foundation models has sparked interest in applications to single-cell biology, but when tested in zero-shot settings, they underperform compared to simpler methods. Alex Lu shares insights on why more research on AI models is needed in biological applications.

Transcript
Starting point is 00:00:00 Welcome to Abstracts, a Microsoft Research podcast that puts the spotlight on world-class research in brief. I'm Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers. On today's episode, I'm talking to Alex Lu, a senior researcher at Microsoft Research and co-author of a paper called Assessing the Limits of Zero-Shot Foundation Models in Single-Cell Biology. Alex Lu, wonderful to have you on the podcast. Welcome to Abstracts.
Starting point is 00:00:42 I'm really excited to be joining you today. So let's start with a little background of your work in just a few sentences. Tell us about your study and, more importantly, why it matters. Absolutely. And before I dive in, I want to give a shout-out to the Microsoft Research intern who actually did this work. This was led by Kasia Kedzierska, who interned with us two summers ago in 2023, and she's the lead author on the study. But basically, in this research, we study single-cell foundation models, which have really recently rocked
Starting point is 00:01:16 the world of biology, because they basically claim to be able to use AI to unlock understanding about single cell biology. Biologists for a myriad of applications, everything from understanding how single cells differentiate into different kinds of cells to discovering new drugs for cancer, will conduct experiments with the measure of how much of every gene is expressed inside of just one single cell. So these experiments give us a powerful view
Starting point is 00:01:44 into the cells' internal state. But measurements from these experiments give us a powerful view into the cells of the Antonio state. But measurements from these experiments are incredibly complex. There are about 20,000 different human genes. So you get this really long chain of numbers that measure how much there is of 20,000 different genes. So do what I mean in this really long chain of numbers is really difficult. And single cell foundation models claim to be capable of unraveling deeper insects than ever before. So that's a claim that these works have made. And in our recent paper, we showed that these models
Starting point is 00:02:14 may actually not live up to these claims. Basically, we showed that single cell foundation models perform worse in settings that are fundamental to biological discovery than much simpler machine learning and statistical methods that were used in the field before single-cell foundation models emerged and are the go-to standard for unpacking meaning from these complicated experiments. So in a nutshell, we should care about these results because it hides implications on the toolkits that biologists use to understand their experiments. I will suggest that single-cell
Starting point is 00:02:43 foundation models may not be appropriate for practical use structure, at least in the discovery applications that we cover. Well, let's go a little deeper there. Generative pre-trained transformer models, GPTs, are relatively new on the research scene in terms of how they're being used in novel applications, which is what you're interested in, like single cell biology. So I'm curious, just sort of as a foundation, what other research has already been done in this area? And how does this study illuminate or build on it? Absolutely. Okay. So we were the first to notice and document this issue in single cell foundation model specifically. And this is because we have proposed a valuation method that while are common in other areas
Starting point is 00:03:27 of AI, has just been commonly used to evaluate single cell foundation models. We performed something called zero-shot evaluation on these models. Prior to the work, most works evaluated single cell foundation models with fine tuning. And the way to understand this is because single cell foundation models with fine tuning. And the way to understand this is because single cell foundation models are trained in a way that tries to expose these models to millions of single cells. But because you're exposing it to a wide large amount of data, you can't really rely upon
Starting point is 00:03:55 the state of being annotated or labeled in any particular fashion then. So in order for them to actually do the specialized tasks that are useful for biologists, you typically have to add on a second training base, we call this the fan tuning base, where you have a smaller number of single cellers, but now they are actually labeled with the specialized tasks that you want the model to perform. So most people typically evaluate the performance of single cell models after
Starting point is 00:04:25 they fine-tune these models. However, what we notice is that this evaluation of these fine-tuned models has several problems. First, it may not actually align with how these models are actually going to be used by biologists then. A critical distinction in biology is that we're not just trying to interact with an agent that has access to knowledge through its pre-training, we're trying to extend these models to discover new biology beyond the sphere of influence. And so in many cases, the point of using this model, the point of an analysis, is to explore the data with the goal of potentially discovering something new about the single cell that the
Starting point is 00:05:03 biologist worked with that they weren't aware of before. So in these kinds of cases, it is really tough to fine tune a model. There's a bit of a chicken and egg problem going on. If you don't know, for example, there's a new kind of cell in the data, you can't really instruct the model to help us identify these kinds of new cells. So in other words, bio-initially these models for those tasks essentially becomes impossible to them. So the second issue is that evaluations on bio-initially models can sometimes lead us in our ability to understand how these models are working. So for example, the claim behind single cell foundation model papers is that these models learn a foundation of biological
Starting point is 00:05:42 knowledge by being exposed to millions of single cells and it's rich, rich, and in faith, right? But it's possible when you try and tune a model, it may just be that any performance increases that you see using the model is simply because that you're using a massive model that is really sophisticated, really large, and even if any exposure to any cells at all then, that model is going to do perfectly fine then. So going back to our paper, what's really different about this paper is that we propose zero-shot evaluations for these models.
Starting point is 00:06:16 What that means is that we do not fine tune the model at all, and instead we keep the model frozen during the analysis step. So how we specialize it to be downstream tasking instead is that we extract the model's internal embedding of single cell data, which is essentially a numerical vector that contains information that the model is extracting and organizing from input data. So it's essentially how the model perceives single cell data and how it's organized in its own internal
Starting point is 00:06:45 state. So basically, this is a better way for us to test the claim that single cell foundation models are learning foundational biological insights, because if they actually are learning these insights, they should be present in the models in button space even before we found out the model. Well let's talk about methodology on this particular study. You focused on assessing existing models in zero-shot learning for single cell biology. How did you go about evaluating these models? Yes. So let's dive deeper into how zero-shot evaluations are conducted. So the premise
Starting point is 00:07:22 here is that we rely upon the fact that if these models are body learning foundational biological insights, if you take the model's internal representation of cells, then cells that are biologically similar should be close in that internal representation, where cells that are biologically distinct should be further apart. And that is exactly what we tested in our study. We compared two popular single cell foundation models. And importantly, we compared these models against order and reliable tools that biologists have
Starting point is 00:07:53 used for regulatory analyses. So these include simpler machine learning methods, like SCBR, statistical algorithms like Harmony, and even basic data preprocessing steps, just like filtering your data down to a mobile bus subset of gene sense. So basically we tested embeddings from our two single cell foundation models against this baseline in a variety of segments and we tested the hypothesis that biologically
Starting point is 00:08:19 similar cells should be similar across these distinct methods across the datasets. Well, and as you did the testing, you obviously were aiming towards research findings, which is my favourite part of a research paper. So tell us what you did find and what you feel the most important takeaways of this paper are. Absolutely. So in a nutshell, we found that these two newly proposed single-cell foundation models substantially underperformed compared to other methods. So to contextualize why that is such a surprising result, there is a lot of hype around these methods. So basically, I think that it's a very surprising result
Starting point is 00:09:05 given how high these models are and how people were adopting them then. But all results basically caution that really they shouldn't really be adopted for these use purposes. Yeah, so this is serious real world impact here in terms of if models are being adopted and adapted in these applications, how reliable are they, etc. So given that, who would you say benefits most from what you've discovered in this paper and why? Okay, so two ways, right? So I think this has at least immediate implications on the way that we do discovery in biology. And as I've discussed, these experiments are used for cases that have practical impact, drug discovery
Starting point is 00:09:52 applications, investigations into basic biology. But let's also talk about the impact for methodologists, people who are trying to improve these single cell foundation models, right? I think at the base, they're really excited proposals, because if you look at some of the things that prior and less sophisticated methods couldn't do, they tended to be more briefly spoke. So the excitement of single cell foundation models is that you have this general purpose model that can be used for everything and while they're not living up to that purpose, just now, just currently, I think that it's important that we continue
Starting point is 00:10:26 to bank onto that vision, right? Yeah. So if you look at our contributions in that area, we have single-cell foundation models of a really new proposal. So it makes sense that we may not know how to fully evaluate them just yet then. So you can view our work as basically being a step
Starting point is 00:10:42 toward more rigorous evaluation from these models. Now that we did this experiment, I think the methodologists know to use this as a signal You can view our work as basically being a step towards more rigorous evaluation from these models. Now that we did this experiment, I think the methodologists know to use this as a signal on how to improve the models and if they're going in the right direction. In fact, you are seeing more and more papers adopt zero-shot evaluations since we put out our paper then. This essentially helps future computer scientists that are working on single-cell foundation models know how to train better models.
Starting point is 00:11:06 That said, Alex, finally, what are the outstanding challenges that you identified for zero-shot learning research in biology? And what foundation might this paper lay for future research agendas in the field? Yeah, absolutely. So now that we've shown single cell foundation models don't necessarily perform well, I think the natural question on everyone's mind is how do we actually train single cell foundation models that live up to that vision, that can perform in helping us discover new biology then. So I think in the short term, yeah, we'll actively be investigating many hypotheses in this area. So for example, my colleagues, Leon Crawford
Starting point is 00:11:46 and Alvar Muni, who were co-authors in the paper, recently put out a preprint understanding how training data composition impacts model performance. And so one of the surprising findings that they had was that many of the training data set that people used to train SQL Server Foundation models are highly redundant, to the point that you can even sample just a tiny fraction of the data and get basically the same performance. But you can also look forward to many other explorations in this area as we'll continue to develop this research
Starting point is 00:12:14 agenda then. But also zooming out into the bigger picture, I think one major takeaway from this paper is that developing AI methods for biology requires thought about the context of use. This is obvious for any AI method then, but I think people have gotten just too used to taking methods that work out there for natural vision or natural language, maybe in the consumer domain, and then extrapolating these methods to biology and expecting that they will work in the same way. So for example, one reason why solar shock evaluation would not routine practice for single cell foundation models prior to our work, I mean, we would have to first establish that as a practice for the field, was because I think people who have been working in AI
Starting point is 00:13:01 for biology have been looking to these more mainstream AI domains to shape the work stuff. And so with single-cell foundation models, many of these models are adopted from large language models with natural language processing, recycling the exact same architecture, the exact same code, basically just recycling practices in that field. So when you look at like practices in like more mainstream domains, zero-shot evaluation is definitely explored in those domains, but it's more of like, a niche instead of being considered central to model understanding. So again, because biology is different from mainstream language processing, it's a scientific discipline, zero-shot evaluation becomes much more important, and you have no choice but
Starting point is 00:13:43 to use these models, So in other words, I think that we need to be thinking carefully about what is that makes training a model for biology different from training a model for example for consumer purposes. Alex Liu, thanks for joining us today and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms forward slash abstracts or you can read it on the genome biology website. See you next time on abstracts. you
