Microsoft Research Podcast - Abstracts: Zero-shot models in single-cell biology with Alex Lu
Episode Date: May 22, 2025
The emergence of foundation models has sparked interest in applications to single-cell biology, but when tested in zero-shot settings, they underperform compared to simpler methods. Alex Lu shares insights on why more research on AI models is needed in biological applications.
Transcript
Welcome to Abstracts, a Microsoft Research podcast that puts the spotlight on world-class
research in brief.
I'm Gretchen Huizinga.
In this series, members of the research community at Microsoft give us a quick snapshot,
or a podcast abstract, of their new and noteworthy papers.
On today's episode, I'm talking to Alex Lu, a senior researcher at Microsoft Research
and co-author of a paper called Assessing the Limits of Zero-Shot Foundation Models
in Single-Cell Biology. Alex Lu, wonderful to have you on the podcast. Welcome to Abstracts.
I'm really excited to be joining you today.
So let's start with a little background of your work in just a few sentences.
Tell us about your study and more importantly, why it matters.
Absolutely.
And before I dive in, I want to give a shout-out to the Microsoft Research intern who actually
did this work.
This was led by Kasia Kedzierska, who interned with us two summers ago in 2023, and she's the lead author on the study. But basically,
in this research, we study single-cell foundation models, which have recently rocked
the world of biology, because they basically claim to be able to use AI to unlock understanding
about single-cell biology. Biologists, for a myriad of applications,
everything from understanding how single cells differentiate
into different kinds of cells to discovering new drugs
for cancer, will conduct experiments
that measure how much of every gene
is expressed inside of just one single cell.
So these experiments give us a powerful view into the cell's internal state.
But measurements from these experiments are incredibly complex. There are about 20,000
different human genes. So you get this really long chain of numbers that measures how much
there is of 20,000 different genes. So figuring out what that really long chain of numbers means
is really difficult. And single-cell foundation models claim to be capable
of unraveling deeper insights than ever before.
So that's a claim that these works have made.
And in our recent paper, we showed that these models
may actually not live up to these claims.
Basically, we showed that single cell foundation models
perform worse in settings that are fundamental
to biological discovery than much simpler machine learning
and statistical methods that were used in the field before single-cell foundation models
emerged and are the go-to standard for unpacking meaning from these complicated experiments.
So in a nutshell, we should care about these results because they have implications
for the toolkits that biologists use to understand their experiments. They suggest that single-cell
foundation models may not be appropriate for practical use yet, at least in the discovery applications
that we cover.
Well, let's go a little deeper there. Generative pre-trained transformer models, GPTs, are
relatively new on the research scene in terms of how they're being used in novel applications,
which is what you're interested in, like single cell biology. So I'm curious, just sort of as a foundation, what other research has already been done in this area?
And how does this study illuminate or build on it?
Absolutely. Okay. So we were the first to notice and document this issue in single-cell foundation
models specifically. And this is because we proposed an evaluation method that, while common in other areas
of AI, had not been commonly used to evaluate single-cell foundation models.
We performed something called zero-shot evaluation on these models.
Prior to our work, most works evaluated single-cell foundation models with fine-tuning.
And the way to understand this is that single-cell foundation models are trained in a way
that tries to expose these models
to millions of single cells.
But because you're exposing them to such
a large amount of data, you can't really rely upon
that data being annotated or labeled
in any particular fashion.
So in order for them to actually do the specialized tasks
that are useful for biologists, you
typically have to add on a second training phase, we call this the fine-tuning phase, where
you have a smaller number of single cells, but now they are actually labeled for the
specialized tasks that you want the model to perform.
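To give a rough sense of what that second, fine-tuning phase looks like in practice, here is a minimal sketch; it is not code from any specific single-cell foundation model, and the tiny encoder, the ten cell-type classes, and the random data are all stand-ins for illustration.

```python
import torch
from torch import nn

# Stand-in for an encoder pretrained on millions of unlabeled single cells
# (a real single-cell foundation model would be far larger).
pretrained_encoder = nn.Sequential(nn.Linear(20000, 256), nn.ReLU())

# Fine-tuning phase: attach a task-specific head and train on a smaller,
# labeled dataset, e.g. predicting one of 10 hypothetical cell-type classes.
task_head = nn.Linear(256, 10)
model = nn.Sequential(pretrained_encoder, task_head)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Toy labeled batch: (cells x ~20,000 genes) expression and cell-type labels.
labeled_cells = torch.randn(32, 20000)
cell_type_labels = torch.randint(0, 10, (32,))

logits = model(labeled_cells)
loss = loss_fn(logits, cell_type_labels)
loss.backward()
optimizer.step()  # unlike zero-shot use, the model's weights change here
```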
So most people typically evaluate the performance of single-cell models after
they fine-tune these models. However, what we noticed is that this evaluation of these
fine-tuned models has several problems. First, it may not actually align with how these models
are actually going to be used by biologists. A critical distinction in biology is
that we're not just trying to interact with an
agent that has access to knowledge through its pre-training, we're trying to extend
these models to discover new biology beyond the sphere of that knowledge.
And so in many cases, the point of using this model, the point of an analysis, is to explore
the data with the goal of potentially discovering something new about the single cells that the
biologist worked with that they weren't aware of before. So in these kinds of cases, it is really tough to
fine-tune a model. There's a bit of a chicken-and-egg problem going on. If you don't know,
for example, that there's a new kind of cell in the data, you can't really instruct the model to
help you identify these kinds of new cells. So in other words, fine-tuning these models for those tasks essentially becomes
impossible.
So the second issue is that evaluations on fine-tuned models can sometimes mislead us
in our ability to understand how these models are working. So for example, the claim behind
single-cell foundation model papers is that these models learn a foundation of biological
knowledge by being exposed to millions of single cells, right?
But it's possible that when you fine-tune a model,
any performance increases that you see are simply because
you're using a massive model that is really sophisticated and really large, and even without any exposure to any cells at all,
that model is going to do perfectly fine.
So going back to our paper,
what's really different about this paper
is that we propose zero-shot evaluations for these models.
What that means is that we do not fine tune the model at all,
and instead we keep the model frozen
during the analysis step.
So how we specialize
it to the downstream task instead is that we extract the model's internal embedding
of single-cell data, which is essentially a numerical vector that contains information
that the model is extracting and organizing from the input data. So it's essentially how
the model perceives single-cell data and how it's organized in its own internal
state. So basically, this is a better way for us to test the claim that single-cell
foundation models are learning foundational biological insights, because if they actually
are learning these insights, they should be present in the model's embedding space even
before we fine-tune the model.
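As a rough sketch of the zero-shot setup being described (not the paper's actual code), the idea is to run cells through a frozen model and keep only its internal embedding; the small stand-in encoder below is hypothetical and exists only so the example runs end to end.

```python
import numpy as np
import torch
from torch import nn

# Hypothetical stand-in for a pretrained single-cell foundation model's encoder;
# a real model would come with its own architecture and loading utilities.
frozen_model = nn.Sequential(nn.Linear(20000, 128), nn.ReLU(), nn.Linear(128, 64))
frozen_model.eval()  # zero-shot: no fine-tuning, no weight updates

@torch.no_grad()  # ensure no gradients (and therefore no training) happen
def embed_cells(model: nn.Module, expression: np.ndarray) -> np.ndarray:
    """Map each cell's ~20,000-gene expression vector to the model's
    internal embedding, leaving the model itself untouched."""
    x = torch.as_tensor(expression, dtype=torch.float32)
    return model(x).cpu().numpy()  # one embedding vector per cell

# Toy expression matrix: 100 cells x 20,000 genes.
expression_matrix = np.random.rand(100, 20000).astype(np.float32)
embeddings = embed_cells(frozen_model, expression_matrix)
print(embeddings.shape)  # (100, 64): these vectors are what gets evaluated
```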
Well let's talk about methodology on this particular study. You focused on assessing
existing models in zero-shot learning for single cell biology. How did you go about
evaluating these models?
Yes. So let's dive deeper into how zero-shot evaluations are conducted. So the premise
here is that we rely upon the fact that if these models are actually
learning foundational biological insights, then if you take the model's internal representation
of cells, cells that are biologically similar should be close in that internal representation,
whereas cells that are biologically distinct should be further apart.
And that is exactly what we tested in our study.
We compared two popular single-cell foundation models.
And importantly, we compared these models
against older and reliable tools that biologists have
used for these kinds of analyses.
So these include simpler machine learning methods,
like scVI, statistical algorithms like Harmony,
and even basic data preprocessing steps,
like filtering your data down to a more variable subset of genes.
So basically we tested embeddings from our two single-cell foundation models against
these baselines in a variety of settings, and we tested the hypothesis that biologically
similar cells should be similar across these distinct methods and across the datasets.
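As a loose illustration of this kind of comparison (not the exact metrics or protocol from the paper), one can ask how well cells sharing an annotated cell type group together in each embedding, for instance with a silhouette score; the embeddings and labels below are random stand-ins, and in practice they would come from the foundation models, scVI, Harmony, or a highly variable gene subset.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def embedding_score(embedding: np.ndarray, cell_types: np.ndarray) -> float:
    """Higher means cells of the same annotated type sit closer together
    in the embedding than cells of different types."""
    return silhouette_score(embedding, cell_types)

rng = np.random.default_rng(0)
cell_types = np.repeat(["type_a", "type_b", "type_c"], 50)

# Random stand-ins for embeddings of the same 150 cells from different methods.
candidate_embeddings = {
    "foundation model": rng.normal(size=(150, 32)),
    "simpler baseline": rng.normal(size=(150, 32)),
}

for name, emb in candidate_embeddings.items():
    print(f"{name}: silhouette = {embedding_score(emb, cell_types):.3f}")
```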
Well, and as you did the testing, you obviously were aiming towards research findings, which
is my favourite part of a research paper. So tell us what you did find and what you
feel the most important takeaways of this paper are.
Absolutely. So in a nutshell, we found that these two newly proposed single-cell
foundation models substantially underperformed compared to other methods. So to contextualize
why that is such a surprising result, there is a lot of hype around these methods. So basically,
I think that it's a very surprising result
given how hyped these models are and how quickly people were adopting them. But our results basically
caution that they really shouldn't be adopted for these purposes.
Yeah, so this is serious real world impact here in terms of if models are being adopted and adapted in these applications, how reliable
are they, etc. So given that, who would you say benefits most from what you've discovered
in this paper and why?
Okay, so two ways, right? So I think this has at least immediate implications on the
way that we do discovery in biology. And as I've
discussed, these experiments are used for cases that have practical impact, drug discovery
applications, investigations into basic biology. But let's also talk about the impact for
methodologists, people who are trying to improve these single cell foundation models, right?
I think at their base, they're really exciting
proposals, because if you look at some of the prior and less sophisticated
methods, they tended to be more bespoke. So the excitement of single-cell
foundation models is that you have this general-purpose model that can be used for
everything, and while they're not living up to that purpose just now, just currently,
I think that it's important that we continue
to hold onto that vision, right?
Yeah.
So if you look at our contributions in that area,
single-cell foundation models are a really new proposal.
So it makes sense that we may not
know how to fully evaluate them just yet.
So you can view our work as basically being a step
toward more rigorous evaluation of these models.
Now that we did this experiment, I think the methodologists know to use this as a signal
on how to improve the models and if they're going in the right direction.
In fact, you are seeing more and more papers adopt zero-shot evaluations since we put out
our paper then.
This essentially helps future computer scientists that are working on single-cell foundation
models know how to train better models.
That said, Alex, finally, what are the outstanding challenges that you identified for zero-shot
learning research in biology? And what foundation might this paper lay for future research agendas
in the field?
Yeah, absolutely. So now that we've shown single-cell foundation models don't necessarily
perform well, I think the natural question on everyone's mind is how do we actually
train single-cell foundation models that live up to that vision, that can help
us discover new biology. So I think in the short term, yeah, we'll actively be investigating
many hypotheses in this area. So for example, my colleagues Lorin Crawford
and Ava Amini, who were co-authors on the paper,
recently put out a preprint on understanding
how training data composition impacts model performance.
And so one of the surprising findings that they had
was that many of the training datasets
that people use to train single-cell foundation models
are highly redundant, to the point that you can even sample just a tiny fraction of the data and get basically the same performance. But you can also
look forward to many other explorations in this area as we continue to develop this research
agenda. But also zooming out into the bigger picture, I think one major takeaway from this
paper is that developing AI methods for biology requires thought about the
context of use. This is obvious for any AI method, but I think people have gotten just too used
to taking methods that work out there for vision or natural language, maybe in the consumer
domain, and then extrapolating these methods to biology and expecting that they will work in the same way.
So for example, one reason why zero-shot evaluation was not routine practice for
single-cell foundation models prior to our work, I mean, we had to first establish
that as a practice for the field, was because I think people who have been working in AI
for biology have been looking to these more mainstream AI domains to shape their work.
And so with single-cell foundation models, many of these models are adapted from large language
models in natural language processing, recycling the exact same architecture, the exact same code,
basically just recycling practices in that field. So when you look at practices in
more mainstream domains, zero-shot evaluation is definitely explored in those domains, but it's more of a niche instead of being
considered central to model understanding.
So again, because biology is different from mainstream natural language processing, it's a scientific
discipline, zero-shot evaluation becomes much more important, and you often have no choice but
to use these models that way. So in other words,
I think that we need to be thinking carefully about what it is that makes training a model for
biology different from training a model, for example, for consumer purposes.
Alex Lu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or
you can read it on the Genome Biology website.
See you next time on Abstracts.