No Priors: Artificial Intelligence | Technology | Startups - The AI Will See You Now: Exploring Biomedical AI and Google’s Med-PaLM2 With Karan Singhal
Episode Date: May 18, 2023
What if AI could revolutionize healthcare with advanced large language models? Sarah and Elad welcome Karan Singhal, Staff Software Engineer at Google Research, who specializes in medical AI and the development of Med-PaLM 2. On this episode, Karan emphasizes the importance of safety in medical AI applications and how language models like Med-PaLM 2 have the potential to augment scientific workflows and transform the standard of care. Other topics include the best workflows for AI integration, the potential impact of AI on drug discoveries, how AI can serve as a physician's assistant, and how privacy-preserving machine learning and federated learning can protect patient data while pushing the boundaries of medical innovation. No Priors is now on YouTube! Subscribe to the channel on YouTube and like this episode. Show Links: May 10, 2023: PaLM 2 Announcement April 13, 2023: A Responsible Path to Generative AI in Healthcare March 31, 2023: Scientific American article on Med-PaLM February 28, 2023: The Economist article on Med-PaLM KaranSinghal.com Sign up for new podcasts every week. Email feedback to show@no-priors.com Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @thekaransinghal Show Notes: [00:22] - Google's Medical AI Development [08:57] - Medical Language Model and Med-PaLM 2 Improvements [18:18] - Safety, cost/benefit decisions, drug discovery, health information, AI applications, and AI as a physician's assistant. [24:51] - Privacy Concerns - HIPAA's implications, privacy-preserving machine learning, and advances in GPT-4 and Med-PaLM 2. [37:43] - Large Language Models in Healthcare and short/long term use.
Transcript
Welcome to No Priors.
Today we're speaking with Karan Singhal, a researcher at Google where he is a leader on medical
AI, specifically on Med-PaLM 2, where he and a team are working on a responsible path
to generative AI in health care.
Google just announced the launch of its next generation language model, PaLM 2, with improved
multilingual, reasoning, and coding capabilities, which is behind Med-PaLM 2.
So it's a great time to be speaking with Karan about everything he and his team are
working on. Karan, welcome to No Priors.
Hey, guys.
So you've been working in this field for a long time.
Tell us about how you ended up working on medical AI at Google.
I think you built a fake news detector using AI as a 19-year-old.
Yeah, that was one of my first AI projects.
I really got into AI thinking about how it could be used in socially responsible ways.
And for me, I was thinking around the time of the 2016 election, maybe a little bit
naively, that AI-based solutions could be a bit of help for, you know,
things like detecting misinformation. I think in the longer run, I mean, I've thought of
that as kind of a more naive project. And I think in the longer run, I've been thinking more
about, you know, how I can help shape the trajectory of AI to be more beneficial, more broadly.
And I think for me, thinking about the medical setting has been motivated largely by thinking
about the fact that, you know, it's a great place to think about concerns around safety,
reducing hallucination and misinformation as well here, you know, thinking about how we can produce
medical question answers that are less likely to be harmful and all these kinds of things.
And, you know, that motivation, I think, has driven us to this point where really going for
the jugular in terms of thinking about how to train these models, make them better in the setting.
And so very excited about that kind of work.
Have you been working on the medical domain your entire time with Google?
No. For me, this is just something I've gotten into in the last year and a half. So I've been
new to it. I've been learning from an excellent team, and it's been an amazing journey so far.
What else has been most interesting in your work at Google so far?
Yeah, I started out working in representation learning and federated learning. So this is
kind of the technology, representation learning in particular is kind of the technology underlying
a lot of the deep neural networks of today, including GPT-3, GPT-4, and so on. And so this is
largely about learning representations of text, of images, of other modalities, such that you
can efficiently encode them, you can learn from them in the future, you can generalize to
new text and images and so on. So work on this really started, you know, back in the beginning
of the deep learning era, like in 2013, with convolutional neural networks and scaling those up
and word2vec around 2015 and GloVe and all these things. And I think since then, you know,
we've been working on technologies around self-supervised learning,
around doing that in a privacy-preserving way.
And so, you know, after a couple of years of working on that at Google,
I had the opportunity to kind of quickly grow and start to lead a team.
I kind of got to the point where I was thinking, like,
okay, I've upskilled in a lot of ways.
I've gotten to the point where I can mentor many other researchers in a lot of ways.
And now it's a great time to be thinking about my next thing
and, you know, going for something ambitious in terms of shaping the trajectory of AI.
And so, you know, about a year and a half
ago, a few of us had the idea to think about this medical setting as kind of a setting in which
these concerns are especially important, and there was a ripe opportunity to think about
this paradigm of foundation models and medical AI. And so within Google, we had the opportunity
to pitch what's called a Brain Moonshot, which is kind of like an internal incubator program
for ambitious research projects. And this is, you know, a lot of cool research projects that
you've heard of from Google have eventually come out of this program. As we pitched that, we got
accepted and funded. We got the ability to kind of get a bunch of compute to bring other folks
on board with the sponsorship of a bunch of leaders. And our first thing together was really
Med-PaLM. And so that was a really amazing thing for us to be able to work on together.
Can you talk a little bit about PaLM and how that's related to Med-PaLM, and what PaLM is to begin
with, and then how Med-PaLM is different? Yeah, absolutely. I mean, so the original Med-PaLM work
built on this model called PaLM, which stands for Pathways Language Model. And so this is really
an infrastructure that Google has built to be able to scale up large language model training,
that is kind of Google-wide.
And so the first PaLM model was released in 2022, which was kind of this 540B decoder-only
transformer model at the time, the largest densely activated model.
And, you know, it kind of realized these breakthrough achievements in code, in multilingual
capabilities, in reasoning.
And so, you know, I think a lot of the work
with respect to kind of improving benchmarks specifically that we're seeing with, like,
PaLM, Med-PaLM, GPT-4 recently, I think all comes down to a lot of the improvements that were made during
PaLM, during the training of PaLM. And so, you know, shortly after PaLM, there was this Minerva work,
where maybe like a few months after the PaLM work itself, people were able to show that on STEM
benchmarks, there was this kind of zero-to-100, or zero-to-60 at least, effect where, you know, you went from
random chance to, you know, solid performance across a bunch of benchmarks.
And that laid the foundation for a lot of the work that Jason Wei and others have done on,
you know, thinking about emergent abilities of large language models.
And so for us, that was part of the motivation for looking at multiple choice benchmarks as well
for Med-PaLM.
And so for Med-PaLM in particular, what we did was we took PaLM, this kind of general large
language model trained on web-scale data, and then kind of further aligned it to the medical
domain. We evaluated the base model, but also thought about, like, given its limitations in long-form
medical question answering, thinking about things like safety, factuality, low likelihood of,
you know, outputting an answer with bias, what do we need to do to kind of better align that model
with this domain? And so really Med-PaLM was an attempt to do that. Yeah, so basically it sounds like
you started off with PaLM, and PaLM was tested against a bunch of different types of tests, right?
And so you could take the MCAT or you could take other types of effectively tests for professional accreditation or for knowledge understanding.
And then it sounds like you then said, hey, this seems really interesting, right?
We're starting to get really good performance here.
And so can we do something that's in the medical domain specifically?
And that was Med-PaLM.
And so how did you do that alignment that you mentioned?
Was it some form of RLHF?
Was it some other form of fine tuning?
Was it how you train the model to begin with?
Like, what was the difference in terms of Med-PaLM versus PaLM?
Yeah, absolutely. I mean, when we tried evaluating PaLM in the medical setting, we noticed it was performing pretty well out of the box on multiple choice questions. And when we took a variation of PaLM, the Flan-PaLM model, which was, again, work from Jason Wei and team, you know, this is an instruction-tuned model, a model that's been trained to follow instructions better.
You know, again, it was able to perform quite well out of the box. And this was the first model that was able to perform above the pass mark on the MedQA set of USMLE-style exam
questions. But then what we noticed is that when we evaluated it on long-form medical question
answering, like actually getting the model to generate a response, there were a lot of limitations.
And when we compared that to clinician performance, it actually didn't do super well. And so really,
that was the motivation for that Med-PaLM-specific alignment. And so what we did there was really
thinking about instruction prompt tuning, which was this technique which we explored in the Med-PaLM
paper. It's kind of a data-efficient technique, a technique that doesn't require too much data
to work, because, you know, getting labels from doctors is expensive. It took a bunch of
expert demonstrations of good behavior from doctors and then used those to tune the parameters of
the model, and did that in a way that's a little bit more learned than prompting, but also less expensive than full fine-tuning.
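To make that idea concrete, here is a minimal sketch of what soft prompt tuning can look like, assuming a generic decoder-only model in PyTorch that accepts input embeddings. The class, its names, and the prompt length are illustrative assumptions, not the Med-PaLM implementation, which used instruction prompt tuning with physician-written demonstrations.

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    """Illustrative prompt tuning: freeze the base LM and learn only a small
    block of "virtual token" embeddings prepended to every input."""

    def __init__(self, base_lm: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.base_lm = base_lm
        for p in self.base_lm.parameters():
            p.requires_grad = False  # the large model stays frozen
        # The only trainable parameters: prompt_len x embed_dim soft prompt.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings
        batch_size = input_embeds.shape[0]
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return self.base_lm(torch.cat([prompt, input_embeds], dim=1))
```

Compared to full fine-tuning, only a tiny number of parameters are updated, which is one reason a relatively small set of expert demonstrations can be enough to steer behavior.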
And so you did that. And then I guess if you start looking
at this now, the shift from PaLM to PaLM 2 and from Med-PaLM to Med-PaLM 2, did you basically just
reproduce that same approach for Med-PaLM 2, or did you do anything different there?
Yeah, and this is the work of many folks other than myself, so just to preface it with that.
I mean, I think a few things that have been important have been, one is better objectives for pre-training, and using something like a mixture-of-objectives training objective.
And so that's been something that's, you know, been crucial.
And so this is work that started with UL2, a paper that was released also last year.
And then two other things that ended up being super important.
One is following the optimal scaling laws that were empirically evaluated again in this work.
And I think there's been a few works that have tried to do this from OpenAI and DeepMind.
And again, this work tried to understand in this context, what are the optimal scaling laws with respect to data and compute, and how do you trade those things off?
And so this paper, again, found something similar to the Chinchilla paper, which was that the total amount of data being used for these models was relatively low compared to the number of parameters,
and that if we wanted to add in more data,
we could do so,
we could train a better model in a more compute-efficient way.
So this model also did that.
So that's an important improvement as well.
And the third thing was kind of improvements in the data
that were used to train the model.
And so this especially focused on multilingual data,
including more multilingual data and more code data
in a bunch of different coding languages as well.
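To put rough numbers on the scaling-law point above, here is a back-of-the-envelope calculation in the spirit of the Chinchilla result, using the approximate C ~= 6 * N * D training-compute rule and a roughly 20-tokens-per-parameter ratio. The numbers are illustrative rules of thumb from that line of work, not PaLM 2 training details.

```python
def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Roughly compute-optimal model size N and token count D for a training
    budget C, using C ~= 6 * N * D and a target D/N ratio (Chinchilla-style)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_split(1e24)  # hypothetical 1e24 FLOP budget
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")
# For the same budget, a much larger model trained on far fewer tokens would be
# less compute-efficient than this more data-heavy configuration.
```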
Maybe just zooming out a little bit in terms of when you might apply
some of these different techniques to align a model to a specific domain.
Do you have a framework in your mind for why you might do full pre-training from scratch,
why you might do fine-tuning, why you might do a more efficient form of fine-tuning
and when you can just get away with prompt tuning or prompting?
How do you think about that?
Yeah, this is a great question.
I think it really comes down to the data that's available,
both in quantity and relevance to a particular topic.
I think if you have an infinite supply of data
that's relevant for the specific problem
that you're trying to solve,
then probably the best thing to do
is pre-train everything from scratch
and do everything end-to-end,
if you don't mind the compute and money as well.
If you are working on a task in which
general pre-training data on the web
confers general advantages to that task,
and so that could be domain knowledge,
it could be general abilities like reasoning,
which is very applicable across many tasks,
which I think is, you know,
the case for medical reasoning as well.
Then I think it makes a lot of sense to build on top of an existing model,
especially if you're sensitive to things like cost or compute,
which most people are these days.
And so, you know, I think on that spectrum between things like prompting and
prompt tuning all the way to like full fine tuning,
I think it largely comes down to this: given an existing pre-trained model,
since it's, I think, a big hurdle for most teams and most people
to train a large-scale pre-trained model from scratch.
The question is, do you prompt it?
Do you prompt tune it?
Do you full fine-tune it?
I think that largely comes down to data.
If you have three to five examples, let's say, then I would prompt it.
If you have maybe 10 or 50 examples, it would either be prompt tuning or fine-tuning.
I think generally in that realm, prompt tuning and fine-tuning perform similarly.
And I would prefer prompt tuning if you're at all sensitive to things like compute or cost.
If you care about the best performance and you have more than 100 examples, then probably fine-tuning is your best bet.
And it's not as expensive as full pre-training if you're doing it with a model that's been pre-trained, of course.
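As a rough summary of that guidance, the decision could be sketched as a simple heuristic. The thresholds below just restate the rules of thumb from the answer above; they are not hard cutoffs.

```python
def choose_adaptation(num_examples: int, cost_sensitive: bool = True) -> str:
    """Rule-of-thumb mapping from labeled-example count to adaptation method,
    assuming you start from an existing pre-trained model."""
    if num_examples <= 5:
        return "few-shot prompting"
    if num_examples <= 50:
        # Prompt tuning and fine-tuning tend to perform similarly here;
        # prefer prompt tuning if compute or cost matters.
        return "prompt tuning" if cost_sensitive else "fine-tuning"
    if num_examples >= 100:
        return "fine-tuning on top of the pre-trained model"
    return "prompt tuning or fine-tuning"

print(choose_adaptation(3))                          # few-shot prompting
print(choose_adaptation(500, cost_sensitive=False))  # fine-tuning on top of ...
```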
When you thought about evaluation of this model, you must have been surveying the landscape for the other sort of medical, probably, you know, science-specific and then medical-specific models.
Like, what's out there?
And how did you guys think about evalling and changing eval?
Yeah, absolutely.
And this is not the first work to explore the potential of a large language model in science or biomedicine.
And so I think it's important to acknowledge all the work that's come before us.
What we saw when we first came into this work and tried to understand what other models existed,
what other evaluation has been done, was that one, there was a few exciting works from other teams,
like Galactica or BioGPT and so on that we thought we could learn from and benefit from.
And so that was a really exciting thing to be able to see.
And the second thing we saw was that there was a bit of a shortage of kind of a systematic way of doing evaluation of these models.
And so it didn't feel like there was a systematic way to think about automated evaluation of the clinical knowledge of these models.
So, for example, via multiple choice benchmarks.
There were a few popular benchmarks like the MedQA benchmark, but it varied across papers which benchmarks they were studying.
In some cases, we felt like these benchmarks were not high quality.
And so that was one thing that we saw.
I mean, another thing that we saw, which was more acute, I think, was kind of a lack of detailed human evaluation across many of these works.
And so there was some steps in this direction that we were able to build on.
But I think for the most part, a lot of these models that have already existed didn't have kind of detailed human evaluation given a use case like medical question answering.
And so I think that to us was a significant limitation as we think about, you know, the real world potential of these models.
because, you know, when it comes down to it, we have to make sure that it actually serves humans
and is beneficial to humans.
And so for us, that was like a significant motivator for the Med-PaLM work being relatively
evaluation-forward and thinking carefully about human evaluation with both physicians and
lay people.
How do you think about where that bar is?
Because I think it's one of those things that, you know, having started a medical-centric
company before, on the one hand, you really want to be cautious in terms of making sure the information
you provide back to people is accurate, right? And so when I was working actively on the operating side of Color, we spent a lot
of time agonizing over ensuring that the results provided back to patients were as accurate
as possible, particularly in the context of anything that had to do with, you know, core genetic or other
information. The flip side of it is, you know, I remember I took my son to the emergency room when
he was younger and the doctor said, I'm going to go research this case and I'll be right back.
and I had to go ask him another follow-up question
and I go around the corner
and he's in his cube
literally Googling the symptoms, right?
And so it wasn't like he had some deep,
accurate source.
He was just making things up, right?
Effectively, right?
I mean, I've seen Google results
and you're kind of clicking around.
He was just clicking around.
I was like, oh my gosh.
And I could see the query, right?
So I knew he was looking at my kids' symptoms.
He had no idea, right?
And so there's this bar from,
hey, it needs to be incredibly accurate
and correct on through to,
well, the state of the art
actually isn't that amazing
in many circumstances.
And so how do you think about the right quality bar for these sorts of things in terms of real use application or practice?
That's an amazing, great question.
I think, as you said, there's two competing forces here, right?
Obviously, the stakes are high in the medical setting, and counterfactually, you want to make sure that the information you provide versus the information they would have otherwise gotten is actually high quality.
And so you have to be, you know, very, very careful as you think about, you know, any informational use case for these models.
At the same time, I think it's useful to recognize that people are searching for health information online and indecision is a decision as well.
And so, you know, a large percentage, roughly 10% of searches on the internet are for health information.
And some of these are coming from physicians themselves, as you mentioned.
And so, you know, I think that there is a responsibility to think about how to shepherd this technology carefully and safely towards that real world impact for patient health information.
And I think that is crucial as well.
And I think one thing that has been missing from our work so far is really grounded evaluations in a specific use case in a workflow to show that there is a benefit, both in terms of safety in the short term and in terms of kind of long-term patient outcomes as well.
And so, you know, I think that could be a health informational use case.
It could be other clinical workflows.
But, you know, I think that's one thing that we have to really make sure we do and, you know, are careful about before any kind of real-world use case here.
Yeah, that makes sense. Yeah, it definitely feels like in the medical world,
the importance of safety is paramount, and at the same time, there's very little cost-benefit
analysis being done anymore. And so there's, you know, interviews with Jansen and other sort
of giants of the industry basically saying, you know, we need, we need to think about the
benefit side, not just the cost side or the safety side. And what you're working on, I think,
is so important in terms of, if you think of the really big areas of societal impact, it's
what you folks are doing, right? If you could provide amazing health equity globally for everyone
in terms of this information.
How powerful is that?
I mean, that's fundamental.
And maybe education is the other one, right?
And it feels like AI really has a promise in both of these areas.
And so I always worry about, you know, how do you make sure that this can get to market
because it's so valuable, but there's going to be all these regulatory or safety obstacles
that in some cases are merited, but in some cases may actually prevent the emergence
of really important applications.
So I think it's awesome that you folks are working on all this and are being so thoughtful
about it.
How do you think about what workflows this is going to be most useful for?
So, you know, if you look at a lot of the bio or biomedical AI companies,
for some reason, they keep doing drug development.
A, why do you think that is?
Because this seems like such an important part of healthcare
and probably the bigger driver of healthcare efficacy.
And so, A, why is everybody just going and building another protein folding model
or, you know, molecular company?
And B, where do you think are the best applications of what you've been working on?
Yeah, these are great questions.
I think on the drug discovery front,
There's a bit of a playbook here, which any new company here looking for some revenue in the short term can follow.
And that could be a safe option.
Like there are, for example, existing AI augmented pipelines for doing things like given small molecule chemistry, predicting things like absorption or toxicity.
And it's kind of relatively easy to see that some of the more modern models, if placed into these pipelines, could perform better.
And so there's like a relatively safe bet there.
And so I think that probably accounts for a lot of the popularity of that as a use case.
I totally agree that like there is a kind of a chance to go for the jugular here in terms of health information, for example.
And so, you know, I think this is something that is going to be crucial.
But I think it is also something where a lot of the big players are more risk-averse.
And so, you know, the people who gate access to health information or provide access to health information are maybe not
thinking super counterfactually about the positive benefits of things, and they're thinking more
about the risks. And so, you know, I think that is also, you know, a concern that's been slowing
folks down both in terms of big companies and smaller companies. And, you know, I think
there is an opportunity to kind of think more about that and what that could look like. And
I think the company that, you know, gets that right or, you know, the set of companies that get
that right, I think will also have a seat at the conversation when it comes to policy and
regulation and things like that. And so they have the chance to shape, you know, what this looks
like for the future. And so, you know, I think that's going to be potentially quite impactful.
Yeah, it seems very exciting because if you look at healthcare, it's 20% of GDP. Pharmaceuticals are
about 20% of that. And then drug development is a fraction of that, right? So really what you folks
are focused on in terms of the types of models that you're building is at least, you know,
16% of GDP. You know, maybe it's more than that if some of the pharma stuff is, you know, more
clinical decision-making around who gets a certain pharmaceutical. Do you view this as a technology
that's initially a physician's assistant? Do you view it as something that helps with adjudication
of medical claims and billing? Like, there's so many places where this can sort of insert.
I'm just sort of curious, like, you know, where do you think you'll see this technology popping up
first? Yeah, I think we're already starting to see it in some clinical workflows when it
comes to documentation and billing. I think there are a lot of companies and people thinking about
taking models like GPT-4 and applying them in that setting. And I think that that is definitely
going to be something. I think that is also going to be something where players like Epic are going
to be able to partner with existing models and I think potentially deliver real value there.
And I think that's very exciting. I think that's something that also general domain models
will be potentially quite good at as well.
I think where there might be more of a need for specialized models
is when it comes down to kind of higher stakes workflows,
and I think that might look in the short term more like a physician's assistant.
And so imagine, for example, an agent that can work with the radiologist,
help them interpret a scan and leverage the benefits of AI to kind of help contextualize,
you know, a patient's medical record or any previous scans,
or different angles of scans that a patient has had,
to help a radiologist write a more accurate report.
I think that's something, that's the kind of thing,
which I think, you know, is in the sweet spot of both feasible today,
you know, leverages the benefits of AI
in terms of taking an additional context
and, you know, potential multimodality and all these kinds of things.
And it's also potentially in a sweet spot with respect to regulation as well.
And so I think that's, you know, something that could happen
in the short to medium term.
How do you architect a model or workflow in this context
to deal with things like HIPAA or patient privacy?
So I feel like healthcare data is unique from the context of what you're allowed to do in terms of who you send it to with what permissions from users.
So is it just you have to get the right user opt in and then it's fine?
Or is there extra work that you need to do in terms of blinding data or doing other things relative to the prompts or queries you're sending in?
Yeah, it's a great question.
I mean, I think this is something that people are just trying right now and just seeing what happens.
And it's kind of interesting.
People are just putting patient information into GPT-4,
and sometimes they're redacting information and all these kinds of things.
I mean, I think the ideal way to do this obviously is more privacy forward, I think,
in terms of building trust with relevant stakeholders and all these kinds of things.
You know, I think a starting point is just models that are able to automatically redact
very sensitive information from, you know, being sent further down a pipeline.
I think that's something that's like a very low-hanging fruit that, you know, many people can do.
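As an illustration of that low-hanging fruit, a first pass at redaction can be as simple as pattern matching on a few obvious identifiers before anything is sent to a third-party model. The regexes below are hypothetical; real de-identification pipelines cover many more identifier types (names, addresses, and so on) and typically rely on trained models rather than hand-written patterns.

```python
import re

# Hypothetical patterns for a few obvious identifiers; not a complete or
# HIPAA Safe Harbor-grade de-identification scheme.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "MRN":   re.compile(r"\bMRN[:#\s]*\d{6,10}\b", re.IGNORECASE),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with a bracketed label before the text
    is sent further down a pipeline."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt seen 03/14/2023, MRN: 00482913, call 555-201-3344 with results."
print(redact(note))  # Pt seen [DATE], [MRN], call [PHONE] with results.
```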
There's also potential for HIPAA compliance within an organization.
So I know some organizations working in the space are partially HIPAA compliant or are kind of trying to make that claim.
And I think that's something that's useful.
And I think that's something that we should work towards as well.
You know, I think in the longer run, I think a lot of these concerns, I think, are actually unclear in terms of how things will work out.
Like, I think there is kind of a bigger question about software of unknown provenance and how that will
be used and regulated, you know, in the future. There could be some kind of situation in
which, like, these things actually end up being very hard to scale up and apply in the real
world for, you know, high-stakes settings. But I think we'll probably end up with a scenario
where it'll become obvious that we need to and that we must, and that doing so will improve
patient outcomes. And so then I think it'll be time to have, like, a serious conversation
about what regulating these models and making sure privacy concerns are mitigated looks like.
And I think, you know, I think we have yet to have that discussion.
Yeah, HIPAA is kind of interesting from that context. It was an incredibly well-intentioned
piece of legislation, but the flip side of it is, it's really backfired in all sorts of ways
in terms of actual patient good. And you see that sometimes as well in terms of some of the
things that as you sign up for a clinical trial or other that you can actually do with your own data
where sometimes you're constrained from accessing it. I know of one example where somebody
had brain cancer, a glioblastoma, and he was a researcher at MIT. And he participated
in a small clinical trial, and then they were unable because of compliance to give him
his own data so that he could try and discover drugs against his own glioblastoma, his own brain
cancer, right? And so sometimes you see these very well-intentioned approaches in terms of
the protocols around a clinical trial or HIPAA or other things that are very well-intentioned
in terms of what they want to do, but then sometimes they may backfire as you start to enter the
modern data world, since I think that legislation is now almost 30 years old, right? And so I just
think it was set up for a world that's very different from what we have now in terms of
the liquidity and fluency of your ability to interact with information and, you know,
patients driving their own diagnoses and things like that. So, you know, my hope is that some
of these things get rebalanced in the AI world, since it could be so valuable to things like
what you're doing. I was just going to say that is the status quo. And you've also worked on the
areas of, you know, privacy-preserving machine learning and federated learning. Those areas have
broadly taken a backseat to, let's say, like, scaling and aligning these more centralized
models. Like, do you see a place for that technology in this field?
Yeah, that's a great question. So, I mean, as I mentioned before, the first couple of years
in my career were really thinking more about privacy preserving machine learning and, you know,
federated learning and scaling that up and coming up with new algorithms that can learn new
things without, you know, sending all the data to a centralized place. And so in a lot of ways,
that has a very, very natural fit with this setting.
And part of my motivation, when I first started working on the setting,
was bringing in a lot of that expertise and bringing it into that setting.
My sense is that I think one hesitation I have there is that I think a lot of,
you know, the most impactful work that's going to happen in this setting
is going to happen with the largest and most capable models,
at least for the next few years, it seems like.
And I think that, like, one thing that we're seeing is that even without
any patient health information put into these models.
Like, for example, Med-PaLM and Med-PaLM 2 are trained without any patient health information.
They're just kind of taking all the knowledge of PaLM and PaLM 2 and then just kind
of aligning them and making them behave in a certain way.
I think in the short term, there is this kind of thing that we see where models like GPT-4 and
Med-PaLM and Med-PaLM 2 are able to do, you know, surprisingly well without any patient health
information.
And so it seems like we can get fairly far with that.
I mean, in the longer run, I do think that, like, you know, coming back to that question of
data and how you think about how to train a model depending on how much data you have
and how relevant that data is, the ideal thing would be to have access
to all of the data, but in a privacy preserving way, in a way that people are in control
of their data, are able to revoke access to that data and are able to kind of benefit
from that shared understanding of their data. And so that's the kind of the ideal world.
But I think there are, like, real-world obstacles to
doing federated learning on health data, which actually kind of increase the activation energy
to the point where, in the next few years, I doubt that, like, the biggest advances are going
to come from using federated learning approaches. But I think there are kind of intermediate solutions,
which people often sometimes refer to as federated, but maybe are not technically federated,
which are things like trusted execution environments or other environments in which models are running,
but, you know, the folks at Google don't have access to the data or direct access
to the models. And so there's this ability to kind of silo any patient
health information in the future potentially, or, you know, any other data that's quite
sensitive, from engineers or other folks at, you know, big companies or small companies.
Yeah, going back to perhaps more promising near-term areas of research, you've had this idea
of building a medical assistant as a sort of laboratory for safety and alignment research.
Can you talk about that?
Yeah, absolutely.
I mean, this is a lot of what got me thinking about the setting,
especially coming into the setting as somebody who, you know,
didn't have much of a medical background in terms of expertise.
I was really thinking about, you know,
what are the big things that I could do to help shape the trajectory of AI
or nudge it in a more beneficial direction?
And thinking about AI safety seriously in terms of both short-term and longer-term risks,
I think was important to me.
And so, you know, one thing I've been,
become more convinced about over time is this idea that, you know, many organizations right
now, Google, deep mind, anthropic, open AI are right now looking at the idea of a general
chat assistant and kind of instead of like doing alignment research in a vacuum, are looking
at that setting as a way in which we can think about kind of better refining these models and
better aligning them to human values. I think there's a good chance that this setting, this medical
setting, for example, medical question answering, or maybe more broadly, I think ends up
being a better scenario to study concerns about technical safety and to mitigate concerns
like misalignment with human values or hallucinations or things like that. And so, I mean, I think
this comes down to things like making sure the incentives are aligned with respect to releasing
products. So, for example, I think if any organization wants to release products in the space,
it actually needs to work on these problems more so than, I think, ChatGPT. I think it also comes
down to kind of the stakes of the setting. I think everybody feels like the stakes of the setting
are high enough that everybody feels like these issues are especially important and there's no
debate about that. And I think there's also like some more subtle technical points like I think
one issue that, you know, alignment researchers are now working on is the idea of scalable
oversight, which means how do you give human feedback to a model when human feedback might not
be super well-informed or it might be unreliable because AI capabilities are starting to reach
human level.
And so when we start to get to that point, things like RLHF start to fail and it starts to
become unclear what to do.
And so I actually think the medical setting is a scenario in which this is already more obvious.
So you're already in a setting in which you need experts to be able to evaluate answers.
And one thing we're seeing with Med-PaLM 2, as we get closer to physician-level performance on
medical question answering, is that it's hard to tell the difference anymore. It's hard to tell
the difference between different models. It's hard to tell the difference between models and
physicians. And when you're at that point where the oversight is uninformed, then it becomes very
tricky to think about aligning to human values. And so that problem is super well motivated in the
setting. And that's something I'm very excited about. What do you think is a solution to that?
Because if you look at the gaming analog, which is probably a bad analog here, right? Once
machines were better than humans at things like Go or chess or other things, people
started learning off of the things that the machines were doing that were unique or creative
or different or the problem solving was very different. And if we really want this technology
to be incredibly valuable for medical applications, in some cases, we may end up with these
suggestions that will really work well, but that to your point, people may misinterpret or
misunderstand. And so how do you think about evaluating things when the AI will be better than
a person at medical adjudication or better than an expert?
Yeah, this is, I mean, this is, you know, a really, really interesting question.
I don't think I have all the answers, but I think there are approaches that, you know,
people at Google and other organizations have been looking at.
I mean, I think a couple ideas here that I think are interesting and useful.
One is the idea of kind of self-refinement or self-critique of these models.
And so this is the idea that these models can take their own responses
and give critiques, often guided with human feedback.
And so that's the place where human feedback comes in.
In some of these techniques there's no human feedback,
and in that case I'm not sure it's as valuable.
But the model gives critiques guided by human feedback and then uses those to produce better answers.
That's one line of approaches.
I think a second line of approaches is around debate.
And so the idea here is that it's easier for a human to judge a debate between two different
answers than to judge the answer itself.
And so the kind of standard for verification is a bit lower here.
And so there's that ability for humans to be able to judge a response that potentially they wouldn't be able to judge otherwise via things like debate.
And so that's another thing.
I mean, another thing, which people are working on as well, is thinking about how we can take AIs that are less capable and use them to kind of supervise other AIs that are more capable.
And so this is kind of the motivation.
I mean, this is partly the motivation of RLHF as well, even though it's about human feedback.
It's about training a reward model that takes into account human feedback, and then at that point, it's AI feedback from then on: you use your RL algorithm, and you get rewards from your reward model.
RLAIF, or constitutional AI, you know, kind of builds on that idea, but there's also limitations to that approach as well.
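As a very rough illustration of that reward-model idea (humans express preferences once, a model is fit to those preferences, and the reward signal is automated from then on), here is a toy sketch. The scoring heuristic is a deliberately simple stand-in, not how any lab actually trains reward models.

```python
def fit_toy_reward_model(preference_pairs):
    """preference_pairs: list of (preferred_answer, rejected_answer) from humans.
    Returns a toy scoring function that favors wording seen in preferred answers."""
    preferred_vocab = set()
    for preferred, _rejected in preference_pairs:
        preferred_vocab.update(preferred.lower().split())

    def reward(answer: str) -> float:
        words = answer.lower().split()
        return sum(w in preferred_vocab for w in words) / max(len(words), 1)

    return reward

def pick_best(candidates, reward):
    # Stand-in for the RL step: once the reward model exists, feedback on new
    # candidates comes from it rather than from additional human labels.
    return max(candidates, key=reward)

rm = fit_toy_reward_model([("cite a source and hedge uncertainty", "just guess confidently")])
print(pick_best(["guess confidently", "hedge uncertainty and cite a source"], rm))
```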
I mean, I think if you ask, you know, researchers across all these organizations, have we solved this problem?
Do we know what we're supposed to do?
I think most of them would say no.
It seems like a pretty consequential problem, so I'm excited for more folks to work on it.
Yeah, one thing that I feel like would also sort of be generated as a side effect of all this is just you end up with these really interesting closed loop data sets over time that may be unique outside of an EMR or somewhere else or a really robust medical record system.
Because if you have effectively physicians assistant or something else, and then you have the endpoint of what happened based on treatment, you actually have a really interesting retrospective data mining training set.
Yeah, I mean, I think that's like another opportunity for feedback for these models, which could have a huge impact on the world.
Yeah, it'll actually be data-driven medicine, which I think, you know, sometimes happens, but sometimes doesn't.
So it's very exciting.
I guess one more question is just, you know, there's amazing potential here.
And if I look at the history of medical technology, you know, in the 1970s, there was something known as the MYCIN project at Stanford, where they built an expert system, a computer program of its day that was sort of a precursor to some other things that eventually happened in AI.
they had an expert system that outperformed all of Stanford's medical staff on the prediction
of the infectious disease that somebody had. So 40 years ago almost, we had a machine that
outperformed people in terms of diagnosis, but it never got adopted. And so often when I look at
medical technologies, there's this almost like anti-adoption curve. In some cases, for the things
that may be most impactful, how has the medical field embraced or not embraced these AI
models? Is it different this time? Are people excited about it? Are they not excited? Does it really
depend on the type of physician? I'm just sort of curious, like, what the reaction has been from the
medical community to date. Absolutely. That's a really great question. You know, I think when we
started this brain moonshot, as we call it within Google, that was actually our motivation.
It was really to think about the fact that these models kind of already existed,
and there was this opportunity to catalyze the medical AI community to really think about them
carefully, think about the promise there, and to catalyze the AI community to think about
how we can resolve any remaining limitations that would prevent real-world uptake.
And so this was really our goal.
And I think when we started this, there was much less conversation about the potential
for large language models and foundation models for healthcare.
And I think, I mean, partly because of, I think largely also because of other work that's
gone on, you know, with GPT-4 and excitement around that, I think there's much, much more
conversation about, you know, how these models can be used in the setting in a productive way.
I think that's really, really exciting. And I think there's a lot of optimism, I see,
but there's also a lot of justified concern about, you know, the potential limitations of
these models and how we can, how we can get over them. Personally, I mean, from what I've seen,
from giving talks to different groups and chatting with different folks and different stakeholders,
I think there's like a, you know, a widely held optimism about this technology and about the
potential. But I think there's also kind of a little
bit of fear that I think, you know, people have seen in other domains, like, I think programmers
often feel a little bit of fear when they see GPT-4, for example. And I think it's not necessarily
a fear that, like, jobs will be replaced in the short term or things like that, but it's more
of a fear of, like, look how fast things are moving. This is nuts. Like, I think about
just the improvement from Med-PaLM 1 to GPT-4 to Med-PaLM 2 in three months. Like, it's absolutely
crazy. And I think we, you know, it's definitely an inflection point for AI, as you guys know. And I think
it's definitely a good time to think about, you know, what are the most important problems we need to solve versus like getting caught up in the hype wave and, you know, forgetting to solve the most important problems as well.
I think, back to Elad's sort of point earlier, thinking about the actual benefits of these technologies at scale, if adopted, even at human and at some, you know, defined superhuman level, and whether we should come to some sort of agreement as a democratic society
about what eval looks like, is really important, in that if you just think about what the status
quo is for somebody who has a complex case and a median background in America, what do they
know about the error matrix of their doctor and what, you know, in a field that's also
advancing in parallel to AI, like the specific rare condition that they have, it's not super
encouraging, right? And so in terms of leverage for a field where the status quo is not
sufficient, not as a comment on, you know, the class of physicians and researchers, but just in
terms of the quality of care that we want to be able to offer every person, it seems like
we want to set a reasonable safety case, not an unlimited safety case, right? Which is I think
is one of the things that has held back other sort of mission critical AI applications in the
past. Maybe on that note, like, one last ask for you in terms of encouraging some optimism,
you know, you're working on the state of the art in this field and thinking about the
barriers to the applied use, like five years from now, like, how do you hope we are using
large language models in the medical field? Yeah, I guess I think about this in two broad buckets.
I think there are two broad types of things that we can do for large language models in the
medical field. I think the first is increasing the standard of care very broadly. And so that
looks a lot like, you know, increasing access to health information, providing assistance to
physicians. So the radiology example I gave earlier, potentially clinical decision support,
like double checking a doctor's decision or quality assurance for a radiologist's report. So if,
you know, a radiologist is dictating a report, and they say "no pleural effusion seen," but then it's written
down as "pleural effusion seen," then maybe an AI double-checks that and just makes sure that's
what was intended.
I think augmenting telemedicine, I think, is kind of a short-term opportunity that I think
in the next five years is very achievable.
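The report QA idea mentioned above, catching a dropped negation like "no pleural effusion seen" becoming "pleural effusion seen," could be wired up as simply as the sketch below, where `generate` is a hypothetical stand-in for whatever instruction-following model is available. A real deployment would of course need far more careful evaluation than this.

```python
def check_report_consistency(dictated_finding: str, written_report: str, generate) -> str:
    """Ask an LLM whether the written report preserves the dictated finding,
    including negation. `generate` is a placeholder for any text-generation call."""
    prompt = (
        "You are assisting with radiology report quality assurance.\n"
        f'Dictated finding: "{dictated_finding}"\n'
        f'Written report: "{written_report}"\n'
        "Do the two statements agree on the finding, including any negation? "
        "Answer CONSISTENT or INCONSISTENT, then explain in one sentence."
    )
    return generate(prompt)

# Example (with some hypothetical generate() backed by a model):
# check_report_consistency("no pleural effusion seen", "pleural effusion seen", generate)
# should come back INCONSISTENT, flagging the report for human review before sign-off.
```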
I think the other big bucket of things that is very much achievable is augmenting scientific
workflows.
And I think this could be a longer-term thing than five years, but I think there's also
short-term things that we can do as well.
So thinking about looking at correlations across modalities and existing data to find novel
biomarkers for existing diseases that we know about, or kind of using large language models as
research assistants.
So I think there's already a lot of work on the idea of literature search and augmenting
literature search with large language models.
I think there's a lot of opportunity there.
And that goes a little bit beyond, you know, what Med-PaLM is likely going to do.
But I think that's something that I think, you know, it's going to be really promising with
respect to the future of AI.
Because I think in the long term, when things go really well with AI,
it's going to be because we've solved a lot of the most pressing scientific problems of today.
And I think that's going to be because it augmented scientists.
It helped scientists.
It helped us figure out what are the things that we're missing.
And I think there's a lot of potential there.
So I'm also really excited about that in the long term.
Awesome.
Wrapping up, is there anything else you think we should touch on?
Yeah, absolutely.
I mean, I think for real world uptake of these models, there are a few large language model capabilities,
in some cases ones that already exist, but we need to figure out the right way to do them.
And I think a few of them are just, you know, multi-modality, which is something that we were working on.
We kind of previewed it last week at I/O.
And grounding in authoritative sources, I think, is important as well, thinking about how these models can use Toolformer-like approaches to, for example, query authoritative medical information like a human would, but potentially better.
I think that's also, you know, one way of getting around the risk-averseness that you see in this area with respect to health information.
If you're able to attribute information to an authoritative source,
I think that has been something that has progressed this area in big companies before.
And so where, for example, Google is doing that with health information is largely because
it can attribute things to the Mayo Clinic and other organizations.
And so I think that's going to be really important for moving this forward.
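The grounding idea could look something like the sketch below: retrieve passages from a vetted corpus and ask the model to answer only from those passages, with citations. Both `search_vetted_corpus` and `generate` are hypothetical stand-ins rather than any real API.

```python
def grounded_answer(question: str, search_vetted_corpus, generate) -> str:
    """Retrieve passages from an authoritative corpus and answer only from them,
    citing sources so that statements can be attributed and checked."""
    passages = search_vetted_corpus(question, top_k=3)  # e.g. curated clinical guidance
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the numbered sources below, and cite them "
        "like [1]. If the sources are insufficient, say you cannot answer.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```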
I think also solid research, thinking about better ways to improve the ways we are taking in human
feedback. I think, you know, the jury's still out with respect to how to best, you know,
collect human feedback even. I think people are still debating things like whether or not,
you know, pairwise comparisons versus rewrites are the best things to do. And, you know,
that's a valuable thing to think about. I think another thing to think about is how to actually
use that human feedback in the most valuable way, especially given all the scalable oversight
concerns that you guys mentioned. I think that's, you know, a significant limitation of
Med-PaLM as it is today. I think there's a lot of exciting things to do. And I think a lot of these
questions are like foundational questions for AI more broadly, but, you know, become more
acute and more relevant in the setting.
It's been great to have you on No Priors. Thanks for doing this.
Yeah, thanks so much for joining. Thanks, guys.