No Priors: Artificial Intelligence | Technology | Startups - The AI Will See You Now: Exploring Biomedical AI and Google’s Med-PaLM 2 With Karan Singhal

Episode Date: May 18, 2023

What if AI could revolutionize healthcare with large language models? Sarah and Elad welcome Karan Singhal, Staff Software Engineer at Google Research, who specializes in medical AI and the development of Med-PaLM 2. On this episode, Karan emphasizes the importance of safety in medical AI applications and how language models like Med-PaLM 2 have the potential to augment scientific workflows and transform the standard of care. Other topics include the best workflows for AI integration, the potential impact of AI on drug discovery, how AI can serve as a physician's assistant, and how privacy-preserving machine learning and federated learning can protect patient data while pushing the boundaries of medical innovation.

No Priors is now on YouTube! Subscribe to the channel on YouTube and like this episode.

Show Links:
May 10, 2023: PaLM 2 Announcement
April 13, 2023: A Responsible Path to Generative AI in Healthcare
March 31, 2023: Scientific American article on Med-PaLM
February 28, 2023: The Economist article on Med-PaLM
KaranSinghal.com

Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @thekaransinghal

Show Notes:
[00:22] - Google's medical AI development
[08:57] - Medical language models and Med-PaLM 2 improvements
[18:18] - Safety, cost/benefit decisions, drug discovery, health information, AI applications, and AI as a physician's assistant
[24:51] - Privacy concerns: HIPAA's implications, privacy-preserving machine learning, and advances in GPT-4 and Med-PaLM 2
[37:43] - Large language models in healthcare and short/long-term use

Transcript
Starting point is 00:00:00 Welcome to No Priors. Today we're speaking with Karan Singhal, a researcher at Google where he is a leader on medical AI, specifically on Med-PaLM 2, where he and a team are working on a responsible path to generative AI in health care. Google just announced the launch of its next generation language model, PaLM 2, with improved multilingual, reasoning, and coding capabilities, which is behind Med-PaLM 2. So it's a great time to be speaking with Karan about everything he and his team are working on. Karan, welcome to No Priors.
Starting point is 00:00:32 Hey, guys. So you've been working in this field for a long time. Tell us about how you ended up working on medical AI at Google. I think you built a fake news detector using AI as a 19-year-old. Yeah, that was one of my first AI projects. I really got into AI thinking about how it could be used in socially responsible ways. And for me, I was thinking around the time of the 2016 election, maybe a little bit naively, that AI-based solutions could be a bit of help for, you know,
Starting point is 00:01:04 things like misinformation and detecting that. I think in the longer run, I mean, I've thought of that as kind of a more naive project. And I think in the longer run, I've been thinking more about, you know, how it can help shape the trajectory of AI to be more beneficial and more broadly. And I think for me, thinking about the medical setting has been motivated largely by thinking about the fact that, you know, it's a great place to think about concerns around safety, reducing hallucination and misinformation as well here, you know, thinking about how we can produce medical question answers that are less likely to be harmful and all these kinds of things. And, you know, that motivation, I think, has driven us to this point where really going for
Starting point is 00:01:43 the jugular in terms of thinking about how to train these models, make them better in the setting. And so very excited about that kind of work. Have you been working on the medical domain your entire? time with Google? No. For me, this is just something I've gotten to the last year and half. So I've been new to it. I've been learning from an excellent team, and it's been amazing journey so far. What else has been the most interesting in your work at Google so far? Yeah, I started working out in representation learning and federated learning. So this is kind of the technology, representation learning in particular is kind of the technology underlying
Starting point is 00:02:16 a lot of the deep neural networks of today, including GP3, GPD4, and so on. And so this is largely about learning representations of text, of images, of other modalities, such that you can efficiently encode them, you can learn from them in the future, you can generalize the new text and images and so on. So work for this really started, you know, back in the beginning of the deep learning era, like in 2013, with convolutional neural networks and scaling those up and work to VEC around 2015 and glove and all these things. And I think since then, you know, we've been working on technologies around self-supervised learning, around doing that in a privacy-preserving way.
Starting point is 00:02:57 And so, you know, after a couple of years of working on that at Google, I had the opportunity to kind of quickly grow and start to lead a team. I kind of got to the point where I was thinking, like, okay, I've upskilled in a lot of ways. I've gotten to the point where I can mentor many other researchers in a lot of ways. And now it's a great time to be thinking about my next thing and, you know, going for something ambitious in terms of shaping the trajectory of AI. And so, you know, about a year and a half,
Starting point is 00:03:20 ago, a few of us had the idea to think about this medical setting as kind of a setting in which these concerns are especially important, and there was a ripe opportunity to think about this paradigm of foundation models and medical AI. And so within Google, we had the opportunity to pitch what's called a Brain Moonshot, which is kind of like an internal incubator program for ambitious research projects. And this is, you know, a lot of cool research projects that you've heard of from Google have eventually come out of this program. As we pitched that, we got accepted and funded. We got the ability to kind of get a bunch of compute to bring other folks on board with the sponsorship of a bunch of leaders. And our first thing together was really
Starting point is 00:03:57 MedPom. And so that was a really amazing thing for us to be able to work on together. Can you talk a little bit about Palm and how that's related to MedPalm and what Palm is to begin with and then how MedPalm is different? Yeah, absolutely. I mean, so the original MedPom work built on this model called Palm, which stands for Pathways language model. And so this is really a infrastructure that Google has built to be able to scale up large language model training, that is kind of Google-wide. And so the first Palm model was released in 2022, which was kind of this 540B decoder-only transformer model at the time, the largest densely activated model.
Starting point is 00:04:35 And, you know, it kind of realized these breakthrough achievements in code, in multilingual capabilities, in reasoning. And so, you know, I think a lot of the work. with respect to kind of improving benchmarks specifically that we're seeing with like POM, MetPOM, GPD4 recently, I think all comes down to a lot of the improvements that were made during Palm, during the training of POM. And so, you know, shortly after POM, there was this Minerva work where maybe like a few months after the POM work itself, people were able to show that on STEM benchmarks, there was this kind of zero to 100 or zero to 60 at least effect where, you know, you went from
Starting point is 00:05:16 random chance to, you know, solid performance across a bunch of benchmarks. And that laid the foundation for a lot of the work that Jason Way and others have had on, you know, thinking about immersion abilities of large language models. And so for us, that was part of the motivation for looking at multiple choice benchmarks as well for MedPom. And so for MedPom in particular, what we did was we took MetPom, this kind of general large language model trained on WebScale data, and then kind of further aligned it to the medical domain, we evaluated it base, but also thought about, like, given its limitations in long-form
Starting point is 00:05:51 medical question answering, thinking about things like safety, factuality, low likelihood of, you know, outputting an answer with bias, what do we need to do to kind of better align that model with this domain? And so really met Palm was an attempt to do that. Yeah, so basically it sounds like you started off with Palm and Palm was tested against a bunch of different types of tests, right? And so you could take the MCAT or you could take other types of effectively tests for professional accredation or for knowledge understanding. And then it sounds like you then said, hey, this seems really interesting, right? We're starting to get really good performance here. And so can we do something that's in the medical domain specifically?
Starting point is 00:06:29 And that was MedPalm. And so how did you do that alignment that you mentioned? Was it some form of RLHF? Was it some other form of fine tuning? Was it how you train the model to begin with? Like, what was the difference in terms of MedPalm versus Palm? Yeah, absolutely. I mean, when we tried evaluating POM in the medical setting, we noticed there's out of box on multiple choice questions performing pretty well. And when we took a variation of POM, the FlanPom model, which was, again, worked from Jason Way and team, you know, this is an instruction to a model that's been trained to follow instructions better. You know, again, it was able to perform quite well out of the box. And this was the first model that was able to perform above the pass mark on the METQA set of US Emily style exam.
Starting point is 00:07:11 questions. But then what we notice is that when we evaluated it on long-form medical question answering, like actually getting the model to generate response, there was a lot of limitations. And we compared that to clinician performance, it actually didn't do super well. And so really, that was the motivation for that MedPom specific alignment. And so what we did there was really thinking about instruction prompt tuning, which was this technique, which we explored in that MedPom paper, which is kind of a data-efficient technique, a technique that doesn't require too much data, to work because, you know, getting labels from doctors is expensive, which took a bunch of expert demonstrations of good behavior from doctors and then use that to tune the parameters of
Starting point is 00:07:51 the model and do that in a way that's a little bit more learned than prompting, but also less expensive than full fine-tuning. And so you did that. And then I guess if you start looking at this now shift from palm to palm two and from med palm to med palm two, did you basically just reproduce that same approach for med palm two or did you do anything different there? Yeah, and this is the work of many folks other than myself, so just to preface it with that. I mean, I think a few things that have been important have been, one is better objectives for true training and using something like a mixture of objectives, training objective. And so that's been something that's, you know, been crucial. And so this is work that started with UL2, a paper that was released also last year.
Starting point is 00:08:34 And then two other things that ended up being super important. One is following the optimal scaling laws that were empirically evaluated again in this work. And I think there's been a few works that have tried to do this from Open AI and DeepMind. And again, this work tried to understand in this context, what are the optimal scaling laws with respect to data and compute, and how do you trade those things off? And so this paper, again, found something similar to the chinchilla paper, which was that the total amount of data being used for these models was a relatively low compared to the number of parameters. and that if we wanted to add in more data, we could do so, we could train a better model in a more compute-efficient way.
Starting point is 00:09:11 So this model also did that. So that's an important improvement as well. And the third thing was kind of improvements in the data that were used to train the model. And so this especially focused on multilingual data, including more multilingual data and more code data in a bunch of different coding languages as well. Maybe just zooming out a little bit in terms of when you might apply
Starting point is 00:09:34 some of these different techniques to align a model to a specific domain. Do you have a framework in your mind for why you might do full pre-training from scratch, why you might do fine-tuning, why you might do a more efficient form of fine-tuning and when you can just get away with prompt tuning or prompting? How do you think about that? Yeah, this is a great question. I think it really comes down to the data that's available, both in quantity and relevance to a particular topic.
Starting point is 00:10:03 I think if you have an infant supply of data that's relevant for the specific problem that you're trying to solve then probably the best thing to do is pre-train everything from scratch and do everything N-10 and if you don't mind compute and money as well. If you are working on a task in which
Starting point is 00:10:18 general pre-training data in the web confers general advantages to that task, and so that could be domain knowledge, it could be general abilities like reasoning, which is very applicable across many tasks, which I think is, you know, the case for medical reasoning as well. Then I think it makes a lot of sense to build on top of an existing model,
Starting point is 00:10:39 especially if you're sensitive to things like cost or compute, which most people are these days. And so, you know, I think on that spectrum between things like prompting and prompt tuning all the way to like full fine tuning, I think it largely comes down to, so given an existing pre-trained model, which is, I think, a big hurdle for most teams and most people. to train a large scale pre-trained model. The question is, do you prompt it?
Starting point is 00:11:07 Do you prompt tune it? Do you full fine-tune it? I think that largely comes down to data. If you have three to five examples, let's say, then I would prompt it. If you have maybe 10 or 50 examples, it would either be prompt tuning or fine-tuning. I think generally in that realm, prompt tuning and fine-tuning perform similarly. And I would prefer prompt tuning if you're at all sensitive to things like compute or cost. If you care about the best performance and you have more than 100 examples, then probably fine-tuning is your best bet.
Starting point is 00:11:39 And it's not as expensive as full pre-training if you're doing it with a model that's been pre-trained, of course. When you thought about evaluation of this model, you must have been surveying the landscape for the other sort of medical, probably, you know, science-specific and then medical-specific models. Like, what's out there? And how did you guys think about evalling and changing eval? Yeah, absolutely. And this is not the first work to explore the potential of a large language model in science or biomedicine. And so I think it's important to acknowledge all the work that's come before us. What we saw when we first came into this work and tried to understand what other models existed,
Starting point is 00:12:19 what other evaluation has been done, was that one, there was a few exciting works from other teams, like Alactica or BioGPT and so on that we thought we could learn from and benefit from. And so that was a really exciting thing to be able to see. And the second thing we saw was that there was a bit of a shortage of kind of a systematic way of doing evaluation of these models. And so it didn't feel like there was a systematic way to think about automated evaluation of the clinical knowledge of these models. So, for example, via multiple choice benchmarks. There were a few popular benchmarks like the MedQA benchmark, but it varied across paper what benchmarks they were studying. In some cases, we felt like these benchmarks were not high quality.
Starting point is 00:13:01 And so that was one thing that we saw. I mean, another thing that we saw, which was more acute, I think, was kind of a lack of detailed human evaluation across many of these works. And so there was some steps in this direction that we were able to build on. But I think for the most part, a lot of these models that have already existed didn't have kind of detailed human evaluation given a use case like medical question answering. And so I think that to us was a significant limitation as we think about, you know, the real world potential of these models. because, you know, when it comes down to it, we have to make sure that it actually serves humans and is beneficial to humans. And so for us, that was like a significant motivator for the Met Palm work being relatively
Starting point is 00:13:42 evaluation forward and thinking carefully about human evaluation with both physicians and late people. How do you think about where that bar is? Because I think it's one of those things that, you know, having started a medical-centric company before, on the one hand, you really want to be cautious in terms of providing people back information. accurate, right? And so when I was working actively on the operating side of color, we spent a lot of time agonizing over the ensuring that there was also provided back to patients were as accurate
Starting point is 00:14:13 as possible, particularly in the context of anything that had to do with, you know, core genetic or other information. The flip side of it is, you know, I remember I took my son to the emergency room when he was younger and the doctor said, I'm going to go research this case and I'll be right back. and I had to go ask him another follow-up question and I go around the corner and he's in his cube literally Googling the symptoms, right? And so it wasn't like he had some deep,
Starting point is 00:14:38 accurate source. He was just making things up, right? Effectively, right? I mean, I've seen Google results and you're kind of clicking around. He was just clicking around. I was like, oh my gosh. And I could see the query, right?
Starting point is 00:14:47 So I knew he was looking at my kids' symptoms. He had no idea, right? And so there's this bar from, hey, it needs to be incredibly accurate and correct on through to, well, the state of the art actually isn't that amazing in many circumstances.
Starting point is 00:14:59 And so how do you think about the right quality bar for these sorts of things in terms of real use application or practice? That's an amazing, great question. I think, as you said, there's two competing forces here, right? Obviously, the stakes are high in the medical setting, and counterfactually, you want to make sure that the information you provide versus the information they would have otherwise gotten is actually high quality. And so that's, you know, very, very careful as you think about, you know, any informational use case for these models. At the same time, I think it's useful to recognize that people are searching for health information online and indecision is a decision as well. And so, you know, a large percentage, roughly 10% of searches on the internet are for health information. And some of these are coming from physicians themselves, as you mentioned in life.
Starting point is 00:15:46 And so, you know, I think that there is a responsibility to think about how to shepherd this technology carefully and safely towards that real world impact for patient health information. And I think that is crucial as well. And I think one thing that has been missing from our work so far is really grounded evaluations in a specific use case in a workflow to show that there is a benefit, both in terms of safety in the short term and in terms of kind of long-term patient outcomes as well. And so, you know, I think that could be a health informational use case. It could be other clinical workflows. But, you know, I think that's one thing that we have to really make sure we do and, you know, are careful about before any kind of real-world use-case. year. Yeah, that makes sense. Yeah, it definitely feels like in the medical world, the importance of safety is paramount, and at the same time, there's very little cost
Starting point is 00:16:34 benefit being done anymore. And so there's, you know, interviews with Jansen and other sort of giants of the industry basically saying, you know, we need, we need to think about the benefit side, not just the cost side or the safety side. And what you're working on, I think, is so important in terms of, if you think of the really big areas of societal impact, it's what you folks are doing, right? If you could provide amazing health equity globally for everyone in terms of this information. How powerful is that? I mean, that's fundamental.
Starting point is 00:16:59 And maybe education is the other one, right? And it feels like AI really has a promise in both of these areas. And so I always worry about, you know, how do you make sure that this can get to market because it's so valuable, but there's going to be all these regulatory or safety obstacles that in some cases are merited, but in some cases may actually prevent the emergence of really important applications. So I think it's awesome that you folks are working on all this and are being so thoughtful about it.
Starting point is 00:17:22 How do you think about what workflows this is going to be most useful for? So, you know, if you look at a lot of the bio or biomedical AI companies, for some reason, they keep doing drug development. A, why do you think that is? Because this seems like such an important part of healthcare and probably the bigger driver of healthcare efficacy. And so, A, why is everybody just going and building another protein folding model or, you know, molecular company?
Starting point is 00:17:48 And B, where do you think are the best applications of what you've been working on? Yeah, these are great questions. I think on the drug discovery front, There's a bit of a playbook here, which any new company here looking for some revenue in the short term can follow. And that could be a safe option. Like there are, for example, existing AI augmented pipelines for doing things like given small molecule chemistry, predicting things like absorption or toxicity. And it's kind of relatively easy to see that some of the more modern models, if placed into these pipelines, could perform better. And so there's like a relatively safe bet there.
Starting point is 00:18:25 And so I think that probably accounts for a lot of the popularity of that as a use case. I totally agree that like there is a kind of a chance to go for the jugular here in terms of health information, for example. And so, you know, I think this is something that is going to be crucial. But I think it is also something where a lot of the big players are more risk-averse. And so, you know, the people who gate access to health information or provide access to health information are also thinking not maybe super counterfactually about the positive benefits of things and they're thinking more about the risks. And so, you know, I think that is also, you know, a concern that's been slowing folks down both in terms of big companies and smaller companies. And, you know, I think
Starting point is 00:19:06 there is an opportunity to kind of think more about that and what that could look like. And I think the company that, you know, gets that right or, you know, the set of companies that get that right, I think we'll also have a seat at the conversation when it comes to policy and regulation and things like that. And so they have the chance to shape, you know, what this looks like for the future. And so, you know, I think that's going to be potentially quite impactful. Yeah, it seems very exciting because if you look at healthcare, it's 20% of GDP. Pharmaceuticals are about 20% of that. And then drug development is a fraction of that, right? So really what you folks are focused on in terms of the types of models that you're building is at least, you know,
Starting point is 00:19:44 16% of GDP. You know, maybe it's more than that if some of the pharma stuff is, you know, more clinical decision-making around who gets a certain pharmaceutical. Do you view this as a technology that's initially a physician's assistant? Do you view it as something that helps with adjudication of medical claims and billing? Like, there's so many places where this can sort of insert. I'm just sort of curious, like, you know, where do you think you'll see this technology popping up first? Yeah, I think we're already starting to see it in some clinical workflows when it comes to documentation and building. I think there are a lot of companies and people thinking about taking models like GPT4 and applying them in that setting. And I think that that is definitely
Starting point is 00:20:26 going to be something. I think that is also going to be something where players like Epic are going to be able to partner with existing models and I think potentially deliver real value there. And I think that's very exciting. I think that's something that also general domain models will be potentially quite good at as well. I think where there might be more of a need for specialized models is when it comes down to kind of higher stakes workflows, and I think that might look in the short term more like a physician's assistant. And so imagine, for example, an agent that can work with the radiologist,
Starting point is 00:21:01 help them interpret a scan and leverage the benefits of AI to kind of help contextualize, you know, a patient's medical record or any previous scans or different angles, or different angles of scans that a patient has had to help a radiologist to write a more accurate report. I think that's something, that's the kind of thing, which I think, you know, is in the sweet spot of both feasible today, you know, leverages the benefits of AI in terms of taking an additional context
Starting point is 00:21:25 and, you know, potential multimodality and all these kinds of things. And it's also potentially in a sweet spot with respect to regulation as well. And so I think that's, you know, something that could happen in the shorter medium to short term. How do you architect a model or workflow in this context? to deal with things like HIPAA or patient privacy. So I feel like healthcare data is unique from the context of what you're allowed to do in terms of who you send it to with what permissions from users. So is it just you have to get the right user opt in and then it's fine?
Starting point is 00:21:53 Or is there extra work that you need to do in terms of blinding data or doing other things relative to the prompts or queries you're sending in? Yeah, it's a great question. I mean, I think this is something that people are just trying right now and just seeing what happens. And it's kind of interesting. People are just putting in patient information into GPD. for, sometimes they're redacting information and all these kinds of things. I mean, I think the ideal way to do this obviously is more privacy forward, I think, in terms of building trust with relevant stakeholders and all these kinds of things.
Starting point is 00:22:24 You know, I think a starting point is just models that are able to automatically redact very sensitive information from, you know, being sent further down a pipeline. I think that's something that's like a very low-hanging fruit that, you know, many people can do. there's also potential for HIPAA compliance within an organization. So I know some organizations working in the space are partially HIPA compliant or are kind of trying to make that claim. And I think that's something that's useful. And I think that's something that we should work towards as well. You know, I think in the longer run, I think a lot of these concerns, I think, are actually unclear in terms of how things will work out.
Starting point is 00:23:04 Like, I think there is kind of a bigger question about software of unknown provenance and how that will be. be used and regulated, you know, in the future, there could be some kind of situation in which, like, these things actually end up being very hard to scale up and apply in the real world for, you know, high-stakes settings. But I think we'll probably end up with a scenario where it'll become obvious that we need to and that we must, and that doing so will improve patient outcomes. And so then I think it'll be time to have, like, a serious conversation about what regulating these models and making sure privacy concerns are mitigated looks like. And I think, you know, I think we have yet to have that discussion.
Starting point is 00:23:39 question. Yeah, hip is kind of interesting from the context up. It was an incredibly well-intention piece of legislation, but the flip side of it is, it's really backfired in all sorts of ways in terms of actual patient good. And you see that sometimes as well in terms of some of the things that as you sign up for a clinical trial or other that you can actually do with your own data where sometimes you're constrained from accessing it. I know of one example where somebody had brain cancer, they had a glioblastoma, and it was a researcher at MIT. And he participated in a small clinical trial, and then they were unable because of compliance to give him his own data so that he could try and discover drugs against his own glioblastoma, his own brain
Starting point is 00:24:19 cancer, right? And so sometimes you see these very well-intentioned approaches in terms of the protocols around a clinical trial or on HIPP or other things that are very well-intentioned in terms of what they want to do, but then sometimes they may backfire as you start to enter the modern data world, since I think that legislature is now almost 30 years old, right? And so I just think it was set up for a world that's very different from what we have now in terms of the liquidity and fluency of your ability to interact with information and, you know, patients driving their own diagnoses and things like that. So, you know, my hope is that some of these things get rebalanced in the AI world, since it could be so valuable to things like
Starting point is 00:24:54 what you're doing. I was just going to say that is the status quo. And you've also worked on the areas of, you know, privacy preserving machine learning and federated learning. Those areas have broadly taking a backseat to, let's say, like, scaling and aligning these more centralized models? Like, do you see a place for that technology in this field? Yeah, that's a great question. So, I mean, as I mentioned before, the first couple of years in my career were really thinking more about privacy preserving machine learning and, you know, federally learning and scaling that up and coming up with new algorithms that can learn new things without, you know, sending all the data to a centralized place. And so in a lot of ways,
Starting point is 00:25:32 that has a very, very natural fit with this setting. And part of my motivation, when I first started working on the setting, was bringing in a lot of that expertise and bringing into that setting. My sense is that I think one hesitation I have there is that I think a lot of, you know, the most impactful work that's going to happen in this setting is going to happen with the largest and most capable models, at least for the next few years, it seems like. And I think that, like, one thing that we're seeing is that even without
Starting point is 00:26:02 any patient health information put into these models. Like, for example, MedPom and MedPom 2 are trained without any patient health information. They, they're just kind of taking all the knowledge of POM and POM 2 and then just kind of aligning them and making them behave in a certain way. I think at the short term, there is this kind of thing that we see where models like GPD4 and MetPom and MetPom 2 are able to do, you know, surprisingly well without any patient health information. And so it seems like we can get fairly far with that.
Starting point is 00:26:30 I mean, the longer run, I do think that, like, you know, coming back to that question of data and how do you think about how to train a model depending on how much data you have and how relevant you have, how relevant that data is, the ideal thing would be to have access to all of the data, but in a privacy preserving way, in a way that people are in control of their data, are able to revoke access to that data and are able to kind of benefit from that shared understanding of their data. And so that's the kind of the ideal world. But I think there are like real world obstacles to do. doing federate learning on health data, which actually kind of increased activation energy
Starting point is 00:27:04 to the point where in the next few years, I doubt that like the biggest advances are going to come from using federate learning approaches. But I think there are kind of intermediate solutions, which people often sometimes refer to as federated, but maybe are not technically federated, which are things like trusted execution environments or other environments in which models are running, but, you know, don't have, the folks at Google don't have access to the data or direct access to the models. And so there's this ability to kind of silo that from, you know, silo any patient health information in the future potentially or, you know, any other data that's quite sensitive from engineers or other folks at, you know, big companies or small companies.
Starting point is 00:27:46 Yeah, going back to perhaps more promising near-term areas of research, you've had this idea of building a medical assistant as a sort of laboratory for safety and alignment research. Can you talk about that? Yeah, absolutely. I mean, this is a lot of what got me thinking about the setting, especially coming into the setting as somebody who, you know, didn't have much of a medical background in terms of expertise. I was really thinking about, you know,
Starting point is 00:28:13 what are the big things that I could do to help shape the trajectory of AI or nudge it in a more beneficial direction? And thinking about AI safety seriously in terms of both short-term and longer-term risks, I think was important to me. And so, you know, one thing I've been, become more convinced about over time is this idea that, you know, many organizations right now, Google, deep mind, anthropic, open AI are right now looking at the idea of a general chat assistant and kind of instead of like doing alignment research in a vacuum, are looking
Starting point is 00:28:45 at that setting as a way in which we can think about kind of better refining these models and better aligning them to human values. I think there's a good chance that this setting, this medical setting, for example, medical question answering, or maybe more broadly, I think ends up being a better scenario to study concerns about technical safety and to mitigate concerns like misaligned with human values or hallucinations or things like that. And so, I mean, I think this comes down to things like making sure the incentives are aligned with respect to releasing products. So, for example, I think if any organization wants to release products in the space, it actually needs to work on these problems more so than, I think, chat GPT. I think it also comes
Starting point is 00:29:23 down to kind of the stakes of the setting. I think everybody feels like the stakes of the setting are high enough that everybody feels like these issues are especially important and there's no debate about that. And I think there's also like some more subtle technical points like I think one issue that, you know, alignment researchers are now working on is the idea of scalable oversight, which means how do you give human feedback to a model when human feedback might not be super well-informed or it might be unreliable because AI capabilities are starting to reach human level. And so when we start to get to that point, like things like RLHF start to fail and starts to
Starting point is 00:30:02 become unclear what to do. And so I actually think the medical setting is a scenario in which this is already more obvious. So you're already in a setting in which you need experts to be able to evaluate answers. And one thing we're seeing with MedPom 2 is we get closer to physician-level performance on medical question answering is that it's hard to tell the difference anymore. It's hard to tell the difference between different models. It's hard to tell the difference between models and physicians. And when you're at that point where it's uninformed oversight, then it becomes very tricky to think about aligning to human values. And so that problem is super well motivated in the
Starting point is 00:30:36 setting. And that's something I'm very excited about. What do you think is a solution to that? Because if you look at the gaming analog, which is probably a bad analog here, right? Once machines were better than humans at things like Go or chess or other things, people started learning off of the things that the machines were doing that were unique or creative or different or the problem solving was very different. And if we really want this technology to be incredibly valuable for medical applications, in some cases, we may end up with these suggestions that will really work well, but that to your point, people may misinterpret or misunderstand. And so how do you think about evaluating things when the AI will be better than
Starting point is 00:31:14 a person at medical adjudication or better than an expert? Yeah, this is, I mean, this is, you know, really, really interesting question. I don't think I have all the answers, but I think there are approaches that, you know, people at Google and other organizations have been looking at. I mean, I think a couple ideas here that I think are interesting and useful. One is the idea of kind of self-refinement or self-critique of these models. And so this is the idea that these models can take their own responses, give critiques often guided with human feedback.
Starting point is 00:31:46 And so that's where the place where human feedback comes in. some of these techniques, there's no human feedback. In that case, I'm not sure that's as valuable. Give critiques guided by human feedback and then use that to produce better answers. That's one line of approaches. I think a second line of approaches is around debate. And so the idea here is that it's easier for a human to judge a debate between two different answers than to judge the answer itself.
Starting point is 00:32:13 And so the kind of standard for verification is a bit lower here. And so there's that ability for humans to be able to judge a response that potentially they wouldn't be able to judge otherwise via things like debate. And so that's another thing. I mean, another thing, which is people are working on as well is thinking about how we can take AIs that are less capable and use that, use them to kind of supervise other AIs that are more capable. And so this is kind of the motivation. I mean, this is partly the motivation of RLHF as well, even though it's about human feedback. It's about training a reward model that takes into account human feedback, and then at that point, it's AI feedback from then on it, and then you use your RRL algorithm, and then you get rewards from your reward model. RL, AIF, or constitutional AI, you know, kind of builds on that idea, but there's also limitations to that approach as well.
Starting point is 00:33:02 I mean, I think if you ask, you know, researchers across all these organizations, have we solved this problem? Do we know what we're supposed to do? I think most of them would say no. It seems like a pretty consequential problem, so I'm excited for more folks to work on it. Yeah, one thing that I feel like would also sort of be generated as a side effect of all this is just you end up with these really interesting closed loop data sets over time that may be unique outside of an EMR or somewhere else or a really robust medical record system. Because if you have effectively physicians assistant or something else, and then you have the endpoint of what happened based on treatment, you actually have a really interesting retrospective data mining training set. Yeah, I mean, I think that's like another opportunity for feedback for these models, which could have a huge impact on the world. Yeah, it'll actually be data-driven medicine, which I think, you know, sometimes happens, but sometimes doesn't.
Starting point is 00:33:52 So it's very exciting. I guess one more question is just, you know, there's amazing potential here. And if I look at the history of medical technology, you know, in the 1970s, there was something known as the Mycine project at Stanford where they built an expert system, which was an old computer program of its time or a computer program of its day that was sort of a precursor to some other things that eventually happened in AI. they had an expert system that outperformed all of Stanford's medical staff on the prediction of the infectious disease that somebody had. So 40 years ago almost, we had a machine that outperform people in terms of diagnosis, but it never got adopted. And so often when I look at medical technologies, there's this almost like anti-adoption curve. In some cases, for the things that may be most impactful, how has the medical field embraced or not embraced these AI
Starting point is 00:34:43 models. Is it different this time? Are people excited about it? Are they not excited? Does it really depend on the type of physician? I'm just sort of curious, like, what the reaction has been from the medical community to date. Absolutely. That's a really great question. You know, I think when we started this brain moonshot, which we call it within Google, that was actually our motivation. It was really to think about the fact that these models had already, kind of already exist, and there was this opportunity to catalyze the medical AI community to really think about them carefully, think about the promise there, and to catalyze the AI community to think about how we can resolve any remaining limitations that would prevent real-world uptake.
Starting point is 00:35:22 And so this was really our goal. And I think when we started this, there was much less conversation about the potential for large language models and foundation models for healthcare. And I think, I mean, partly because of, I think largely also because of other work that's gone on, you know, with GPT4 and excitement around that, I think there's much, much more conversation about, you know, how these models can be used in the setting in a productive way. I think that's really, really exciting. And I think there's a lot of optimism, I see, but there's also a lot of justified concern about, you know, the potential limitations of
Starting point is 00:35:53 these models and how we can, how we can get over them. Personally, I mean, from what I've seen, from giving talks to different groups and chatting with different folks and different stakeholders, I think there's like a, you know, a widely held optimism about this technology and about the potential. But I think there's also kind of a little bit of fear that I think, you know, people have seen in other domains, like, I think programmers often feel a little bit of fear when they see GPD4, for example. And I think it's not necessarily a fear that, like, jobs will be replaced in the short term or things like that, but it's more of a fear of, like, look how fast things are moving. This is, this is nuts. Like, I think about
Starting point is 00:36:29 just an improvement from MedPOM 1, GPD4, MedPOM 2 in three months. Like, it's absolutely crazy. And I think we, you know, it's definitely an inflection point for AI, as you guys know. And I think it's definitely a good time to think about, you know, what are the most important problems we need to solve versus like getting caught up in the hype wave and, you know, forgetting to solve the most important problems as well. I think back to a lot sort of point earlier, thinking about the actual benefits of these technologies at scale, if adopted, even at human and at some, you know, defined superhuman level, should we come to some sort of agreement as a democratic society? about what Eval looks like is really important in that if you just think about what the status quo is for somebody who has a complex case in a median background in America, what do they know about the error matrix of their doctor and what, you know, in a field that's also advancing in parallel to AI, like the specific rare condition that they have, it's not super
Starting point is 00:37:39 encouraging, right? And so in terms of leverage for a field where the status quo is not sufficient, not as a comment on, you know, the class of physicians and researchers, but just in terms of the quality of care that we want to be able to offer every person, it seems like we want to set a reasonable safety case, not a unlimited safety case, right? Which is I think is one of the things that has held back other sort of mission critical AI applications in the past. Maybe on that note, like, one last ask for you in terms of encouraging some optimism, you know, you're working on the state of the art in this field and thinking about the barriers to the applied use, like five years from now, like, how do you hope we are using
Starting point is 00:38:28 large language models in the medical field? Yeah, I guess I think about this in two broad buckets. I think there are two broad types of things that we can do for large language models in the medical field. I think the first is increasing the standard of care very broadly. And so that looks a lot like, you know, increasing access to health information, providing assistance to physicians. So the radiology example I gave earlier, potentially clinical decision support, like double checking a doctor's decision or quality assurance for a radiologist's report. So if, you know, a radiologist is dictating a report, they say no plural effusion scene, but then it's written down as plural effusion scene, then maybe an AI double-checked that and just make sure that's
Starting point is 00:39:10 what was intended. I think augmenting telemedicine, I think, is kind of a short-term opportunity that I think in the next five years is very achievable. I think the other big bucket of things that is very much achievable is augmenting scientific workflows. And I think this could be a longer-term thing than five years, but I think there's also short-term things that we can do as well. So thinking about looking at correlations across modalities and existing data to find novel
Starting point is 00:39:35 biomarkers for existing diseases that we know about, or kind of using large language models as research assistants. So I think there's already a lot of work on the idea of literature search and augmenting literature search with large language models. I think there's a lot of opportunity there. And that goes a little bit beyond, you know, what Metpom is likely going to do. But I think that's something that I think, you know, it's going to be really promising with respect to the future of AI.
Starting point is 00:40:01 Because I think it's the long term, when things go really well with AI, It's going to be because we've solved a lot of the most pressing scientific problems of today. And I think that's going to be because it augmented scientists. It helped scientists. It helped us figure out what are the things that we're missing. And I think there's a lot of potential there. So I'm also really excited about that in the long term. Awesome.
Starting point is 00:40:22 Rapping up, is there anything else you think we should touch on? Yeah, absolutely. I mean, I think for real world uptake of these models, there are a few large language model capabilities, in some cases that already exists, but we need to figure out the right way to do them. And I think a few of them are just, you know, multi-modality, which is something that we were working on. We kind of previewed last week at I.O. And grounding and authoritative sources, I think is important as well, thinking about how these models can use tool form or like approaches to, for example, query authoritative medical information like a human would, but potentially better. I think that's also, you know, one way of getting around the risk-averseness that you see in this area with respect to health information.
Starting point is 00:41:03 if you're able to attribute information to an authoritative source, I think that has been something that has progressed this area in big companies before. And so where, for example, Google is doing that with health information is largely because it can attribute things to the Mayo Clinic and other organizations. And so I think that's going to be really important for moving this forward. I think also solid research, thinking about better ways to improve the ways we are taking in human feedback. I think, you know, the jury's still out with respect to how to best, you know, collect human feedback even. I think people are still debating things like whether or not,
Starting point is 00:41:41 you know, parallel-wise comparison versus rewrites are the best things to do. And, you know, that's a valuable thing to think about. I think another thing to think about is how to actually use that human feedback in the most valuable way, especially given all the scalable oversight concerns that you guys mentioned. I think that's, you know, a significant limitation of met mom as it is today. I think there's a lot of exciting things to do. And I think a lot of these questions are like foundational questions for AI more broadly, but, you know, become more acute and more relevant in the setting. It's been great to have you on No Priors. Thanks for doing this.
Starting point is 00:42:08 Yeah, thanks so much for joining. Thanks, guys.
