Short Wave - When AI Goes Wrong

Starting point is 00:00:00 You're listening to Shortwave from NPR. In 2017, a medical company called Epic Systems rolled out a cutting-edge program to hospitals. It claimed to use artificial intelligence algorithms to predict which patients are at highest risk of sepsis, which is a deadly condition where the body responds improperly to an infection. Sepsis kills a lot of people. I think it's one of the leading causes of death in U.S. hospitals, and even throughout the world. So if an algorithm could indeed predict when a patient is at risk of sepsis, it would be a really big deal.

Starting point is 00:00:35 This is Sayash Kapoor. He's a researcher and PhD candidate at Princeton University's Center for Information Technology Policy. He looked into Epic's sepsis prediction model for a blog and book project he's co-writing called AI Snake Oil. For the first four years, things seemed to be going very well. You know, hundreds of hospitals across the country had adopted it, and that seemed to be proof enough that doctors and clinicians trust this tool. Then, in 2021, researchers at the University of Michigan decided to actually analyze how accurate Epic's tool was at predicting sepsis in their hospital. Initially, Epic had claimed

Starting point is 00:01:14 an accuracy of between 76 and 83% at predicting whether a patient gets sepsis in advance. Now, what this team found was the real accuracy of the tool in their hospital was only around 63%. Further investigative reporting by STAT News revealed the algorithm was prone to missing cases and flooding health care workers with false alarms. So this was a huge setback. What went wrong? Well, Sayash says it was a number of things.

Starting point is 00:01:43 First, these algorithms are trade secrets, so no one outside Epic knew how it worked, which meant it was very hard to test. Clinicians and nurses and doctors who were already over there, did not really have time to validate these claims in depth. And I think part of it was also the fact that algorithms have been claimed to be these silver bullets,

Starting point is 00:02:02 which can solve a lot of societal problems. And so it might not even seem like it's possible that algorithms can go so horribly awry when they're deployed in the real world. But they do. There are examples from around the world. Take the Netherlands. An algorithm there used by the Dutch tax authority

Starting point is 00:02:19 falsely accused tens of thousands of parents of fraud and demanded they pay back years of childcare benefits. In some cases, adding up to 100,000 euros. This is potentially a huge problem because predictive AI is increasingly being used everywhere. Predictive AI is being used by our bank to predict if you'll pay back your loan. It's being used by a hospital when you enter to see if you should be prioritized for high-risk care versus low-risk care. It's being used by our insurance companies to predict how much you need to pay in the future or how likely you are to sort of be involved in an accident. And it's not just

Starting point is 00:02:56 businesses. Scientists across disciplines are turning more and more to AI as a research tool without much scrutiny. Often, machine learning is enough to publish a paper, but that paper does not often translate to better real-world advances in scientific fields. So today on the show, looking beyond the hype over AI to the hazards, because Sayash says there are a lot of them, particularly when it comes to predicting the future. I'm Aaron Scott. You're listening to Shortwave,

Starting point is 00:03:26 the science podcast from NPR. Before we get back to the show, we want to take a minute to say thank you so much to our Shortwave Plus supporters and anyone who donates to public media. After all, public media means that you, the public, support it. Everything you hear from the NPR network really does depend on your contributions. And for anyone listening who isn't a supporter yet, right now is a great time to get actively involved in creating a more informed public. That's our whole mission at NPR.

Starting point is 00:04:01 That's why we're here. If you like perks, Shortwave Plus offers sponsor-free listening, and if you just want to make a tax-deductible donation to your favorite station or stations in the NPR network, that's great too. We've even had NPR Plus subscribers make additional contributions. What really matters is that you are part of the community that makes this work possible. Listener support is a powerful resource. It takes all of us doing what we can, when we can, to keep this public service going. Please give today at donate.npr.org slash shortwave or explore NPR Plus at plus.npr.org. Thank you. Okay, so, Sayash, to begin, would you just give us a sense of how widespread AI and machine learning are in scientific research these days?

Starting point is 00:04:49 And what sort of things researchers are using them to do? Absolutely. I think literally every single scientific field has some or the other form of machine learning. Broadly speaking, machine learning can fall into one of three types of tasks. So the first task is in searching for things or in trying to find sort of useful things from a large space of objects. So one example of this is astronomy, where you might use machine learning, for instance, on astronomical images. Or another example is in material science where you might want to simulate what kinds of materials would be useful for many different types of tasks. A second example of machine learning is in automating judgment. So this is where human judgment is offloaded to machine learning in order to sort of make judgment on similar tasks.

Starting point is 00:05:37 A third category of the use of AI for science is predicting future outcomes using AI. And this is where a lot of the shaky evidence or murky evidence mostly lies, because we have very little evidence to say that predictive AI can actually be used for real world tasks and it can actually perform well at these tasks. And kind of unpack that. Why is there so little evidence, or why is predictive AI on such shaky ground? I think the core reason is that predicting the future is hard. This might seem like common sense, but for a lot of scientific fields, people have been trying to do this for the longest time. And yet, for instance, there can be shocks in a person's life, which completely change their life trajectory. If someone wins a lottery or has an expensive emergency room ticket, we really don't know how to model these things using machine learning. In fact,

Starting point is 00:06:28 one of the most exciting pieces of evidence in predicting people's life outcomes occurred in the last three years. This was a challenge called the Fragile Families Challenge, which looked at data from more than 4,000 families across the US and collected large amounts of data about each of these families. It turned out that even with this large amount of data, hundreds of researchers after they tried to build complex AI tools to predict their life outcomes, none of these tools was much better than a very simple linear regression model, which used just four features about a person. So it really does seem that unlike many other applications,

Starting point is 00:07:05 where more data and complex methods really do help, for predicting people's life outcomes, it's really hard even with large amounts of data and lots of features about people. And one reason for that is the idea of leakage. Can you define what leakage is and explain why it's a problem when it comes to developing these predictive machine learning algorithms? So leakage arises when there's an overlap between the data that a model is trained on and the data that the model is evaluated on. So essentially, it's like knowing the answers to an exam question before an exam is even conducted. When AI models are trained on the data that they will later be evaluated on, they can simply memorize these answers without

Starting point is 00:07:50 actually doing any better on the real world task itself. We saw this with Epic's sepsis prediction tool as well. One of the features that the model used as its inputs was whether a patient had already been prescribed antibiotics. Now, antibiotics are often prescribed as a response to sepsis. So what this meant was all of the cases where doctors or nurses or clinicians had already diagnosed a patient with sepsis were counted as successes for the algorithm and not for the nurses or clinicians who had actually diagnosed sepsis. So I think this was one of the main reasons why there was such a huge difference between the algorithm performance that Epic claimed for a sepsis prediction

Starting point is 00:08:28 compared to the actual performance in the real world. Now, leakage is widespread in machine learning for science. We have seen leakage in over 200 papers across a dozen and a half fields. And what we've seen again and again is that the same types of errors of not separating the training and the test data seem to arise in distinct scientific fields over and over again. Which just shows how systematic this problem is

Starting point is 00:08:56 and how hard it can be to find these solutions because at least so far we haven't seen scientists across these fields come together to find common solutions. And another problem that you've written about is sampling bias. Absolutely. So sampling bias refers to this phenomenon where AI is trained on one type of population, or samples from one type of distribution,

Starting point is 00:09:19 but then it is expected to work on a completely different type of sample. And this sort of problem has real-world consequences in something like a job prediction software, correct? Exactly. So a lot of services in the last few years have started claiming that AI can be used to predict if a job candidate is fit for hiring. Now, one group of journalists looked into the claims from one specific software vendor, and they found that very simple changes to a candidate's look. For instance,

Starting point is 00:09:51 if a candidate started wearing glasses versus not wearing glasses, or even if they put a picture of a bookshelf in the background, led to significant changes in their predictions of how well they would do at a job. Now, this could have huge impacts on a person's life outcomes, right? Like, this could basically mean that a person is screened out of a job based not on their capabilities or not on their, like, competence, but based on what's in their background. So, I mean, in many ways, these predictive outcomes, as you've talked about, are these black boxes that, you know, sometimes are discovered years later to be doing a poor job. Can you talk a little bit about why leakage and sampling bias and all these problems

Starting point is 00:10:37 are so rampant and why it's so hard to uncover them? Absolutely. So I think one reason is the fact that machine learning has been sold as this tool that can use hundreds or perhaps even thousands of features for making predictions. And so that makes it hard to investigate any of these tools in a lot of depth. Another reason that we've discussed already is the lack of real world evaluations. But I think by far the biggest reason for why leakage is so prevalent in scientific research is that often the last endpoint of a research project is just publishing a paper. Now, this is in contrast to a real-world application of AI like Google Photos using image recognition, where if these researchers find an issue with their product,

Starting point is 00:11:23 engineers can go back and fix these issues. The last step in a research project is often the publication of a paper itself. It's really hard to correct the scientific record, simply because researchers are not really incentivized to go back and look for mistakes. And existing results are often treated as canon, simply because they've been peer reviewed. But I think study after study has shown that peer review is not enough to find leakage, often because peer reviewers never have enough time to dive into this code that the people have written. And often leakage can be extremely subtle as well. An example of how subtle leakage can be is that one of the studies we found an issue with in civil war prediction.

Starting point is 00:12:02 The error that occurred was in one line of code out of 10,000 lines. So unless someone is looking at each line specifically, unless they understand what's going on in each function call and so on, it's really hard to find these issues. I mean, that seems almost unsurmountable, because who has time to go through 10,000 lines of code on a paper that you're not involved in? A PhD student who's studying leakage, I suppose. But yeah, you're right. I mean, I think, to be honest, like, the way forward is to be as open as we can with scientific

Starting point is 00:12:36 research. And to end, I'm curious, do you think it is possible for predictive AI to work and what would a world look like where we are using it for all sorts of things? I think it is kind of hard to imagine what a world would look like where predictive AI is actually useful because a lot of the applications for which predictive AI is being used right now actually end up failing in the real world. We have lots of evidence for predictive AI failing in healthcare, in banking, in all sorts of domains. So I think the question that is most interesting to me is what kinds of decision-making systems should supplant predictive AI or what

Starting point is 00:13:14 should we use instead? Humans are biased. So how can we come up with better types of decision-making systems which are better than either humans alone or algorithms alone? Most importantly, I think the algorithms that we actually use need to be understandable by a broad majority of human beings. Because that's the only case in which humans can actually understand why a decision was taken, contest incorrect decisions and hopefully even prevent these incorrect decisions from happening in the first place. Sayash, it's been enlightening to talk with you about AI. Thank you so much. Thank you so much for having me. This episode was produced by Berly McCoy and edited by our showrunner Rebecca Ramirez. Brit Hanson checked the facts. Maggie Luthar was the audio engineer. I'm Aaron Scott. Thanks, as always, for

Starting point is 00:14:05 listening to Shortwave from NPR.
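
The leakage described in the interview (an overlap between the data a model is trained on and the data it is evaluated on) can be illustrated with a small Python sketch. Everything here is hypothetical and invented for illustration: the synthetic records, the MemorizingModel class, and the helper names are assumptions, not anything taken from Epic's tool or from the studies mentioned in the episode. The labels are pure noise, so any accuracy well above chance can only come from memorization.

```python
import random


def make_records(n, seed):
    """Build n synthetic (features, label) pairs; labels are random noise."""
    rng = random.Random(seed)
    return [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(n)]


class MemorizingModel:
    """Hypothetical 'model' that only memorizes its training examples."""

    def fit(self, records):
        self.seen = {features: label for features, label in records}
        return self

    def predict(self, features):
        # Anything not seen during training gets a constant guess of 0.
        return self.seen.get(features, 0)


def accuracy(model, records):
    """Fraction of records whose label the model predicts correctly."""
    correct = sum(model.predict(features) == label for features, label in records)
    return correct / len(records)


def overlap(train, test):
    """Test records whose features also appear in the training set."""
    train_features = {features for features, _ in train}
    return [(features, label) for features, label in test if features in train_features]


train = make_records(200, seed=0)
leaky_test = train[:100]                 # evaluation rows reused from training
clean_test = make_records(100, seed=1)   # fresh rows the model has never seen

model = MemorizingModel().fit(train)
leaky_acc = accuracy(model, leaky_test)
clean_acc = accuracy(model, clean_test)

print(f"overlapping rows in leaky test set: {len(overlap(train, leaky_test))}")
print(f"overlapping rows in clean test set: {len(overlap(train, clean_test))}")
print(f"accuracy on leaky test set: {leaky_acc:.2f}")
print(f"accuracy on clean test set: {clean_acc:.2f}")
```

In this sketch the model looks perfect on the leaky test set only because it has already seen every answer, which is the exam analogy from the interview. A check like overlap() before evaluation is one simple guard, though it would not catch subtler forms of leakage such as an input feature that is itself a consequence of the outcome.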
