Science Friday - Why so many studies can’t be replicated

Starting point is 00:00:02 Hi, I'm Ira Flatow, and you're listening to Science Friday. How do we know what we know? That's where science comes in. It gives us a method for testing our assumptions and getting trustworthy results. And that's really the definition of research. First you search with an initial study, then you research with follow-up studies to confirm. But some researchers have warned that many scientific studies cannot be replicated. This is what's called the

Starting point is 00:00:32 replication crisis. To find out how deep the problem goes, the U.S. Defense Advanced Research Projects Agency, you know that as DARPA, funded one of the largest analyses of social science, called the SCORE project. With the help of hundreds of researchers, they checked the results of thousands of papers across economics, education, and psychology. And the results? Researchers could only replicate half the papers analyzed. Here to talk about the project and give an update on how the scientific world is trying to change itself is Dr. Tim Errington, Senior Director of Research at the Center for Open Science, one of the leads on this project.

Starting point is 00:01:14 Joining him is Dr. Abel Brodeur, Professor of Economics at the University of Ottawa, and founder of the Institute for Replication, who recently released the results of a separate replication study. Welcome both of you to Science Friday. Thanks for having me. Thanks for the invite. All right. Thank you for joining us.

Starting point is 00:01:32 We have covered replication on the show in the past. So what made this project different? And why was DARPA involved? Yeah. So there's a couple aspects to that to describe. One reason that DARPA was involved, and I can't speak on behalf of them, is that there were prior projects that were reported having challenges in terms of having confidence in research, right? How much confidence should we have in anything that's being published?

Starting point is 00:01:59 I think the wrong way to do that is it's binary. It's published. It equals it's true. Not published means it's not true. So that's definitely not the case. And DARPA was paying attention to this. They use research. They use the social behavioral science research a lot. And they were trying to figure out methods to sort through that, to sort through how much confidence they should have in any given research finding. And so because we have a challenge in understanding how much confidence we should have in something because of this replicability issue, they wanted to invest in seeing if there were ways to develop automated tools to assist with that. And so SCORE is a project that was designed to not just repeat the experiments, but to use that as a ground truth in the development of AI tools to assist with confidence assessment. Again, something that's largely done just by humans. So that's one key aspect was this was started in 2019, so before a lot of that AI discussion

Starting point is 00:02:53 that's going on now. The other part that makes this quite unique is the breadth. The breadth of this in terms of disciplines across the social behavioral sciences, as you mentioned, and also the volume. We looked at 10 years' worth of journals across 62 different journals. So that is a much larger scale than any prior project. What's at stake here? I mean, how are the papers you analyzed being used in making policy decisions?

Starting point is 00:03:18 I imagine that's an important part of it. Yeah, there's a lot of, I think, in terms of the breadth of the research that's used, it's used in a lot of different aspects, right, from policy to, I think, even our own individual actions. So let me give you some examples of the type of research that was included in this project. It could be something looking at how public employees leave the U.S. Civil Service. You can see how that would have an impact on policy decision-making. Or, you know, does being the victim of a crime spur political participation, right?

Starting point is 00:03:47 Again, you can see the impacts that that would have in terms of various policy or decision-making. And those are the types of research across the social behavioral sciences that this project was investigating. Mm-hmm. Now, Abel, do you have anything to add to that? Yeah, I think, you know, one of the big problems is there's a lot of good research out there. And unfortunately, there's also research that is less good. Just like to put this in context, a paper might not replicate for many reasons. It can be just like there's data missing.

Starting point is 00:04:17 You're trying to rerun it, but somehow it produces different results. Maybe things are not robust when you start playing with the data. Or maybe you use completely new data and you get a different result. And I think what's really cool about all these projects coming out right now is we get a much better idea of actually there's problems everywhere and they compound. Tim, what were some of the common trends then that you saw in the SCORE project? Right. One would be just sharing, data sharing. So if you want to ask the question, somebody publishes a finding, they report some statistic. Can I have confidence in just the reporting of that statistic? Well, in order to do that, I need the data and I need the method that they used, just to repeat that,

Starting point is 00:05:03 that simple reproduction step. But it's hard to do that when you don't share data. Now, are you saying they don't, just to jump in there, are you saying they don't share the methodology either? I mean, so you can reproduce what they've done? Right. So there's a lot of dimensions here that don't always get shared, which is interesting because you think of that hallmark of science is, I share everything. And then that's actually, that's where I have confidence in it. And there's a lot of nuance here. So I'll first say the high level statement, which is, yeah, across the board, there's probably not as much sharing as one would expect or hope for. That includes the

Starting point is 00:05:32 data, the analysis, even the methods of collecting the data, the sources of data. It starts to get really complex. And so when you don't have that information, you're stuck doing a couple things. You just trust it at face value. It got published. It must be true. But that's not how science works. Or you're left making assumptions to fill in the gaps. Both of those are not ideal. So one thing that we can do is start sharing more. And you have to incentivize people to share data. It doesn't mean it's more reliable. It just means that now you can interrogate it the way that Abel was just talking about. Mm-hmm. Abel, you also just released a new study, analyzing economic and political research. I'm wondering how yours was different from SCORE.

Starting point is 00:06:12 What did you find? I think the results are much more positive, and there's a lot of reasons for that. And I think the main one is we looked at recent papers. So we started this project in 2022, so we look at papers published in 2022 and 2023. And data sharing is going up. There's less and less hesitancy, I would say, in economics and political science in terms of data sharing. That doesn't mean everything's perfect. We still find like coding errors in 15, 20% of papers.

Starting point is 00:06:44 Results are robust maybe 75% of the time. So, you know, that's better than let's say 50% or some of the rates that were documented. So I would say things are just getting better. That masks a lot of things, though. So for instance, at the Institute for Replication, we mass reproduce studies in economics, political science, now psychology, public health, environmental research, and data sharing practices are completely different across fields.

Starting point is 00:07:13 So I would say things are getting better. That's mostly what we document, I think. But it's definitely far away from being perfect. Are they getting better because we recognize the problem? Or did they just get better for some other reason? I think that's part of the story. I look back at my own research that I was doing back in 2010 as a master's student. And my coding was terrible. And I just, I'm not laughing. I'm sorry. And that's fine. I laugh at it too because sometimes I look back at it and I'm like, dear God, that was bad. But also I remind myself

Starting point is 00:07:48 that back then it was virtually impossible to look at other people's codes. There were no codes online. Whereas I look at my own students, PhD students nowadays, and they have access to so much, so much coding of other researchers. And coding, and coding for the layperson, what does that mean? Yeah. So, you know, let's say you're doing a study. You're interested in the effect of like minimum wage on, I don't know, like unemployment rate or the effect of a policy on deforestation in the Amazonian rainforest. Of course, you're going to gather data, maybe satellite data, to understand, like, is there

Starting point is 00:08:23 deforestation happening? Then you have another data set on, like, the policy change, but then you have some control variables to make sure that you get a causal effect. But then you need to merge all these data together, so you need to code that in Excel, or it could be in R, Python, et cetera, statistical software. But then you need to run the regression. You need to actually do the analysis in the statistical software. And you can make stupid mistakes. It could be like duplicates, replicates. You have the same individual again and again by accident. It can be you say something in a paper, but actually in your codes, you did something else. You didn't really look at deforestation using this specific way of measuring it. But I think just code review, like,

Starting point is 00:09:07 it's kind of crazy, but imagine you do a research paper. You have your research assistant doing the coding. You're done. You submit this to a journal, and they accept it. And it went through peer review, there are external experts that looked at it. And then it's published. During the entire process, nobody ever looked at your data and codes. They trust 100% what you've done. And what we're trying to do is to go after that and be like, well, let's have a look at the data and codes to see whether there's errors and things like this. And you would think this happens, it should happen throughout the process of publishing. But the norms are just not there. So things need to change. And I think they will at one point because of AI.
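
The duplicate-row mistake described above, where the same individual or unit appears again and again by accident after data sets are merged, can be illustrated with a minimal sketch. This is not code from either project: the function names (`find_duplicate_keys`, `merge_one_to_one`), the column names, and the toy numbers are all hypothetical, and plain Python is used here only to keep the example self-contained.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return the key values that appear more than once in a list of dict rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def merge_one_to_one(left, right, key):
    """Join two lists of dict rows on `key`, refusing to run if either side repeats a key."""
    for name, rows in (("left", left), ("right", right)):
        dupes = find_duplicate_keys(rows, key)
        if dupes:
            raise ValueError(f"{name} table repeats {key} values: {dupes}")
    lookup = {row[key]: row for row in right}
    # Keep only rows whose key exists on both sides (an inner join).
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


# Hypothetical toy data: forest loss by region, and a policy indicator by region.
forest = [
    {"region": "A", "forest_loss_km2": 12.0},
    {"region": "B", "forest_loss_km2": 30.5},
]
policy = [
    {"region": "A", "policy": 1},
    {"region": "B", "policy": 0},
    {"region": "B", "policy": 0},  # accidental duplicate row
]

print(find_duplicate_keys(policy, "region"))  # ['B']
```

Calling `merge_one_to_one(forest, policy, "region")` on this data raises a `ValueError` instead of silently producing extra rows, which is the kind of check a code reviewer might look for before a regression is run.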

Starting point is 00:09:50 Yeah. Tim, what do you think of this? Yeah. You asked a really good question of like, is it just happening on its own or, you know, is part of this discussion actually part of what's causing these changes? Because I absolutely do agree. I think things are improving. And I think there's a couple reasons for it. One is we're talking about it.

Starting point is 00:10:08 We're talking about it here, right, on Science Friday. So this is an example of it getting to the point where it's more common to have these discussions. I completely agree with the point that the norms are the biggest driver. And in many cases, I wish my graduate education taught me how to replicate someone else's results as part of my educational practice. Interesting.

Starting point is 00:10:28 Because the best way to learn how to document the methods you used or the way that you analyze your data is to repeat what someone else did and see if you can get the same result. It's easier to see someone else's error than it is your own. And the best way to do it is, yeah, replicate. But it's not going to be just the researcher. So to be really clear, like there's a lot of actors in the system, right? The journals have a role. So as they change their policies, that helps. Institutions have a role in this as well. What do they hold their researchers accountable for, as well as

Starting point is 00:10:58 how they train students? And funders have a role, right? A lot of this, if we're talking about the US, that's NSF, NIH largely for the most part, but every funder has a role in terms of what they are asking of their researchers and how they support the research. So to say that differently: if you just support the research and the paper, but you don't support that rigor behind it in terms of sharing and documentation, then you don't get that rigor and documentation, and now we're essentially having a challenge as we sort back through the research one more time. We're having a hard time finding things because it wasn't prioritized. Isn't that eventually going to bite you later? You mean in the sense of like, I'm a researcher, I publish something. Yeah. Maybe share everything.

Starting point is 00:11:37 Yeah, I mean, so there's a couple thoughts here, right? So one would be, oh my gosh, there's somebody shady hiding something. There's always bad actors out there. I'm sure that's the case. I think a lot of this is honest, like just we're busy people. This is really hard. Research is really, really hard. And I think the vast majority of what we're finding is just, wow, when I just kind of rush through trying my best, honest mistakes, these little ones, can pile up over and over.

Starting point is 00:12:04 And especially since we're so driven by positive results and that kind of really positive storytelling, it's very easy not to think that there's a mistake when you find something exciting. It's only when you don't find what you expect. That's when you scrutinize. So I think that's actually what's going on for the most part. At some point, we're investing in the wrong spot at the wrong point in time, which is, again, when we get back to the point of replication of what DARPA was trying to do, that's the big million dollar question.

Starting point is 00:12:29 How much confidence do I have in a given result at that given moment in time? It's never going to be 100%. After the break, how big a role is AI going to play in replication? Stay with us. AI was mentioned a little time ago in our discussion. Tim, what role do you see AI playing in the future of replication work? That's a great question.

Starting point is 00:13:03 So I think I see two futures. I can't quite tell which one we're going towards. We're probably going towards both. I see one where we're kind of entering it a bit, which is it's easy to see AI generated anything these days. And especially since a lot of the scientific process is communicated through written word and journals, it's very easy to have AI generate that,

Starting point is 00:13:24 which means it's really challenging, even more so, to say, how replicable is this research when you're like, wait, did a human actually do this? Or is this just AI generated language, right? Because it's really clever. So this is going to cause, and it already is causing problems in terms of trying to understand what do we know from AI versus non-AI. But I actually think there's a huge promise at the exact same time, right? So when we think about some of the low-hanging fruit challenges we have, part of the challenge of figuring out how to share your data is how to describe your data or where to deposit your data. Those can get improved with AI. If we do have access to data and we have access to code, you can actually start

Starting point is 00:14:00 to have AI agents run the reproductions themselves. That's very simple. In fact, I already know tools that can do that. If you give them the code and the data, they can run it themselves again. Now, this is where it starts to go a little farther. If you want to have them develop AI agents that can do plausible different analytical strategies, which I think is amazing, that robustness check, well, we're getting there, right? There's a large universe of plausible analyses. The trick is going to be, do AI agents just do everything? And in that case, sometimes it's really good designs. Other times it's just gobbledygook. Right. It's like, I don't know, they just threw a bunch of variables into an algorithm and popped out an answer. But that would be an amazing tool because, again,

Starting point is 00:14:39 we as humans are really good at pattern recognition. But if we're only picking one analysis, we know that's not looking at the possible universe of plausible approaches to test a hypothesis. But AI could help us. All right. As I wrap up here, for people listening, Abel, let me start with you. What do you think the takeaway is for both your work and the SCORE project? I think some people listening might think, what science headlines can I trust? I think that's a fair statement, unfortunately. And so the way I tell people, like the way I consume research personally is,

Starting point is 00:15:14 if I see a new result, something like innovative, like the first time that I hear about something, I don't believe it. And I wait until other researchers find a similar result, again and again. And maybe after three, four, five times, I start to believe it. And there's nothing wrong with that, I think. Like, I know we like headlines and we like progress, but I think there's a cost to that.

Starting point is 00:15:41 And the cost could be that we start doing lots of research along the same lines without making sure that actually the foundations of the initial result are strong. My personal problem is I don't know which results I can really trust versus those that I cannot trust. And it's very annoying. And I'm patient because of that. And I don't put all my eggs in the same basket. And I wait to see whether things replicate, whether other researchers are going to find the same pattern and so on and so on. The same way as I think during COVID, the first time there was a vaccine, people were like,

Starting point is 00:16:15 ah, you know, a bit skeptical. But then two or three companies came up with vaccines. And now you're thinking, okay, maybe there's something to it. And I think it's the same for pretty much anything in life. You just need things to be repeated and replicated. And that's how you build confidence in a scientific result. Tim, you want to weigh in on that? Yeah. Science is a process. It's really easy to forget that and it's really important to remember it, right? Each of our findings, each paper, each headline we read about, that's just a piece of a puzzle. We're trying our best. We're humans and we're at the forefront of knowledge. It doesn't mean that somebody publishes a paper and all of a sudden that is quote unquote the truth.

Starting point is 00:16:56 All we're doing is trying to get closer and closer and sometimes going backwards is closer. The second thing is think about all the amazing benefits of research in our society around us every single day. And we just told you it's really not that optimal, right? It's not optimized that well by applying the scientific process to how we do science. And so to me, I think, wow, this is a great opportunity for us.

Starting point is 00:17:18 We're doing amazing things, and there is a lot more that we can do if we can keep improving the way that we conduct and share our research. Well, hopefully giving some light to it here on this show will help. I want to thank both of you for taking time to be with us today. Thank you so much. You're welcome. Thanks for having us. Dr. Tim Errington, Senior Director of Research at the Center for Open Science, and Dr. Abel Brodeur, founder of the Institute for Replication.

Starting point is 00:17:45 This episode was produced by Dee Peterschmidt. I'm Ira Flatow. Thanks for listening.
