Science Friday - Why so many studies can’t be replicated
Episode Date: April 11, 2026
How do we know what we know? That's where science comes in: it gives us a method for testing our ideas and getting trustworthy results. But some researchers have warned that many scientific studies can't be replicated. To find out how deep the problem goes, the US Defense Advanced Research Projects Agency funded one of the largest analyses of social science, called the SCORE project. They checked the results of thousands of papers across economics, education, and psychology, and found that only half of them could be replicated. Joining Host Ira Flatow to discuss the findings are Tim Errington, one of the leads on this project, and economist Abel Brodeur, who recently released the results of a separate replication study that found more encouraging results than SCORE did.
Guests: Dr. Tim Errington is senior director of research at the Center For Open Science in Washington, D.C. Dr. Abel Brodeur is a professor of economics at the University of Ottawa and founder of the Institute for Replication.
Transcripts for each episode are available within 1-3 days at sciencefriday.com.
Transcript
Hi, I'm Ira Flatow, and you're listening to Science Friday.
How do we know what we know?
That's where science comes in.
It gives us a method for testing our assumptions and getting trustworthy results.
And that's really the definition of research.
First you search with an initial study, then you research with follow-up studies to confirm.
But some researchers have warned that many scientific studies cannot be replicated.
This is what's called the replication crisis.
To find out how deep the problem goes, the U.S. Defense Advanced Research
Projects Agency, you know that as DARPA, funded one of the largest analyses of social science,
called the SCORE project. With the help of hundreds of researchers, they checked the results of
thousands of papers across economics, education, and psychology. And the results? Researchers could
only replicate half the papers analyzed. Here to talk about the project
and give an update on how the scientific world is trying to change itself
is Dr. Tim Errington, Senior Director of Research at the Center for Open Science,
one of the leads on this project.
Joining him is Dr. Abel Brodeur, Professor of Economics at the University of Ottawa,
and founder of the Institute for Replication,
who recently released the results of a separate replication study.
Welcome both of you to Science Friday.
Thanks for having me.
Thanks for the invite.
All right.
Thank you for joining us.
We have covered replication on the show in the past.
So what made this project different?
And why was DARPA involved?
Yeah.
So there's a couple aspects to that to describe.
One reason that DARPA was involved, and I can't speak on behalf of them,
is that there were prior projects that reported challenges in terms of having confidence in research, right?
How much confidence should we have in anything that's being published?
I think the wrong way to do that is to treat it as binary:
if it's published, it's true, and if it's not published, it's not true. So that's definitely not the case. And DARPA was paying attention to this. They use research. They use the social behavioral science research a lot. And they were trying to figure out methods to sort through that, to sort through how much confidence they should have in any given research finding. And so because we have a challenge in understanding how much confidence we should have in something because of this replicability issue, they wanted to invest in seeing if there were ways to develop
automated tools to assist with that.
And so SCORE is a project that was designed to not just repeat the experiments,
but to use that as a ground truth in the development of AI tools to assist with confidence
assessment.
Again, something that's largely done just by humans.
So one key aspect was that this was started in 2019, so before a lot of that AI discussion
that's going on now.
The other part that makes this quite unique is the breadth.
The breadth of this in terms of disciplines across the social behavioral sciences,
as you mentioned, and also the volume.
We looked at 10 years' worth of journals across 62 different journals.
So that is a much larger scale than any prior project.
What's at stake here?
I mean, how are the papers you analyzed being used in making policy decisions?
I imagine that's an important part of it.
Yeah, there's a lot of, I think, in terms of the breadth of the research that is used,
it's used in a lot of different aspects, right, from policy to, I think, even our own
individual actions.
So let me give you some examples of the type of research that was included in this project.
It could be something looking at how public employees leave the U.S. Civil Service.
You can see how that would have an impact on policy decision-making.
Or, you know, does being the victim of a crime spur political participation, right?
Again, you can see the impacts that that would have in terms of various policy or decision-making.
And those are the types of research across the social behavioral sciences that this project was investigating.
Mm-hmm.
Now, Abel, do you have anything to add to that?
Yeah, I think, you know, one of the big problems is there's a lot of good research out there.
And unfortunately, there's also research that is less good.
Just like to put this in context, a paper might not replicate for many reasons.
It can be just like there's data missing.
You're trying to run it, but somehow it produces different results.
Maybe things are not robust when you start playing with the data.
Or maybe you use completely new data and you get a different result.
And I think what's really cool about all these projects coming out right now is we get a much better idea of actually there's problems everywhere and they compound.
Tim, what were some of the common trends then that you saw in the SCORE project?
Right. One would be just sharing, data sharing. So if you want to ask the question, somebody publishes a finding, they report some statistic.
Can I have confidence in just the reporting of that statistic?
Well, in order to do that, I need the data and I need the method that they used, just to repeat that,
that simple reproduction step. But it's hard to do that when you don't share data.
Now, are you saying they don't, just to jump in there, are you saying they don't share the methodology
either? I mean, so you can reproduce what they've done?
Right. So there's a lot of dimensions here that don't always get shared, which is interesting
because you think of that hallmark of science as, I share everything. And then that's actually,
that's where I have confidence in it. And there's a lot of nuance here. So I'll first say that the
high level statement, which is, yeah, across
the board, there's probably not as much sharing as one would expect or hope for. That includes the
data, the analysis, even the methods of collecting the data, the sources of data. It starts to get
really complex. And so when you don't have that information, you're stuck doing a couple things.
You just trust it at face value. It got published. It must be true. But that's not how science
works. Or you're left making assumptions to fill in the gaps. Both of those are not ideal. So one thing
that we can do is start sharing more. And you have to incentivize people to share data. It doesn't
mean it's more reliable. It just means that now you can interrogate it the way that Abel
was just talking about. Mm-hmm. Abel, you also just released a new study
analyzing economic and political research. I'm wondering how yours was different from SCORE.
What did you find? I think the results are much more positive, and there's a lot of reasons for that.
And I think the main one is we looked at recent papers. So we started this project in 2022,
so we looked at papers published in 2022 and up to the end of 2023.
And data sharing is going up.
There's less and less hesitancy, I would say,
in economics and political science in terms of data sharing.
That doesn't mean everything's perfect.
We still find, like, coding errors in 15 to 20% of papers.
Results are robust maybe 75% of the time.
So, you know, that's better than let's say 50% or some of the rates that were documented.
So I would say things are just getting better.
That masks a lot of things, though.
So for instance, at the Institute for Replication,
we mass reproduce studies in economics, political science,
now psychology, public health, environmental research,
and data sharing practices are completely different across fields.
So I would say things are getting better.
That's mostly what we document, I think.
But it's definitely far away from being perfect.
Are they getting better because we
recognize the problem? Or did they just get better for some other reason? I think that's part of the
story. I look back at my own research that I was doing back in 2010 as a master's student. And my
coding was terrible. And I just, I'm not laughing. I'm sorry. And that's fine. I laugh at it too
because sometimes I look back at it and I'm like, dear God, that was bad. But also I remind myself
that back then it was virtually impossible to look at other people's code.
There was no code online.
Whereas I look at my own students, PhD students nowadays, and they have access to so much, so much code from other researchers.
And coding, and coding for the layperson, what does that mean?
Yeah.
So, you know, let's say you're doing a study.
You're interested in the effect of like minimum wage on, I don't know, like unemployment rate or the effect of a policy on deforestation in the Amazonian rainforest,
of course. You're going to gather data, maybe satellite data, to understand, like, is there
deforestation happening? Then you have another data set on, like, the policy change, but then you
have some control variables to make sure that you get a causal effect. But then you need to merge
all these data together, so you need to code that in Excel, or it could be in R, Python, et cetera,
statistical software. But then you need to run the regression. You need to actually do the analysis
in the statistical software. And you can make stupid mistakes. It could be like duplicates,
replicates. You have the same individual again and again by accident. It can be you say something
in a paper, but actually in your codes, you did something else. You didn't really look at
deforestation using this specific way of measuring it. But I think just code review, like,
it's kind of crazy, but imagine you do a research paper. You have your research assistant doing the
coding. You're done. You submit this to a journal, and they accept it. And it went through peer
review, there are external experts that looked at it. And then it's published. During the entire process,
nobody ever looked at your data and code. They trust 100% what you've done. And what we're trying
to do is to go after that and be like, well, let's have a look at the data and code to see
whether there's errors and things like this. And you would think this happens, it should happen
throughout the process of publishing. But the norms are just not there. So things need to change. And I think
they will at some point because of AI.
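As a rough illustration of the duplicate problem described above, here is a minimal Python sketch of a check that a code reviewer might run before merging two data sets. Everything in it is an assumption made for illustration: the toy tables `outcomes` and `policy`, the column names, and the helper functions `find_duplicate_keys` and `merge_one_to_one` are hypothetical and do not come from either study.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return the key values that appear more than once in a list of dict rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def merge_one_to_one(left, right, key):
    """Merge two lists of dict rows on `key`, refusing to run if either side has duplicates."""
    for name, rows in (("left", left), ("right", right)):
        dupes = find_duplicate_keys(rows, key)
        if dupes:
            raise ValueError(f"{name} table has duplicate {key} values: {dupes}")
    right_by_key = {row[key]: row for row in right}
    # Keep only rows whose key is present on both sides (an inner join).
    return [{**row, **right_by_key[row[key]]} for row in left if row[key] in right_by_key]


# Hypothetical toy data: one region was entered twice by accident.
outcomes = [
    {"region": "A", "forest_loss_km2": 12.0},
    {"region": "B", "forest_loss_km2": 3.5},
    {"region": "B", "forest_loss_km2": 3.5},
]
policy = [
    {"region": "A", "policy_in_force": True},
    {"region": "B", "policy_in_force": False},
]

try:
    merge_one_to_one(outcomes, policy, "region")
except ValueError as err:
    print("Merge refused:", err)

# Drop the repeated rows (one row kept per region), then merge again.
deduped = list({row["region"]: row for row in outcomes}.values())
merged = merge_one_to_one(deduped, policy, "region")
print(len(merged), "rows after merging")
```

The sketch fails loudly instead of silently counting the same region twice, which is the kind of mistake that would otherwise flow straight into the regression.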
Yeah. Tim, what do you think of this?
Yeah.
You asked a really good question of like, is it just happening on its own or, you know,
is part of this discussion actually part of what's causing these changes?
Because I absolutely do agree.
I think things are improving.
And I think there's a couple reasons for it.
One is we're talking about it.
We're talking about it here, right, on Science Friday.
So this is an example of it getting to the point where it's more common to have these
discussions.
I completely agree with the point that
the norms are the biggest driver.
And in many cases, I wish my graduate education taught me how to replicate someone else's
results as part of my educational practice.
Interesting.
Because the best way to learn how to document the methods you used or the way that you
analyze your data is to repeat what someone else did and see if you can get the same result.
It's easier to see someone else's error than it is in yourself.
And the best way to do it is, yeah, replicate.
But it's not going to be just the researcher.
So to be really clear, like there's a lot of actors
in the system, right? The journals have a role. So as they change their policies, that helps.
Institutions have a role in this as well. What do they hold their researchers accountable to, as well as
how they train students? And funders have a role, right? A lot of this, if we're talking about the
US, that's largely NSF and NIH, but every funder has a role in terms of what they
are asking of their researchers and how they support the research. So, said differently: if you just
support the research and the paper, but you don't support that rigor behind it in terms of sharing
and documentation, then you don't get that rigor and documentation that now we're essentially
having a challenge as we sort back through the research one more time. We're having a hard time
finding things because it wasn't prioritized. Isn't that eventually going to bite you later?
You mean in the sense of like, I'm a researcher, I publish something. Yeah. Maybe share everything.
Yeah, I mean, so there's a couple thoughts here, right? So one would be, oh my gosh, there's somebody shady
hiding something. There's always bad actors out there. I'm sure that's the case.
I think a lot of this is honest, like just we're busy people.
This is really hard.
Research is really, really hard.
And I think the vast majority of what we're finding is just, wow, when I just kind of
rush through trying my best, but honest mistakes kind of, these little ones can pile up
over and over.
And especially since we're so driven by positive results and that kind of really positive
storytelling, it's very easy not to think that there's a mistake when you find something exciting.
It's only when you don't find what you expect.
That's when you scrutinize.
So I think that's actually what's going on for the most part.
At some point, we're investing in the wrong spot at the wrong point in time,
which is, again, when we get back to the point of replication of what DARPA was trying to do,
that's the big million dollar question.
How much confidence do I have in a given result at that given moment in time?
It's never going to be 100%.
After the break, how big a role is AI going to play in replication?
Stay with us.
AI was mentioned a little time ago in our discussion.
Tim, what role do you see AI
playing in the future of replication work?
That's a great question.
So I think I see two futures.
I can't quite tell which one we're going towards.
We're probably going towards both.
I see one where we're kind of entering it a bit,
which is it's easy to see AI generated anything these days.
And especially since a lot of the scientific process is communicated through
written word and journals,
it's very easy to have AI generate that,
which means it's really challenging, even more so,
to say, how replicable is this research when you're like, wait, did a human actually do this?
Or is this just AI generated language, right? Because it's really clever. So this is going to cause,
and it already is causing problems in terms of trying to understand what do we know from AI versus
non-AI. But I actually think there's a huge promise at the exact same time, right? So when we think
about some of the low-hanging fruit challenges we have, part of the challenge of figuring out how to
share your data is how to describe your data or where to deposit your data. Those can get
improved with AI. If we do have access to data and we have access to code, you can actually start
to have AI agents run the reproductions themselves. That's very simple. In fact, I already know tools
that can do that. If you give them the code and the data, they can run it themselves again. Now,
this is where it starts to go a little farther. If you want to have them develop AI agents that
can do plausible different analytical strategies, which I think is amazing, that robustness check,
well, we're getting there, right? There's a large universe of plausible analyses. The trick is
going to be do AI agents just do everything. And in that case, sometimes it's really good designs.
Other times it's just gobbledygook. Right. It's like, I don't know, they just threw a bunch of
variables into an algorithm and popped out an answer. But that would be an amazing tool because, again,
we as humans are really good at pattern recognition. But if we're only picking one analysis,
we know that's not looking at the possible universe of plausible approaches to test a hypothesis.
But AI could help us. All right. As I wrap up here, for people listening,
Abel, let me start with you.
What do you think the takeaway is for both your work and the SCORE project?
I think some people listening might think, what science headlines can I trust?
I think that's a fair statement, unfortunately.
And so the way I tell people, like the way I consume research personally is,
if I see a new result, something like innovative, like the first time that I hear about something,
I don't believe it.
And I wait until other researchers find
a similar result, again and again.
And maybe after three, four, five times, I start to believe it.
And there's nothing wrong with that, I think.
Like, I know we like headlines and we like progress,
but I think there's a cost to that.
And the cost could be that we start doing lots of research
along the same lines without making sure that actually the foundations of the initial result
are strong.
My personal problem is I don't know which result I can
really trust versus those that I cannot trust. And it's very annoying. And I'm patient because of that.
And I don't put all my eggs in the same basket. And I wait to see whether things replicate,
whether other researchers are going to find the same pattern and so on and so on.
The same way as I think during COVID, the first time there was a vaccine, people were like,
ah, you know, a bit skeptical. But then two or three companies came up with vaccines. And now you're
thinking, okay, maybe there's something to it. And I think it's the same for pretty
much anything in life. You just need things to be repeated and replicated. And that's how you build
confidence into a scientific result. Tim, you want to weigh in on that? Yeah. Science is a process.
It's really easy to forget that and it's really important to remember it, right? Each of our
findings, each paper, each headline we read about that's just a piece of a puzzle. We're trying our
best. We're humans and we're at the forefront of knowledge. It doesn't mean that somebody publishes a paper and
all of a sudden that is quote unquote the truth.
All we're doing is trying to get closer and closer
and sometimes going backwards is closer.
The second thing is think about all the amazing benefits of research
in our society around us every single day.
And we just told you it's really not that optimal, right?
It's not optimized that well by applying the scientific process
to how we do science.
And so to me, I think, wow, this is a great opportunity for us.
We're doing amazing things, and there is a lot more that we can do
if we can keep improving the way that we conduct and share our research.
Well, hopefully giving some light to it here on this show will help.
I want to thank both of you for taking time to be with us today.
Thank you so much.
You're welcome. Thanks for having us.
Dr. Tim Errington, Senior Director of Research at the Center for Open Science,
and Dr. Abel Brodeur, founder of the Institute for Replication.
This episode was produced by Dee Peterschmidt.
I'm Ira Flatow. Thanks for listening.
