Orchestrate all the Things - Scientific fact-checking using AI language models: COVID-19 research and beyond. Featuring Allen Institute for AI Researcher David Wadden
Episode Date: June 7, 2020
Fact or fiction? That's not always an easy question to answer. Incomplete knowledge, context and bias typically come into play. In the nascent domain of scientific fact-checking, things are complicated. If you think fact-checking is hard, which it is, then what would you say about verifying scientific claims, on COVID-19 no less? Hint: it's also hard -- different in some ways, similar in some others. Fact or Fiction: Verifying Scientific Claims is the title of a research paper published on the preprint server arXiv by a team of researchers from the Allen Institute for Artificial Intelligence (AI2), with data and code available on GitHub. In this backstage chat, David Wadden, lead author of the paper and a visiting researcher at AI2, and George Anadiotis connected to discuss the rationale, details, and directions for this work. Article published on ZDNet in May 2020.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
This is episode 5 of the podcast featuring
Allen Institute for AI visiting researcher David Wadden.
Wadden is the lead author of a research paper titled
Fact or Fiction? Verifying Scientific Claims.
The research paper details efforts to do scientific fact-checking
by collecting and annotating research findings and training AI language models to process them.
The methodology has been experimentally applied to COVID-19 related research.
Wadden and myself connected to discuss the rationale, details and directions for this work.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.
So maybe a good place to start, and a typical one also, would be if you would
like to share a little bit about yourself: your background, what you have
been working on up to now, what motivated
you to do this work, this kind of thing?
Yeah, sure.
So I am a grad student at the University of Washington studying natural language processing.
And I guess from my background, I've kind of bounced around a little bit. So in college, I studied
physics. And then after college, I tried a few things and then worked on computational biology
for a while. And then for graduate school, I wanted to try something a little different.
And I was interested in NLP, and NLP for science in particular, because from working in computational biology, it just seemed like there was this huge collection of experimental findings that people had made.
And there were some databases where people collected them, but they were kind of incomplete and not always that well maintained.
And it felt like we just needed automated tools to help us collect and reason over,
you know, these massive bodies of knowledge that we were accumulating.
So that was what inspired me to go into kind of NLP for science.
And this particular project was partly motivated by NLP for Science in general,
and partly, I guess, by my own experiences in the medical system. So I was
briefly considering a career as a singer, as an opera singer, and I had a vocal
injury where I had nerve damage on one
side of my throat. And basically, I went to a bunch of doctors, and they kind of told me about
a bunch of different things they could try to do. But even though these were very good doctors,
none of them had, off the top of their heads, very solid, exact data about
how well this treatment would work versus that one,
or what percentage of patients it was useful in. And this was something, you know, it wasn't a huge
deal because it was a vocal injury, but I kind of wondered, you know, what if I had cancer or some
really serious disease like that? And I was in the same position where a doctor was too busy to stay
up on the very latest research and I didn't have all of the information to make my decisions.
What would I want?
And ideally, I'd want a system where I could input questions
or claims about possible treatments
and it would give me the latest research supporting
and refuting the claims that I made.
So that was kind of what motivated this project.
And then the Allen Institute where I did my internship was kind of interested in similar
things.
And there were a lot of other researchers interested in that kind of stuff.
So we kind of talked a lot and this is what we came up with.
Okay.
Yeah.
It sounds like a very good mix, I would say, of different factors that motivated you, like personal
experience and then scientific background and kind of like the perfect storm in a way.
And having a background in computational biology may also, I'm kind of guessing, explain why you chose to target COVID-related
research specifically. I mean, you know, besides the obvious fact that it's very, very timely,
it sounds like it's at least partially, you know, related to your own background.
Yeah, so the COVID thing was actually, it was kind of lucky timing, because we started
this project back in October before the COVID thing, before COVID really had emerged. And so
we were thinking about this kind of independently, just as a way of kind of managing scientific
literature in general. And then this crisis came up and I felt like this was potentially a helpful application.
Yeah.
Okay.
And one thing, I mean, listening to you describe, you know, basically your own experience with
different doctors and different potential treatments, it sounds, you know, it echoes
experiences I have heard from other people, not necessarily in medical research,
although, you know, that's obviously very much the same, I would say, but also
from people doing research in whatever field, basically. The
overarching thing is that research has expanded to the level where it's
basically not even possible for a single person, no matter how
niche your domain may be, to even have an overview of everything that's being published
at this rate. So more and more,
it seems like people have come to the realization that the only way to even have a chance
of coping would be to actually automate the way people read research itself. And it seems like
that's your motivation as well. So with that in mind, that's why I also mentioned, you know,
a couple of other efforts that I'm aware of, specifically because
I think they may be somehow related to what you're trying to do.
That's why I wanted to have your view on what, you know,
these people are doing and how they're going about it.
So one of them is called COVID Graph,
and it's obviously related to the COVID outbreak.
And it's basically some people who have tried to gather all kinds of related research and structure it as a knowledge graph and try to navigate it as best as they can.
And there's another effort, again, based on knowledge graphs, which kind of predates that, and it's not COVID-specific,
but again, the idea is pretty much the same: get everything together in a knowledge graph and try to make it
navigable, let's say, for people to use.
What you are doing, the approach you are taking, so applying NLP, to me seems kind of orthogonal. And I think, you know, they could use your approach and you could use theirs.
And I'm wondering what your take is on that.
Yeah, I definitely think that's true.
So, yeah, I've definitely seen some of the knowledge graph-based approaches
for understanding COVID.
And I know there's a couple colleagues at the Allen Institute for AI
who are also doing
that sort of thing. So our approach, like I mentioned, wasn't initially for
COVID, but we're trying to apply it that way. And it might give
information that's complementary to what you'd get from a knowledge
graph. So maybe with a knowledge graph you have one node that is a drug and one
node that is a disease, and you have an edge between them. But a lot of times there won't be
an explanation for the edge, or you might want the actual snippets of text from the papers explaining
their relationship, annotating that edge. And I think maybe that's where something like
our system could potentially be useful: filling in more detail on the edges and
giving you fine-grained explanations for the relationships in the graph.
But like you say, the issue is that the approach we use
is definitely very difficult to scale.
Like our data set was pretty small because it requires pretty intensive human annotations from experts.
So we definitely, to make it more widely useful,
we definitely want to combine it with something like a larger scale
knowledge graph with more nodes and edges.
Yeah, actually, you know, the way they ingest information for these knowledge graphs is
semi-automated, let's say.
But yes, I think you're right.
So applying some kind of NLP to that could definitely help not only increase their precision, basically,
you know, as to what nodes and edges they leverage in the system, but also, as you very correctly
pointed out in terms of adding explanations, basically, so this edge here is here because of,
you know, this snippet, and so on. You did mention, and it was a thing that caught my eye
reading your paper, that a lot of what you do,
I would say, anyway, I'm not going to venture
to label it with a percentage,
but let's just say a lot of what you do
depends on manual annotation.
And I realize this is initial work venturing into this.
But I wonder if you have any thoughts on how you can scale it up
because it's very promising.
It's already, I would say, quite useful.
But to deal with the kind of body of research
that you eventually want to handle in the real world, you know, you have to scale it
up somehow, and I'm sure you have thoughts on that.
Yeah. So the way that we approach it: in the NLP field in general, the way that things have been going is that people have
been getting really large manually annotated data sets. So in 2016, the work in question answering got stimulated by the Stanford Question Answering Dataset,
which was something like 100,000 of these question answer pairs.
And then in fact-checking, there was what's called the FEVER data set,
which also had hundreds of thousands of claim-evidence pairs for Wikipedia.
And what powers those huge data sets is relying on crowd workers from Mechanical Turk.
And so that's a little trickier when you require expert knowledge like in our setting.
So our current solution was just to hire biology undergrads and grad students who we know could read and understand medical literature, and they did our annotations.
But you're right.
I guess the first question is, how much do we have to scale up in order to kind of get all the mileage we can out of big neural models?
Because at some point, we'll have enough data.
The models will have learned all they can and will saturate.
And then I guess, you know, once we know what that threshold is,
how do we actually collect the annotations?
So there's a few different approaches that people have tried. There's a group at Northeastern that has done a lot of work
collecting annotations for medical literature.
And one approach they've used is to hire doctors
through this online
platform called Upwork. They've had success with that. So we can try that. Another approach that
people have tried is you just hire Turkers, but you hire a bunch of Turkers, like five or 10
Turkers for each question. And then you use some scheme, knowing that each one will be
individually noisy, to average the answers together.
And that generally seems to improve quality.
So I guess those are both options.
And then another thing maybe would just be, I guess if we create a website for this, just ask people, don't pay them, just ask medical experts to contribute their own time and try to crowdsource it that way.
And I think some groups have had success with that,
but that's obviously more challenging.
So those are some of the things we're thinking about.
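To make the redundancy idea concrete, here is a minimal sketch, not from the paper, of aggregating noisy labels from several annotators by majority vote, with ties flagged for expert adjudication. The label names and the function are hypothetical.

```python
from collections import Counter

def aggregate_labels(annotations, min_margin=1):
    """Majority-vote aggregation of noisy crowd labels for one claim/abstract pair.

    annotations: labels from independent annotators, e.g.
        ["SUPPORTS", "SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "SUPPORTS"]
    Returns the winning label, or None when the vote is too close to call,
    in which case the item would be routed to an expert for adjudication.
    """
    counts = Counter(annotations).most_common()
    if len(counts) == 1:
        return counts[0][0]
    (top_label, top_count), (_, runner_up) = counts[0], counts[1]
    return top_label if top_count - runner_up >= min_margin else None

# Five hypothetical Turkers label the same claim/abstract pair.
print(aggregate_labels(["SUPPORTS", "SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "SUPPORTS"]))
```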
Okay, nice.
So this is one of the challenges with manual annotation.
Another one I would point out is, well,
basically the fact that it can, at least, introduce bias, in the sense that, you know, the kind of questions that you ask carry a kind of natural bias.
They tend to focus your attention. So I wonder if you have a specific idea or methodology for deciding
what is an important question to ask?
What is an important claim to prove or disprove?
Yeah.
So I think one nice thing about the way we collected our data set
is that all of our claims come from citation sentences. And because of that, I think
we've removed a lot of the bias that annotators might have in terms of what kind of claims they
would write. Like if I were an expert on lung cancer, maybe I would only write claims about
lung cancer. But because the claims are based on just showing people citations that have been randomly selected from the literature, we think we should be getting essentially a distribution of claims
that reflects the claims that people are making in the actual literature.
Does that kind of address what you were thinking about?
Yeah, yeah, it does.
At least partially. It does in the sense that if you did the kind of random selection of what claims you would include, then yes, I do see how that may eliminate bias.
But that doesn't necessarily mean these are the most important questions to ask. So I wonder if you have any ideas about that, because in a paper,
you may have a number of citations. Not all of them are necessarily of the same importance.
And I think you mentioned at some point that you use abstracts, only the text from abstracts
in terms of justifications.
And you also have a rationale there.
So you mentioned that it's quite complicated, basically,
to get justifications from multiple citations, which is totally true.
I totally get it.
But again, I wonder if you have any ideas of how you may expand the system in a way that's able to do that in the future.
Because tying a result to a single reference as a justification is not always doable, basically.
Yeah.
So I think there are a couple of questions there. And I think the first one is, how do you know whether the citations you're using are actually salient? The way we do it is that we select our cited papers first, and then we randomly select citations that reference those papers.
And so for our cited papers, we only choose papers that have been cited at least 10 times, I think, just as some guarantee that somebody's read this paper and somebody's found it useful.
So I'm not sure 10 is the right number,
but one kind of simple thing you can do is just set a, you know,
a filter on the number of times a paper's been cited.
So that's one thing.
I guess another thing is what we ask people to do.
There's work people do on what's called citation intent
classification, which asks: when you cited this paper,
did you cite it because you're using a method from that paper, like maybe you're
using a hidden Markov model, or are you citing the paper as background, or are
you citing it because you're building on a finding in that paper? So there are different intents.
And when people were writing these questions, we asked them to skip method citations and
only keep the citations where the authors were stating a finding from a different paper.
So that's another way that we tried to focus on salient findings
rather than other types of citations that you might find.
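To make the selection procedure described here a bit more concrete, here is a minimal sketch of the two filters: keep only citances whose cited papers pass a citation-count threshold, and skip citations classified as method citations. The data layout, field names, and the intent classifier are placeholders, not the paper's actual code.

```python
MIN_CITATIONS = 10  # threshold mentioned in the conversation

def select_citances(citances, citation_counts, classify_intent):
    """Filter citation sentences before showing them to annotators.

    citances: dicts like {"text": ..., "cited_paper_ids": [...]} (assumed layout)
    citation_counts: paper id -> number of incoming citations
    classify_intent: callable returning "method", "background", or "result"
    """
    selected = []
    for citance in citances:
        # Keep only citances whose cited papers are reasonably well cited.
        if not all(citation_counts.get(pid, 0) >= MIN_CITATIONS
                   for pid in citance["cited_paper_ids"]):
            continue
        # Skip method citations; keep those stating a finding from another paper.
        if classify_intent(citance["text"]) == "method":
            continue
        selected.append(citance)
    return selected
```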
And I think you had one more question about how do you find
multiple sources of evidence for a single claim.
Is that right?
Yeah, yeah, indeed.
Yeah, so one kind of good thing about this is that a lot of citations
actually will cite multiple papers.
The way that our data set is constructed, we construct claims based on citations.
And one nice thing is that many citations cite more than one paper.
You know, a claim like smoking causes lung cancer or something, there could be dozens of papers that had that same finding. And so if you
have a citation about something like that, it will cite at least some of those papers.
And the way that we set it up is that we actually ask annotators to identify evidence in all of the
abstracts cited by a given citation. And then the other thing is that the way
the task is set up,
it basically works at the level of claims and abstracts.
So given a claim, it treats each abstract independently
in terms of making a decision as to whether it's evidence
or not for that claim.
And so the system should be able to just,
if there's multiple abstracts that provide evidence,
it just makes independent decisions for each of those.
And if they all provide evidence, they all get returned.
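Schematically, the abstract-level setup described here could look like the sketch below, where `label_abstract` stands in for the trained model and is purely illustrative:

```python
def verify_claim(claim, abstracts, label_abstract):
    """Judge each abstract independently against the claim; return all evidence.

    label_abstract(claim, abstract) is assumed to return a
    (label, rationale_sentences) pair, with label in
    {"SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO"}.
    """
    evidence = []
    for abstract in abstracts:
        label, rationale = label_abstract(claim, abstract)
        if label != "NOT_ENOUGH_INFO":
            evidence.append({"abstract_id": abstract["id"],
                             "label": label,
                             "rationale": rationale})
    # Every abstract that independently provides evidence is returned.
    return evidence
```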
Yeah.
Okay.
Got it.
All right.
So you've been very efficient, actually,
because while listening to your answers, I have had the questions I prepared listed in front of me.
And it seems, you know, even though we addressed them in a kind of half-hearted way, I think we got most of them answered, either implicitly or explicitly. And I think probably the one that we haven't
is the part about future work,
even though we kind of touched upon that already.
And to make it a bit more concrete,
because in terms of research directions,
as I said, you already kind of touched upon that.
I was wondering if you have any plans in terms of, because I think you're
a visiting researcher at the Allen Institute, so if you're going to be extending your stay there,
let's say, if you're going to be working with the same team, if there's interest from other people
in joining you and, you know, following up on that work and so on. And actually, I should have touched on it in the first place.
So this is a preprint.
Have you submitted it for publication already somewhere?
Are you waiting to hear back?
Yeah.
So for the first question, yes.
So I'm a grad student at the University of Washington.
And basically, the UW has a pretty close partnership with the Allen
Institute for AI. And, you know, I was really lucky to get to work there, kind of because my
advisor has a position there and knew that I was interested in scientific NLP and kind of said,
this team works on things that you're interested in, why don't you spend some time there? So that was really nice of her and of them to let me come work with them.
But what's nice is that I can keep doing the same thing and collaborating with the
same people, and the only thing that really changes is, I guess, the official title: that
I'm a UW grad student instead of an intern. As for directions for
continuing to work, I think one obvious one that we already talked about is just making more data
and figuring out good ways to do the crowdsourcing that allows it to scale, but also allows us to
collect some larger number of annotations. Two other questions in terms of kind of collecting annotations that
we're thinking about are what we call kind of background evidence and partial support. So the
partial support issue is suppose that there's some mouse model of cancer and there's a paper that finds that, you know, some drug
is effective for this given mouse model of cancer. So that's my cited paper. And then I have some
citation that says, in this other paper, they find that drug X is a good treatment for cancer in humans.
So the claim talks about cancer in humans, and the paper it cites talks about cancer in mice.
So in that situation, if they're talking about the same drug, then the fact that the drug works in mice provides some support for the fact that it might work in humans, but it's definitely not conclusive. And so what you'd want is some way to reflect the incomplete overlap. And so what we
want to have people do is when they get situations like that, where the context of the claim and the
evidence doesn't quite match, we wanted them to try to edit the claim so that the edited claim
matches the context of the evidence. And then you could have an
additional NLP task where you could ask the system not only does this evidence
support the claim but also do you have to edit the claim at all in order to
make the support a good fit. And so we tried that and basically what we found
is that there is relatively high disagreement on whether a claim
actually received full support, or there was a mismatch and you needed to rewrite it, or
whether there was no support at all. And so one thing that we're interested in doing is going back
and improving our annotation guidelines to better figure out when this partial situation is
occurring so that we can add those to our data set.
So that's one thing we want to work on.
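As a purely illustrative example of what a partial-support annotation might record (the field names and labels here are hypothetical, not the dataset's actual schema):

```python
partial_support_example = {
    "original_claim": "Drug X is an effective treatment for cancer in humans.",
    "evidence_context": "Finding from a mouse model of cancer.",
    # Annotators edit the claim so it matches the context of the evidence.
    "edited_claim": "Drug X is an effective treatment in a mouse model of cancer.",
    "label": "PARTIAL_SUPPORT",  # hypothetical label, distinct from full support
}
```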
Another thing is kind of dealing with what I would say is, I guess,
kind of long-range co-reference relationships in evidence sentences that we find.
So in a structured abstract, you'll a lot of times have an intro section and then a method section, and later you'll have a result section.
And so you'll often find something where in the method section, it says something like the intervention group was given Lipitor for cholesterol management and the primary outcome was cholesterol level.
And then, you know, a couple of paragraphs later in the results section, it'll say,
the intervention group experienced a 20% reduction on the primary outcome relative to the control
group. And so in that result sentence, like that's the evidence that supports your claim about
Lipitor and cholesterol.
But in that sentence, it doesn't actually mention Lipitor or cholesterol at all.
It just talks about the intervention group and the primary outcomes.
And so basically, other people have run into this.
And our current solution is just to kind of punt and ignore the co-reference issues and
just take that main sentence announcing the finding. But we want to figure out some better way to model the fact that a lot of times
you can't find a single self-contained sentence with the answer; you have to
pull in context from other points in the document. And we also had some work on having
annotators annotate the extra context,
but again, agreement was low.
So we're interested in that too.
I guess, oh, one more thing that I forgot to mention is assessing the credibility of
the paper or kind of the strength of the evidence.
So right now, you know, we kind of weight a preprint that was just published on bioRxiv the same as
something that was published in Nature or Cell or the New England Journal of Medicine, and we would like to
actually combine this with, you mentioned earlier, knowledge graphs. So one place where knowledge
graphs could be useful, maybe, is that you could use them to compute measures of the importance or the strength of a given paper.
And maybe we could weight the evidence of each paper by its importance as computed based on some kind of knowledge graph representation.
So those are all. Yeah, go ahead. Sorry.
No, I was going to say this is a very reasonable avenue to pursue.
And it's my omission, because I should have asked about it myself.
It kind of struck me while reading this.
So you could, you probably should, actually build something like a PageRank, let's say,
of references and citations and so on.
And actually, I'm pretty sure I've seen a couple of models
for doing that around things like calculating journal impact factors
and so on.
So you could probably use those.
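A very rough sketch of this idea, assuming a citation graph is available and using networkx (a tooling choice of mine, not something the paper does): compute PageRank over the graph and weight each piece of evidence by the score of its source paper.

```python
import networkx as nx

def paper_importance(citation_edges):
    """PageRank-style importance for each paper.

    citation_edges: iterable of (citing_paper_id, cited_paper_id) pairs.
    """
    graph = nx.DiGraph()
    graph.add_edges_from(citation_edges)
    return nx.pagerank(graph, alpha=0.85)

def weighted_evidence_score(evidence, importance):
    """Weight each piece of evidence by the importance of its source paper.

    evidence: dicts like {"paper_id": ..., "score": ...} (assumed layout).
    """
    return sum(e["score"] * importance.get(e["paper_id"], 0.0) for e in evidence)
```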
Yeah, yeah.
I think at the end of the preprint,
we definitely talk about how really this is one very small step,
and that we're focusing on just making a judgment based on taking the
text as it is, trusting what's written in the abstracts.
But like you say,
in order to really build a good evidence
synthesis system that you could actually use to make real decisions,
you need to,
like you said, incorporate importance using a PageRank-type algorithm and stuff like that.
Yeah. So, to wrap this up, let's address the final question, which is: have you
actually submitted this paper for publication anywhere?
Are you waiting to hear back?
Right, yeah.
So the way it works in the NLP community is that there's a one-month anonymity period.
So you post on arXiv a month before the conference deadline,
and then for that intervening month, you can't post anything more, and normally you can't discuss or announce the work at all. And this is with the hope that when it's submitted to the conference,
it will be somewhat anonymous, and the reviewers won't be able to figure out who's writing it.
So we're definitely going to submit it, and I'm not going to name the conference because I don't
know the exact rules about the anonymity period.
But yeah, we're going to submit it.
Okay, okay.
Well, then good luck with your submission.
I mean, I don't know the acceptance rates and even the specific venue you're targeting,
but to me that's interesting work,
and even the novelty aspects should give you
some points, let's say. Hopefully, yeah.
I hope you enjoyed the podcast. If you like my work,
you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.