Orchestrate all the Things - Scientific fact-checking using AI language models: COVID-19 research and beyond. Featuring Allen Institute for AI Researcher David Wadden
Episode Date: June 7, 2020
Fact or fiction? That's not always an easy question to answer. Incomplete knowledge, context and bias typically come into play. In the nascent domain of scientific fact-checking, things are complicated. If you think fact-checking is hard, which it is, then what would you say about verifying scientific claims, on COVID-19 no less? Hint: it's also hard -- different in some ways, similar in some others. Fact or Fiction: Verifying Scientific Claims is the title of a research paper published on the preprint server arXiv by a team of researchers from the Allen Institute for Artificial Intelligence (AI2), with data and code available on GitHub. In this backstage chat, David Wadden, lead author of the paper and a visiting researcher at AI2, and George Anadiotis connected to discuss the rationale, details, and directions for this work. Article published on ZDNet in May 2020.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
This is episode 5 of the podcast featuring
Allen Institute for AI visiting researcher David Wadden.
Wadden is the lead author of a research paper titled
Fact or Fiction? Verifying Scientific Claims.
The research paper details efforts to do scientific fact-checking
by collecting and annotating research findings and training AI language models to process them.
The methodology has been experimentally applied to COVID-19 related research.
Wadden and myself connected to discuss the rationale, details and directions for this work.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.
So maybe a good place to start, and a typical one also, would be if you would
like to share a little bit about yourself: your background, what you have
been working on up to now, what motivated
you to do this work, this kind of thing?
Yeah, sure.
So I am a grad student at the University of Washington studying natural language processing.
And I guess from my background, I've kind of bounced around a little bit. So in college, I studied
physics. And then after college, I tried a few things and then worked on computational biology
for a while. And then for graduate school, I wanted to try something a little different.
And I was interested in NLP, and NLP for science in particular, because from working in computational biology, it just seemed like there was this huge collection of experimental findings that people had made.
And there were some databases where people collected them, but they were kind of incomplete and not always that well maintained.
And it felt like we just needed automated tools to help us collect and reason over,
you know, these massive bodies of knowledge that we were accumulating.
So that was what inspired me to go into kind of NLP for science.
And this particular project was partly motivated by NLP for Science in general,
and partly, I guess, by my own experiences in the medical system. So I was
briefly considering a career as a singer, as an opera singer, and I had a vocal
injury where I had nerve damage on one
side of my throat. And basically, I went to a bunch of doctors, and they kind of told me about
a bunch of different things they could try to do. But even though these were very good doctors,
none of them had, off the top of their heads, very solid, exact data about
how well this treatment would work versus that one,
or what percentage of patients it was useful in. And this was something, you know, it wasn't a huge
deal because it was a vocal injury, but I kind of wondered, you know, what if I had cancer or some
really serious disease like that? And I was in the same position where a doctor was too busy to stay
up on the very latest research and I didn't have all of the information to make my decisions.
What would I want?
And ideally, I'd want a system where I could input questions
or claims about possible treatments
and it would give me the latest research supporting
and refuting the claims that I made.
So that was kind of what motivated this project.
And then the Allen Institute where I did my internship was kind of interested in similar
things.
And there were a lot of other researchers interested in that kind of stuff.
So we kind of talked a lot and this is what we came up with.
Okay.
Yeah.
It sounds like a very good mix, I would say, of different factors that motivated you, like personal
experience and then scientific background and kind of like the perfect storm in a way.
And having a background in computational biology may also, I'm kind of guessing, explain why you chose to target COVID-related
research specifically. I mean, you know, besides the obvious fact that it's very, very timely,
it sounds like it's at least partially, you know, related to your own background.
Yeah, so the COVID thing was actually, it was kind of lucky timing, because we started
this project back in October before the COVID thing, before COVID really had emerged. And so
we were thinking about this kind of independently, just as a way of kind of managing scientific
literature in general. And then this crisis came up and I felt like this was potentially a helpful application.
Yeah.
Okay.
And one thing, I mean, listening to you describe, you know, basically your own experience with
different doctors and different potential treatments, it sounds, you know, it echoes
experiences I have heard from other people, not necessarily in medical research,
although, you know, that's obviously very much the same, I would say, but also
from people doing research in whatever field, basically. The
overarching thing is that research has expanded to the level where it's
basically not even possible for a single person, no matter how
niche your domain may be, to even have an overview of everything that's being published
at this rate. So more and more,
it seems like people have come to the realization that the only way to even have a chance
of coping would be to actually automate the way people read research itself. And it seems like
that's your motivation as well. So with that in mind, that's why I also mentioned, you know,
a couple of other efforts that I'm aware of, specifically because
I think they may be somehow related to what you're trying to do.
That's why I wanted to have your view on what, you know,
these people are doing and how they're going about it.
So one of them is called COVID Graph,
and it's obviously related to the COVID outbreak.
And it's basically some people who have tried to gather all kinds of related research and structure it as a knowledge graph and try to navigate it as best as they can.
And there's another effort, again, based on knowledge graphs, which kind of predates that, and it's not COVID-specific,
but again, the idea is pretty much the same: get everything together in a knowledge graph and try to make it
navigable, let's say, for people to use.
What you are doing, the approach you are taking, so applying NLP, to me seems kind of orthogonal. And I think, you know, they could use your approach and you could use theirs.
And I'm wondering what your take is on that.
Yeah, I definitely think that's true.
So, yeah, I've definitely seen some of the knowledge graph-based approaches
for understanding COVID.
And I know there's a couple colleagues at the Allen Institute for AI
who are also doing
that sort of thing. So our approach, like I mentioned, wasn't initially for
COVID, but we're trying to apply it that way. And it might give
information that's complementary to what you'd get from a knowledge
graph. So maybe with a knowledge graph you have one node that is a drug and one
node that is a disease, and you have an edge between them. But a lot of times there won't be
an explanation for the edge, or you might want the actual snippets of text from the papers explaining
their relationship, annotating that edge. And I think maybe that's where something like
our system could potentially be useful: filling in more detail on the edges and
giving you fine-grained explanations for the relationships in the graph.
But like you say, the issue is that the approach we use
is definitely very difficult to scale.
Like our data set was pretty small because it requires pretty intensive human annotations from experts.
So we definitely, to make it more widely useful,
we definitely want to combine it with something like a larger scale
knowledge graph with more nodes and edges.
Yeah, actually, you know, the way they ingest information for these knowledge graphs is
semi-automated, let's say.
But yes, I think you're right.
So applying some kind of NLP to that could definitely help not only increase their precision, basically,
you know, as to what nodes and edges they leverage in the system, but also, as you very correctly
pointed out in terms of adding explanations, basically, so this edge here is here because of,
you know, this snippet, and so on. You did mention, and it was a thing that caught my eye
reading your paper, that a lot of what you do,
I would say, anyway, I'm not going to venture
to label it with a percentage,
but let's just say a lot of what you do
depends on manual annotation.
And I realize this is initial work venturing into this.
But I wonder if you have any thoughts on how you can scale it up
because it's very promising.
It's already, I would say, quite useful.
But to deal with the kind of body of research
that you eventually want to handle in the real world, you know, you have to scale it
up somehow, and I'm sure you have thoughts on that.
Yeah. So the way that we approach it: in the NLP field in general, the way that things have been going is that people have
been getting really large manually annotated data sets. So in 2016, the work in question answering got stimulated by the Stanford Question Answering Dataset,
which was something like 100,000 of these question answer pairs.
And then in fact-checking, there was what's called the FEVER data set,
which also had hundreds of thousands of claim-evidence pairs for Wikipedia.
And what powers those huge data sets is relying on crowd workers from Mechanical Turk.
And so that's a little trickier when you require expert knowledge like in our setting.
So our current solution was just to hire biology undergrads and grad students who we know could read and understand medical literature, and they did our annotations.
But you're right.
I guess the first question is, how much do we have to scale up in order to kind of get all the mileage we can out of big neural models?
Because at some point, we'll have enough data.
The models will have learned all they can and will saturate.
And then I guess, you know, once we know what that threshold is,
how do we actually collect the annotations?
So there's a few different approaches that people have tried. There's a group at Northeastern that has done a lot of work
collecting annotations for medical literature.
And one approach they've used is to hire doctors
through this online
platform called Upwork. They've had success with that. So we can try that. Another approach that
people have tried is you just hire Turkers, but you hire a bunch of Turkers, like five or 10
Turkers for each question. And then you use some scheme, knowing that each one will be
individually noisy, to average the answers together.
And that generally seems to improve quality.
So I guess those are both options.
And then another thing maybe would just be, I guess if we create a website for this, just ask people, don't pay them, just ask medical experts to contribute their own time and try to crowdsource it that way.
And I think some groups have had success with that,
but that's obviously more challenging.
So those are some of the things we're thinking about.
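To make the redundancy idea concrete, here is a minimal sketch, not from the paper, of aggregating noisy labels from several annotators by majority vote, with ties flagged for expert adjudication. The label names and the function are hypothetical.

```python
from collections import Counter

def aggregate_labels(annotations, min_margin=1):
    """Majority-vote aggregation of noisy crowd labels for one claim/abstract pair.

    annotations: labels from independent annotators, e.g.
        ["SUPPORTS", "SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "SUPPORTS"]
    Returns the winning label, or None when the vote is too close to call,
    in which case the item would be routed to an expert for adjudication.
    """
    counts = Counter(annotations).most_common()
    if len(counts) == 1:
        return counts[0][0]
    (top_label, top_count), (_, runner_up) = counts[0], counts[1]
    return top_label if top_count - runner_up >= min_margin else None

# Five hypothetical Turkers label the same claim/abstract pair.
print(aggregate_labels(["SUPPORTS", "SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "SUPPORTS"]))
```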
Okay, nice.
So this is one of the challenges with manual annotation.
Another one I would point out is, well,
basically the fact that it can, at least, introduce bias, in the sense that, you know, the kind of questions that you ask carry a kind of natural bias.
They tend to focus your attention. So I wonder if you have a specific idea or methodology for deciding
what is an important question to ask?
What is an important claim to prove or disprove?
Yeah.
So I think one nice thing about the way we collected our data set
is that all of our claims come from citation sentences. And because of that, I think
we've removed a lot of the bias that annotators might have in terms of what kind of claims they
would write. Like if I were an expert on lung cancer, maybe I would only write claims about
lung cancer. But because the claims are based on just showing people citations that have been randomly selected from the literature, we think we should be getting essentially a distribution of claims
that reflects the claims that people are making in the actual literature.
Does that kind of address what you were thinking about?
Yeah, yeah, it does.
At least partially. It does in the sense that if you did the kind of random selection of what claims you would include, then yes, I do see how that may eliminate bias.
But that doesn't necessarily mean these are the most important questions to ask. So I wonder if you have any ideas about that, because in a paper,
you may have a number of citations. Not all of them are necessarily of the same importance.
And I think you mentioned at some point that you use abstracts, only the text from abstracts
in terms of justifications.
And you also have a rationale there.
So you mentioned that it's quite complicated, basically,
to get justifications from multiple citations, which is totally true.
I totally get it.
But again, I wonder if you have any ideas of how you may expand the system in a way that's able to do that in the future.
Because tying a result to a single reference as a justification is not always doable, basically.
Yeah.
So I think there are a couple of questions there. And I think the first one is, how do you know whether the citations you're using are actually salient? The way we do it is that we select our cited papers first, and then we randomly select citations that reference those papers.
And so for our cited papers, we only choose papers that have been cited at least 10 times, I think, just as some guarantee that somebody's read this paper and somebody's found it useful.
So I'm not sure 10 is the right number,
but one kind of simple thing you can do is just set a, you know,
a filter on the number of times a paper's been cited.
So that's one thing.
I guess another thing is what we ask people to do.
There's work people do on what's called citation intent
classification, which asks: when you cited this paper,
did you cite it because you're using a method from that paper, like maybe you're
using a hidden Markov model, or are you citing the paper as background, or are
you citing it because you're building on a finding in that paper? So there are different intents.
And when people were writing these questions, we asked them to skip method citations and
only keep the citations where the authors were stating a finding from a different paper.
So that's another way that we tried to focus on salient findings
rather than other types of citations that you might find.
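To make the selection procedure described here a bit more concrete, here is a minimal sketch of the two filters: keep only citances whose cited papers pass a citation-count threshold, and skip citations classified as method citations. The data layout, field names, and the intent classifier are placeholders, not the paper's actual code.

```python
MIN_CITATIONS = 10  # threshold mentioned in the conversation

def select_citances(citances, citation_counts, classify_intent):
    """Filter citation sentences before showing them to annotators.

    citances: dicts like {"text": ..., "cited_paper_ids": [...]} (assumed layout)
    citation_counts: paper id -> number of incoming citations
    classify_intent: callable returning "method", "background", or "result"
    """
    selected = []
    for citance in citances:
        # Keep only citances whose cited papers are reasonably well cited.
        if not all(citation_counts.get(pid, 0) >= MIN_CITATIONS
                   for pid in citance["cited_paper_ids"]):
            continue
        # Skip method citations; keep those stating a finding from another paper.
        if classify_intent(citance["text"]) == "method":
            continue
        selected.append(citance)
    return selected
```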
And I think you had one more question about how do you find
multiple sources of evidence for a single claim.
Is that right?
Yeah, yeah, indeed.
Yeah, so one kind of good thing about this is that a lot of citations
actually will cite multiple papers.
The way that our data set is constructed, we construct claims based on citations.
And one nice thing is that many citations cite more than one paper.
You know, a claim like smoking causes lung cancer or something, there could be dozens of papers that had that same finding. And so if you
have a citation about something like that, it will cite at least some of those papers.
And the way that we set it up is that we actually ask annotators to identify evidence in all of the
abstracts cited by a given citation. And then the other thing is that the way
the task is set up,
it basically works at the level of claims and abstracts.
So given a claim, it treats each abstract independently
in terms of making a decision as to whether it's evidence
or not for that claim.
And so the system should be able to just,
if there's multiple abstracts that provide evidence,
it just makes independent decisions for each of those.
And if they all provide evidence, they all get returned.
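Schematically, the abstract-level setup described here could look like the sketch below, where `label_abstract` stands in for the trained model and is purely illustrative:

```python
def verify_claim(claim, abstracts, label_abstract):
    """Judge each abstract independently against the claim; return all evidence.

    label_abstract(claim, abstract) is assumed to return a
    (label, rationale_sentences) pair, with label in
    {"SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO"}.
    """
    evidence = []
    for abstract in abstracts:
        label, rationale = label_abstract(claim, abstract)
        if label != "NOT_ENOUGH_INFO":
            evidence.append({"abstract_id": abstract["id"],
                             "label": label,
                             "rationale": rationale})
    # Every abstract that independently provides evidence is returned.
    return evidence
```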
Yeah.
Okay.
Got it.
All right.
So you've been very efficient, actually,
because while listening to your answers, I have had the questions I prepared listed in front of me.
And it seems, you know, even though we addressed them in a kind of half-hearted way, I think we got most of them answered, either implicitly or explicitly. And I think probably the one that we haven't
is the part about future work,
even though we kind of touched upon that already.
And to make it a bit more concrete,
because in terms of research directions,
as I said, you already kind of touched upon that.
I was wondering if you have any plans in terms of, because I think you're
a visiting researcher at the Allen Institute, so if you're going to be extending your stay there,
let's say, if you're going to be working with the same team, if there's interest from other people
in joining you and, you know, following up on that work and so on. And actually, I should have touched on it in the first place.
So this is a preprint.
Have you submitted it for publication already somewhere?
Are you waiting to hear back?
Yeah.
So for the first question, yes.
So I'm a grad student at the University of Washington.
And basically, the UW has a pretty close partnership with the Allen
Institute for AI. And, you know, I was really lucky to get to work there, kind of because my
advisor has a position there and knew that I was interested in scientific NLP and kind of said,
this team works on things that you're interested in, why don't you spend some time there? So that was really nice of her and of them to let me come work with them.
But what's nice is that I can keep doing the same thing and collaborating with the
same people, and the only thing that really changes is, I guess, the official title: that
I'm a UW grad student instead of an intern. As for directions for
continuing to work, I think one obvious one that we already talked about is just making more data
and figuring out good ways to do the crowdsourcing that allows it to scale, but also allows us to
collect some larger number of annotations. Two other questions in terms of kind of collecting annotations that
we're thinking about are what we call kind of background evidence and partial support. So the
partial support issue is suppose that there's some mouse model of cancer and there's a paper that finds that, you know, some drug
is effective for this given mouse model of cancer. So that's my cited paper. And then I have some
citation that says, in this other paper, they find that drug X is a good treatment for cancer in humans.
So the claim talks about cancer in humans, and the paper it cites talks about cancer in mice.
So in that situation, if they're talking about the same drug, then the fact that the drug works in mice provides some support for the fact that it might work in humans, but it's definitely not conclusive. And so what you'd want is some way to reflect the incomplete overlap. And so what we
want to have people do is when they get situations like that, where the context of the claim and the
evidence doesn't quite match, we wanted them to try to edit the claim so that the edited claim
matches the context of the evidence. And then you could have an
additional NLP task where you could ask the system not only does this evidence
support the claim but also do you have to edit the claim at all in order to
make the support a good fit. And so we tried that and basically what we found
is that there is relatively high disagreement on whether a claim
actually received full support, or there was a mismatch and you needed to rewrite it, or
whether there was no support at all. And so one thing that we're interested in doing is going back
and improving our annotation guidelines to better figure out when this partial situation is
occurring so that we can add those to our data set.
So that's one thing we want to work on.
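As a purely illustrative example of what a partial-support annotation might record (the field names and labels here are hypothetical, not the dataset's actual schema):

```python
partial_support_example = {
    "original_claim": "Drug X is an effective treatment for cancer in humans.",
    "evidence_context": "Finding from a mouse model of cancer.",
    # Annotators edit the claim so it matches the context of the evidence.
    "edited_claim": "Drug X is an effective treatment in a mouse model of cancer.",
    "label": "PARTIAL_SUPPORT",  # hypothetical label, distinct from full support
}
```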
Another thing is kind of dealing with what I would say is, I guess,
kind of long-range co-reference relationships in evidence sentences that we find.
So in a structured abstract, you'll a lot of times have an intro section and then a method section, and later you'll have a result section.
And so you'll often find something where in the method section, it says something like the intervention group was given Lipitor for cholesterol management and the primary outcome was cholesterol level.
And then, you know, a couple of paragraphs later in the results section, it'll say,
the intervention group experienced a 20% reduction on the primary outcome relative to the control
group. And so in that result sentence, like that's the evidence that supports your claim about
Lipitor and cholesterol.
But in that sentence, it doesn't actually mention Lipitor or cholesterol at all.
It just talks about the intervention group and the primary outcomes.
And so basically, other people have run into this.
And our current solution is just to kind of punt and ignore the co-reference issues and
just take that main sentence announcing the finding. But we want to figure out some better way to model the fact that a lot of times
you can't find a single self-contained sentence with the answer; you have to
pull in context from other points in the document. And we also had some work on having
annotators annotate the extra context,
but again, agreement was low.
So we're interested in that too.
I guess, oh, one more thing that I forgot to mention is assessing the credibility of
the paper or kind of the strength of the evidence.
So right now, you know, we kind of weight a preprint that was just published on bioRxiv the same as
something that was published in Nature or Cell or the New England Journal of Medicine, and we would like to
actually combine this with, you mentioned earlier, knowledge graphs. So one place where knowledge
graphs could be useful, maybe, is that you could use them to compute measures of the importance or the strength of a given paper.
And maybe we could weight the evidence of each paper by its importance as computed based on some kind of knowledge graph representation.
So those are all. Yeah, go ahead. Sorry.
No, I was going to say this is a very reasonable avenue to pursue.
And it's my omission, because I should have asked about it myself.
It kind of struck me while reading this.
So you could, you probably should, actually build something like a PageRank, let's say,
of references and citations and so on.
And actually, I'm pretty sure I've seen a couple of models
for doing that around things like calculating journal impact factors
and so on.
So you could probably use those.
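A very rough sketch of this idea, assuming a citation graph is available and using networkx (a tooling choice of mine, not something the paper does): compute PageRank over the graph and weight each piece of evidence by the score of its source paper.

```python
import networkx as nx

def paper_importance(citation_edges):
    """PageRank-style importance for each paper.

    citation_edges: iterable of (citing_paper_id, cited_paper_id) pairs.
    """
    graph = nx.DiGraph()
    graph.add_edges_from(citation_edges)
    return nx.pagerank(graph, alpha=0.85)

def weighted_evidence_score(evidence, importance):
    """Weight each piece of evidence by the importance of its source paper.

    evidence: dicts like {"paper_id": ..., "score": ...} (assumed layout).
    """
    return sum(e["score"] * importance.get(e["paper_id"], 0.0) for e in evidence)
```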
Yeah, yeah.
I think at the end of the preprint,
we definitely talk about how really this is one very small step,
and that we're focusing on just making a judgment based on taking the
text as it is, trusting what's written in the abstracts.
But like you say,
in order to really build a good evidence
synthesis system that you could actually use to make real decisions,
you need to,
like you said, incorporate importance using a PageRank-type algorithm and stuff like that.
Yeah. So, to wrap this up, let's address the final question, which is: have you
actually submitted this paper for publication anywhere?
Are you waiting to hear back?
Right, yeah.
So the way it works in the NLP community is that there's a one-month anonymity period.
So you post on arXiv a month before the conference deadline,
and then for that intervening month, you can't post anything more, and normally you can't discuss or announce the work at all. And this is with the hope that when it's submitted to the conference,
it will be somewhat anonymous, and the reviewers won't be able to figure out who's writing it.
So we're definitely going to submit it, and I'm not going to name the conference because I don't
know the exact rules about the anonymity period.
But yeah, we're going to submit it.
Okay, okay.
Well, then good luck with your submission.
I mean, I don't know the acceptance rates and even the specific venue you're targeting,
but to me that's interesting work,
and even the novelty aspects should give you
some points, let's say. Hopefully, yeah.
I hope you enjoyed the podcast. If you like my work,
you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.