Microsoft Research Podcast - Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang
Episode Date: December 13, 2024
Researcher Jindong Wang and Associate Professor Steven Euijong Whang explore the NeurIPS 2024 work ERBench. ERBench leverages relational databases to create LLM benchmarks that can verify model rationale via keywords in addition to checking answer correctness.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Gretchen Huizenga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast
abstract of their new and noteworthy papers.
Today, I'm talking to Jindong Wang, a senior researcher at Microsoft Research, and Steven Euijong Whang, a tenured associate professor at the Korea Advanced Institute of Science and Technology. Jindong and Steven are co-authors of a paper called ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models.
And this paper is a spotlight at this year's Conference on Neural Information Processing Systems,
or NeurIPS, in Vancouver, BC, this week.
Jindong and Steven, thanks for joining us on Abstracts.
Thank you. Nice to be here.
It's great to be here.
So, Jindong, I'll start with you. In just a few sentences, tell us what problem your research addresses and why people should care about it.
Okay. Everybody knows that with the widespread usage of large language models, hallucination has become a crucial factor of concern. Hallucination occurs when models generate false or non-existent information. In particular, factual hallucination greatly undermines the reliability of large language models.
To correctly evaluate hallucination, evaluating the model's rationale is also important. Up to the date when the paper was submitted, there were no works dealing with automatic rationale evaluation systematically, because most of them focused on manual evaluation or just used GPT as a judge.
ERBench is the first one to generate a large language model evaluation benchmark utilizing relational databases.
Relational databases are based on the relational data model, which assumes a fixed schema. The fixed schema enables relational databases to have data integrity guarantees that are based on database design theories, so the integrity constraints in relational databases allow better evaluation of language models.
Functional dependencies allow automatic rationale evaluation using keywords inferred from the dependencies, and foreign key constraints also allow for easy generation of multi-hop questions, which are usually very complicated to generate with other techniques.
So that's basically what we want to do.
So in one sentence, we try to build an automatic benchmark for evaluating hallucination.
Steven, give us a quick overview of your research methodology and findings.
How did you conduct your research and what were your major takeaways?
Sure.
So this was a collaboration between our group at KAIST and Dr. Xing Xie's group at MSRA.
KAIST is Korea Advanced Institute of Science and Technology.
So we had the privilege to closely work
with our LLM expert, Dr. Jindong Wang here.
We also acknowledge the Microsoft Accelerating Foundation Models Research, or AFMR, program for providing the Azure resources we used for our experiments.
So we had some bi-weekly meetings for maybe over a year,
and at some point we figured
that relational databases could be really important for LLM evaluation.
I personally have a background in databases, which I studied at Stanford University as
a PhD student.
So relational databases have integrity constraints that can be used to better construct complex in-depth questions and
verify answers. So the first ingredient is functional dependencies. So these are constraints
where given a few attributes, you can determine another attribute. So I'll just give an example
because I think that helps the understanding. So suppose that you have like a movie table, and in a movie, you have the title of the movie,
the year of production, and the director of the movie,
and the length of the movie, and so on and so forth.
So if you know the title and year of the movie,
that pretty much identifies the movie,
and you can actually determine the director of the movie as well.
So for example, if you know that
there's a movie called Star Wars, which is a very popular movie produced in 1977,
that determines the director. We know it's George Lucas, right? So basically
it's like a function. It receives Star Wars and 1977 as input and gives the output George Lucas. So that's the first ingredient.
Now, the reason this is important is that we can use these functional dependencies to pinpoint
critical keywords that an LLM must know to properly answer a given question containing
certain attribute values. For example, we may ask the LLM, is there a
director of a movie called Star Wars produced in 1977? And the LLM can say yes.
And it is the right answer, but we'd like to know if the LLM knows what it's saying, right? And so we look at the rationale. That's why looking at the rationale is important. We can't just say it's doing the correct thing.
So if the LLM mentions George Lucas, bingo, that's a great answer.
However, if the LLM mentions some other director like Steven Spielberg, that's not a correct rationale.
So that's exactly what we're trying to evaluate.
Functional dependency is key to being able to do that kind of verification.
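To make that concrete, here is a minimal sketch of the idea; the table row, the fd tuple, and the question template are illustrative assumptions, not ERBench's actual code:

```python
# A toy "movie" table where (title, year) -> director is a functional dependency.
movies = [
    {"title": "Star Wars", "year": 1977, "director": "George Lucas"},
]

# The functional dependency: determinant attributes -> determined attribute.
fd = (("title", "year"), "director")

def make_question(row, fd):
    """Turn one row into a yes/no question plus the keyword that a
    correct rationale must mention (inferred from the dependency)."""
    _, determined = fd
    question = (
        f"Is there a {determined} of a movie called {row['title']} "
        f"produced in {row['year']}?"
    )
    rationale_keyword = row[determined]  # what the FD determines
    return question, rationale_keyword

q, keyword = make_question(movies[0], fd)
print(q)        # Is there a director of a movie called Star Wars produced in 1977?
print(keyword)  # George Lucas
```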
The second ingredient is foreign key constraints.
So a foreign key constraint is where one of the attributes in one table
can intuitively link to another attribute of another table.
So in our movie table, we had the director attribute.
Now, we may also have a separate table called the director table.
And maybe we might have some more information
about the director in that table,
like the director name, the director's age,
all sorts of information about the director.
So foreign key constraint basically
requires that if there is some director mentioned
in the movie table, it has to be one of the directors
in the director table.
So this basically links a table to another table.
It's very useful. So using this, what we can do is we can join the two tables, right? So now we can
join the movie and director table and generate a bigger table. The reason this is useful is that
we can also chain together functional dependencies that I just mentioned into longer functional dependencies.
So what this enables is us to construct more complex questions arbitrarily that are multi-hop.
So using these integrity constraints, we can basically convert any relational database
into an LLM benchmark.
And this supports continuous evaluation as the database changes. We can also support multimodal questions and also support various
prompt engineering techniques.
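Extending the toy sketch from before shows how the join chains dependencies into a multi-hop question; again, the rows and the birth-year attribute are made-up illustrations, not the paper's implementation:

```python
movies = [
    {"title": "Star Wars", "year": 1977, "director": "George Lucas"},
]
# A separate "directors" table; movies.director is a foreign key into it.
directors = {
    "George Lucas": {"name": "George Lucas", "birth_year": 1944},
}

def multi_hop_question(movie):
    """Join movie -> director via the foreign key, chaining the dependencies
    (title, year) -> director and director -> birth_year into the longer
    dependency (title, year) -> birth_year: a two-hop question."""
    director = directors[movie["director"]]  # the foreign key lookup (the join)
    question = (
        f"Was the director of the movie {movie['title']} ({movie['year']}) "
        f"born in {director['birth_year']}?"
    )
    # A correct rationale should still pass through the intermediate hop.
    rationale_keyword = movie["director"]
    return question, rationale_keyword

q, keyword = multi_hop_question(movies[0])
print(q)        # Was the director of the movie Star Wars (1977) born in 1944?
print(keyword)  # George Lucas
```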
Well, I would ask you to kind of drill in on what you found and how ERBench compares to other benchmark tests.
So we evaluated our benchmark on five domains and performed comprehensive analyses in terms of answer and rationale accuracies and hallucination rates using single, multi-hop, and multi-modal questions, and also performed prompt engineering and fine-tuning.
And what we found is that some LLMs, like GPT-4, are relatively aggressive and good at answering lots of questions. Other LLMs like
Gemini tend to be a bit more conservative and do not answer as many questions, but instead
hallucinate less as a result. So the key conclusion is that no LLM totally subsumes the other in all
aspects, which is the reason why we use multiple measures. And the key message we want to make is that overall, ERBench is effective in evaluating
any LLM's thought process by pinpointing critical keywords within the rationale.
Well, Jindong, back to you.
Research settings are one thing, but tell us how your work is significant in real-world settings,
and who does this impact most and how?
Relational databases are everywhere across various domains. Anyone can easily get access to them from Google or from Kaggle, or even create them, targeting the domain or subject that one wants to test the model on. So taking into account that ERBench is the first work to utilize relational databases for generating large language model hallucination benchmarks, this work will lead to a new research direction of integrating database design theories and techniques, a long-studied field (databases are very traditional, old, and classic, but they're still operating right now), into the large language model field, a recently emerging area.
Right. Well, Steven, as we close, I assume there are still a few unanswered questions or unsolved problems in the field.
What do you propose to do about those and what's next on your research agenda?
Sure. So the big picture is that we basically propose the first work to properly evaluate the rationale of LLMs, right?
This is very important because LLMs are being used in our everyday lives and everyone has the question, is the LLM suitable for my task?
Can I benefit from the LLM? So it's very important to verify if the LLM knows what it's saying.
So I just mentioned that we use functional dependencies
to pinpoint critical keywords in the rationale. And we believe that's just the first step. It's
very effective, by the way. So you may have the question: is it enough to just look for, like, the George Lucas within a long rationale? And it turns out that in 95% of the cases, it is actually effective. So we did human studies and also used GPT as a judge to verify that.
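In code, that keyword check is deliberately simple; a sketch (again illustrative, not the paper's implementation) might look like this:

```python
def verify_rationale(rationale: str, keyword: str) -> bool:
    """Check whether the FD-inferred keyword appears in the model's rationale.
    Per the discussion above, this simple containment test agreed with human
    judgment in roughly 95% of cases for these factual questions."""
    return keyword.lower() in rationale.lower()

rationale = ("Yes. Star Wars, released in 1977, was written and "
             "directed by George Lucas.")
print(verify_rationale(rationale, "George Lucas"))      # True
print(verify_rationale(rationale, "Steven Spielberg"))  # False
```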
But these are factual questions, and there could be various other questions that require long answers, right, long rationales.
And so the important question is, can we also verify all the rest of the rationales, the complicated rationales as well?
And so in order to properly do that, we need a lot of technology.
So first we need to understand the rationales using NLP techniques, and we need to know
if it's properly answering the question, and so on and so forth.
And so we believe that there's a lot of opportunity to expand from that.
So we basically proposed an initial work towards this direction,
but we believe that there are many more interesting challenges that remain.
Well, Jindong Wang and Steven Euijong Whang, thanks for joining us today. And to our listeners,
thanks for tuning in. If you're interested in learning more about this paper, you can find a link at aka.ms forward slash abstracts.
You can also find it on arXiv and on the NeurIPS website. And if you're at the NeurIPS
conference this week, go to the poster session and talk to the authors. See you next time on Abstracts. Thank you.