Microsoft Research Podcast - Abstracts: NeurIPS 2024 with Jindong Wang and Steven Euijong Whang
Episode Date: December 13, 2024
Researcher Jindong Wang and Associate Professor Steven Euijong Whang explore the NeurIPS 2024 work ERBench. ERBench leverages relational databases to create LLM benchmarks that can verify model rationale via keywords in addition to checking answer correctness.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Gretchen Huizenga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast
abstract of their new and noteworthy papers.
Today, I'm talking to Jindong Wang, a senior researcher at Microsoft Research, and Steven Euijong Whang, a tenured associate professor at the Korea Advanced Institute of Science and Technology. Jindong and Steven are co-authors of a paper called ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models.
And this paper is a spotlight at this year's Conference on Neural Information Processing Systems,
or NeurIPS, in Vancouver, BC, this week.
Jindong and Steven, thanks for joining us on Abstracts.
Thank you. Nice to be here.
It's great to be here.
So, Jindong, I'll start with you. In just a few sentences, tell us what problem your research addresses and why people should care about it.
Okay. Everybody knows that with the widespread usage of large language models, hallucination has become a crucial factor of concern. Hallucination occurs when models generate false or non-existent information. In particular, factual hallucination greatly undermines the reliability of large language models.
To correctly evaluate hallucination, evaluating the model's rationale is also important. Up to the date when the paper was submitted, there were no works dealing with automatic rationale evaluation systematically, because most of them focused on manual evaluation or just used GPT as a judge.
ERBench is the first one to generate a large language model evaluation benchmark utilizing relational databases.
Relational databases are based on the relational data model, which assumes a fixed schema. The fixed schema enables relational databases to have data integrity guarantees that are based on database design theories, so the integrity constraints in relational databases allow better evaluation of language models.
Functional dependencies allow automatic rationale evaluation using keywords inferred from the dependencies, and foreign key constraints also allow for easy generation of multi-hop questions, which are usually very complicated to generate with other techniques.
So that's basically what we want to do.
So in one sentence, we try to build an automatic benchmark for evaluating hallucination.
Steven, give us a quick overview of your research methodology and findings.
How did you conduct your research and what were your major takeaways?
Sure.
So this was a collaboration between our group at KAIST and Dr. Xing Xie's group at MSRA.
KAIST is Korea Advanced Institute of Science and Technology.
So we had the privilege to closely work
with our LLM expert, Dr. Jindong Wang here.
We also acknowledge the Microsoft Accelerating Foundation Models Research, or AFMR, program for providing the Azure resources we used for our experiments.
So we had some bi-weekly meetings for maybe over a year,
and at some point we figured
that relational databases could be really important for LLM evaluation.
I personally have a background in databases, which I studied at Stanford University as
a PhD student.
So relational databases have integrity constraints that can be used to better construct complex in-depth questions and
verify answers. So the first ingredient is functional dependencies. So these are constraints
where given a few attributes, you can determine another attribute. So I'll just give an example
because I think that helps the understanding. So suppose that you have like a movie table, and in a movie, you have the title of the movie,
the year of production, and the director of the movie,
and the length of the movie, and so on and so forth.
So if you know the title and year of the movie,
that pretty much identifies the movie,
and you can actually determine the director of the movie as well.
So for example, if you know that
there's a movie called Star Wars, which is a very popular movie produced in 1977,
that determines the director. We know it's George Lucas, right? So basically
it's like a function. It receives Star Wars and 1977 as input and gives the output George Lucas. So that's the first ingredient.
Now, the reason this is important is that we can use these functional dependencies to pinpoint
critical keywords that an LLM must know to properly answer a given question containing
certain attribute values. For example, we may ask the LLM, is there a
director of a movie called Star Wars produced in 1977? And the LLM can say yes.
And it is the right answer, but we'd like to know if the LLM knows what it's saying, right? And so we look at the rationale. That's why looking at the rationale is important. We can't just say it's doing the correct thing.
So if the LLM mentions George Lucas, bingo, that's a great answer.
However, if the LLM mentions some other director like Steven Spielberg, that's not a correct rationale.
So that's exactly what we're trying to evaluate.
Functional dependency is key to being able to do that kind of verification.
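To make that concrete, here is a minimal sketch of the idea; the table row, the fd tuple, and the question template are illustrative assumptions, not ERBench's actual code:

```python
# A toy "movie" table where (title, year) -> director is a functional dependency.
movies = [
    {"title": "Star Wars", "year": 1977, "director": "George Lucas"},
]

# The functional dependency: determinant attributes -> determined attribute.
fd = (("title", "year"), "director")

def make_question(row, fd):
    """Turn one row into a yes/no question plus the keyword that a
    correct rationale must mention (inferred from the dependency)."""
    _, determined = fd
    question = (
        f"Is there a {determined} of a movie called {row['title']} "
        f"produced in {row['year']}?"
    )
    rationale_keyword = row[determined]  # what the FD determines
    return question, rationale_keyword

q, keyword = make_question(movies[0], fd)
print(q)        # Is there a director of a movie called Star Wars produced in 1977?
print(keyword)  # George Lucas
```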
The second ingredient is foreign key constraints.
So a foreign key constraint is where one of the attributes in one table
can intuitively link to another attribute of another table.
So in our movie table, we had the director attribute.
Now, we may also have a separate table called the director table.
And maybe we might have some more information
about the director in that table,
like the director name, the director's age,
all sorts of information about the director.
So foreign key constraint basically
requires that if there is some director mentioned
in the movie table, it has to be one of the directors
in the director table.
So this basically links a table to another table.
It's very useful. So using this, what we can do is we can join the two tables, right? So now we can
join the movie and director table and generate a bigger table. The reason this is useful is that
we can also chain together functional dependencies that I just mentioned into longer functional dependencies.
So what this enables is us to construct more complex questions arbitrarily that are multi-hop.
So using these integrity constraints, we can basically convert any relational database
into an LLM benchmark.
And this supports continuous evaluation as the database changes. We can also support multimodal questions and also support various
prompt engineering techniques.
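Extending the toy sketch from before shows how the join chains dependencies into a multi-hop question; again, the rows and the birth-year attribute are made-up illustrations, not the paper's implementation:

```python
movies = [
    {"title": "Star Wars", "year": 1977, "director": "George Lucas"},
]
# A separate "directors" table; movies.director is a foreign key into it.
directors = {
    "George Lucas": {"name": "George Lucas", "birth_year": 1944},
}

def multi_hop_question(movie):
    """Join movie -> director via the foreign key, chaining the dependencies
    (title, year) -> director and director -> birth_year into the longer
    dependency (title, year) -> birth_year: a two-hop question."""
    director = directors[movie["director"]]  # the foreign key lookup (the join)
    question = (
        f"Was the director of the movie {movie['title']} ({movie['year']}) "
        f"born in {director['birth_year']}?"
    )
    # A correct rationale should still pass through the intermediate hop.
    rationale_keyword = movie["director"]
    return question, rationale_keyword

q, keyword = multi_hop_question(movies[0])
print(q)        # Was the director of the movie Star Wars (1977) born in 1944?
print(keyword)  # George Lucas
```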
Well, I would ask you to kind of drill in on what you found and how ERBench compares to other benchmark tests.
So we evaluated our benchmark on five domains and performed comprehensive analyses in terms of answer and rationale accuracies and hallucination rates using single, multi-hop, and multi-modal questions, and also performed prompt engineering and fine-tuning.
And what we found is that some LLMs, like GPT-4, are relatively aggressive and good at answering lots of questions. Other LLMs like
Gemini tend to be a bit more conservative and do not answer as many questions, but instead
hallucinate less as a result. So the key conclusion is that no LLM totally subsumes the other in all
aspects, which is the reason why we use multiple measures. And the key message we want to make is that overall, ERBench is effective in evaluating
any LLM's thought process by pinpointing critical keywords within the rationale.
Well, Jindong, back to you.
Research settings are one thing, but tell us how your work is significant in real-world settings,
and who does this impact most and how?
Relational databases are everywhere across various domains. Anyone can easily get access to them from Google or from Kaggle, or even create them, targeting the domain or subject that one wants to test the model on. So taking into account that ERBench is the first work to utilize relational databases for generating large language model hallucination benchmarks, this work will lead to a new research direction of integrating database design theories and techniques, a long-studied field (databases are very traditional, old, and classic, but they're still operating right now), into the large language model field, a recently emerging area.
Right. Well, Steven, as we close, I assume there are still a few unanswered questions or unsolved problems in the field.
What do you propose to do about those and what's next on your research agenda?
Sure. So the big picture is that we basically propose the first work to properly evaluate the rationale of LLMs, right?
This is very important because LLMs are being used in our everyday lives and everyone has the question, is the LLM suitable for my task?
Can I benefit from the LLM? So it's very important to verify if the LLM knows what it's saying.
So I just mentioned that we use functional dependencies
to pinpoint critical keywords in the rationale. And we believe that's just the first step. It's
very effective, by the way. So you may have the question: is it enough to just look for, like, the George Lucas within a long rationale? And it turns out that in 95% of the cases, it is actually effective. So we did human studies and also used GPT as a judge to verify that.
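In code, that keyword check is deliberately simple; a sketch (again illustrative, not the paper's implementation) might look like this:

```python
def verify_rationale(rationale: str, keyword: str) -> bool:
    """Check whether the FD-inferred keyword appears in the model's rationale.
    Per the discussion above, this simple containment test agreed with human
    judgment in roughly 95% of cases for these factual questions."""
    return keyword.lower() in rationale.lower()

rationale = ("Yes. Star Wars, released in 1977, was written and "
             "directed by George Lucas.")
print(verify_rationale(rationale, "George Lucas"))      # True
print(verify_rationale(rationale, "Steven Spielberg"))  # False
```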
But these are factual questions, and there could be various other questions that require long answers, right, long rationales.
And so the important question is, can we also verify all the rest of the rationales, the complicated rationales as well?
And so in order to properly do that, we need a lot of technology.
So first we need to understand the rationales using NLP techniques, and we need to know
if it's properly answering the question, and so on and so forth.
And so we believe that there's a lot of opportunity to expand from that.
So we basically proposed an initial work towards this direction,
but we believe that there are many more interesting challenges that remain.
Well, Jindong Wang and Steven Euijong Whang, thanks for joining us today. And to our listeners,
thanks for tuning in. If you're interested in learning more about this paper, you can find a link at aka.ms forward slash abstracts.
You can also find it on arXiv and on the NeurIPS website. And if you're at the NeurIPS
conference this week, go to the poster session and talk to the authors. See you next time on Abstracts. Thank you.