Microsoft Research Podcast - Abstracts: May 6, 2024
Episode Date: May 6, 2024
Researcher Michel Galley explores how he and fellow researchers combined new and existing data to create MathVista, an open-source benchmark for measuring the mathematical reasoning capabilities of foundation models in scenarios that involve text and images.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Dr. Gretchen Huizinga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast abstract
of their new and noteworthy papers.
My guest today is Dr. Michel Galley, a senior principal researcher at Microsoft Research.
Dr. Galley is the co-author of a paper called MathVista:
Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts.
Michel, thanks for joining us on Abstracts today.
Thank you for having me.
So I like to start with a distillation or sort of an elevator pitch of your research.
Tell us in just a couple sentences what problem or issue your paper addresses and why we should care about it.
So this paper is about these large foundation models in a multimodal setup.
So when the input to the model is actually not just text, but also text and images.
And then an example of a task that such a model would perform
is like input is maybe a mathematical question,
and then there's some visual support to that question,
let's say an image of a graph.
And then the model has to respond to something
related to that.
And why this is important: there has been a lot of work,
of course, on large foundation models,
especially when it comes to reasoning tasks
like mathematical reasoning, but a lot of it has focused more on the written form. So MathVista is one of
the very first datasets that has input that is both images and text.
Yeah, yeah. Well, reading your paper, it seems like this is an area that hasn't been studied
systematically. In fact, you actually say that and say that the field is largely unexplored.
But quickly tell us what has been done in this field,
and then tell us how your research addresses the proverbial gap in the literature.
Well, there has been a lot of work on vision and language in other problems,
like not just about reasoning. Maybe let me just mention why reasoning is important.
So one reason I think it's very interesting to evaluate these large language models in terms of reasoning skills is that we evaluate their capabilities beyond just
memorization. So as many of the listeners probably know, these large foundation models
are trained on large amounts of text, that is, public data from various sources. So when
you ask a question to a large foundation model, it could be the case in many cases that it
just memorizes things it has seen in the data.
So what makes reasoning interesting is that the answer oftentimes is not there in the
data.
So it needs to develop this ability to connect the dots between various pieces of information
to come up with a new answer.
So the focus of our paper is really on mathematical reasoning, but it also goes a bit beyond that
because what is also represented in the data is science questions and so on.
And so this reasoning work had largely focused,
until MathVista, on text-only modalities.
So it's one of the very first ones that combines text and images
in terms of evaluating these large foundation models.
So you asked about what was done before.
So yes, there has been a lot of work, text only, on reasoning. For
example, mathematical questions that are just based on text. And there has been a different
stream of work that was much more focused on vision. A lot of work has been on tasks such
as visual question answering, where basically you have an image and you have to answer a
question about this image. So yes, we're trying to fuse the two lines of research here. And this is one of the first
works that does that. Yeah. Well, let's talk about your methodology for a minute. Tell us how you
went about conducting this research and what methods did you use? Yes, sure. So that's a bit
different from a typical kind of machine learning paper because the focus of this work is really on
benchmarking, on a dataset. So the methodology is more about how we collect the data and process it. So there were two components to doing that. One was to look at
existing data that already combines vision and text. And there are existing datasets
that are actually already fairly big, but that were not focused on reasoning. So we used these existing
datasets and looked for instances in the data that actually include some mathematical or science reasoning.
And so that part is leveraging existing datasets.
But the important part is that we really wanted to carve out the interesting pieces in
terms of reasoning.
And we had different stages of processing the data to identify the subset that was reasoning
based.
So one first step was basically to apply some automatic filter to determine whether or not
a given example, let's say something that is visual and text, actually involves
some mathematical reasoning.
So we have different strategies.
For example, if the answer is numerical, it's likely that it might be something mathematically
related.
But that's just the first stage.
And in the second stage, we actually had human annotators certify that the selected data is actually of high quality.
So for a given example, we do know, oh, this is mathematical, or this is scientific, and so on.
And that's one part of the effort.
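To make that two-stage selection concrete, here is a minimal sketch in Python of the kind of first-stage automatic filter described above: keep an example if its answer looks numerical, then send the survivors on to human annotators. The field names, helper names, and regular expression are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal sketch of a first-stage filter that keeps visual QA examples
# whose answers look numerical, as a rough proxy for mathematical content.
# Field names and the regex are illustrative assumptions, not the paper's
# actual pipeline; kept examples would still go to human annotators next.

import re

def looks_numerical(answer: str) -> bool:
    """Return True if the answer looks like a number, fraction, or percentage."""
    answer = answer.strip().replace(",", "")
    return bool(re.fullmatch(r"-?\d+(\.\d+)?(/\d+)?%?", answer))

def first_stage_filter(examples):
    """Keep candidate examples that may involve mathematical reasoning."""
    return [ex for ex in examples if looks_numerical(str(ex["answer"]))]

# Example usage with toy data:
candidates = [
    {"question": "What is the slope of the line?", "answer": "2.5", "image": "plot.png"},
    {"question": "What color is the bus?", "answer": "red", "image": "bus.png"},
]
print(first_stage_filter(candidates))  # keeps only the slope question
```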
The other part is that we realized, while we collected the data, there are certain types of mathematical reasoning, or reasoning related to mathematics, that were not represented in the data.
So we created three new datasets as part of MathVista.
So when I say dataset, it's more like, think of MathVista
as an aggregate of different types of data.
And we added three of them, three new types of data.
One is what we call PaperQA,
which is basically data that is collected
from scientific papers on arXiv.
And that has questions asking about the paper
that include some visual component of the paper,
typically a plot or a figure.
And then we had IQTest, which is basically,
I mean, it's vaguely related, but basically,
it also kind of tries to test maybe more abstract thinking
about some input that is both text and visual.
And the final one is FunctionQA, which is basically algebraic reasoning over function
plots and so on. The important part was actually to identify, among vast amounts of data, what is
actually very interesting in terms of mathematical reasoning. So that part I think was quite a big
part of doing that work, finding existing data, but also creating new data.
Yeah, yeah. Well, my favorite part of a research paper is where it says,
and what we found was. So talk a little bit about your results. What did you find?
So we evaluated a wide variety of models, including GPT-4, Claude 2, GPT-4V, Multimodal Bard, and LLaVA.
And we categorize them into three categories.
So one is text-only.
So basically, you take a model that is, by default, just text.
And we give it the text part of the question
and ask it to answer the question.
Of course, that's kind of a bit of a difficult task
because oftentimes we crucially build these questions
so that you have to rely on the vision part.
But that's for, you know, scientific investigation to know how well they can do.
And so that's one of the categories of model.
A different category is still text only, but it is given the text detected from the image.
So on the image, we do OCR.
So we convert those words from images to text. It's kind of an extension of the
text-based model, except that what was in images is translated into text, and then the input to the
model is words only. And that's a different category of model. And the third one is basically a truly
multimodal model. And what we found, I mean, not surprisingly, the one that performed
most poorly is the one that is text only. The second is text plus OCR.
And then finally, the one that does best
is the multimodal like GPT-4V.
But while the ordering between these three categories
makes sense, it was a bit surprising
that maybe the gap between multimodal
and text plus OCR was not bigger.
Well, it's big, but maybe not as big as we were expecting.
So, for example, the best model given text detected from the images
achieved like 35% accuracy, while GPT-4V was 50%.
So it's a substantial gap, but not huge.
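As a rough illustration of the three evaluation categories just described, the sketch below shows how inputs could be assembled for each one. The function and parameter names (build_inputs, run_ocr) are hypothetical stand-ins, not the benchmark's actual evaluation code.

```python
# A minimal sketch of assembling inputs for the three model categories
# discussed above: text only, text plus OCR, and multimodal. Illustrative
# only; `run_ocr` is a hypothetical callable, and the returned dicts stand
# in for whatever a real model API expects.

def build_inputs(question: str, image_path: str, category: str, run_ocr=None):
    if category == "text_only":
        # Category 1: the model sees only the textual part of the question.
        return {"prompt": question}
    if category == "text_plus_ocr":
        # Category 2: text detected in the image is appended to the question,
        # so the model input is still words only.
        ocr_text = run_ocr(image_path) if run_ocr else ""
        return {"prompt": f"{question}\n\nText detected in the image:\n{ocr_text}"}
    if category == "multimodal":
        # Category 3: the model receives both the question and the raw image.
        return {"prompt": question, "image": image_path}
    raise ValueError(f"unknown category: {category}")

# Example usage:
print(build_inputs("What is the y-intercept of the plotted line?", "plot.png", "text_only"))
```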
Right.
Just to clarify, you're saying OCR.
What does that stand for?
Optical character recognition.
So basically, it's the task of taking text, sometimes typed,
but sometimes handwritten, and converting it into the actual text like you would have in a text file.
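For listeners who want to try that step themselves, here is a small example using pytesseract and Pillow, which are one off-the-shelf option and not necessarily the OCR tooling used for the benchmark.

```python
# A small illustration of the OCR step described above, using pytesseract
# and Pillow as one off-the-shelf option; not necessarily what the authors used.

from PIL import Image
import pytesseract  # also requires the Tesseract binary to be installed

def run_ocr(image_path: str) -> str:
    """Extract any typed or handwritten text that appears in the image."""
    return pytesseract.image_to_string(Image.open(image_path))

# Example: print(run_ocr("figure_with_axis_labels.png"))
```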
Right. Michel, does any of this have to do with the difficulty of the math
problems that you present these models with? I mean, it seems to me similar to humans,
that the easier the problem,
the easier it would be for the machine. So at what level of math are we talking for these tests?
What's nice about MathVista is there's a continuum of different difficulties. So the spectrum is
quite broad, going from elementary school to more advanced concepts such as calculus. So it's quite
broad. So in the paper, we do have this broken down by level, so the number I gave you, like 50, is an aggregate over all the
difficulties. But the goal there was really to kind of compare different models. But we do have
a fair amount of analysis in the appendix. Actually, we have 100 pages of appendices with plenty of
analysis and so on, so if people are interested... I saw the length of the paper,
and I'm going, what? It's a long paper. Well, research in the lab is one thing I always like
to say, but understanding real-world impacts is important too. So where is this work going to make
the most difference, and who does it help most at this point? Well, I think perhaps that's the main point of this kind of line of work in terms of reasoning,
is that when looking at these difficult problems, mathematical ones actually, it's a way to kind of abstract away maybe more complex capabilities.
And I think while thinking just about mathematics might seem a bit narrow, I don't think it really is. It's more about seeing whether this model has the
ability to do kind of multi-step processing of your input and think maybe somewhat intelligently
about a given problem. So we focus mostly on math. There's some science, but we would be very
interested, especially in future work, to kind of go beyond that. Okay. Well, let me press in a little bit there, because just say I'm a regular person using
a GPT model. Is your work more addressed upstream from that to the research community
to say, how do we get these models to be better so that downstream, people like me can be more
confident of the models? Yes. I would say at the moment, I mean, this line of work is perhaps more geared
towards the research community,
but I think it could be some seed
for researchers to think about
some application perhaps
that also requires some kind of
step-by-step reasoning,
but perhaps not going beyond math.
Yeah.
Michel, if there was one thing
you want our listeners
to take away from this research,
kind of golden nugget, what would it be?
Well, I would say it's the challenging part
of this dataset.
I think that's what makes MathVista stand out
compared to other datasets.
By now, there are a few other vision and language datasets
and, of course, many that are more text-based.
And we've seen, for example,
some recent papers showing
that actually MathVista
remains one of the most challenging ones.
So I think it's probably going to stay around
for a while because of the difficulty
it represents.
So it's an openly available, open-source dataset
that everybody can use,
and we very much encourage people to use it.
Is it on GitHub?
Yes, it's on GitHub.
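If you want to load the benchmark programmatically, one common route is the Hugging Face datasets library; the dataset ID "AI4Math/MathVista" and the "testmini" split shown below are assumptions to verify against the project's GitHub page.

```python
# A quick way to poke at the dataset, assuming it is mirrored on the
# Hugging Face Hub under the ID "AI4Math/MathVista" with a "testmini"
# split; check the project's GitHub page for authoritative instructions.

from datasets import load_dataset

mathvista = load_dataset("AI4Math/MathVista", split="testmini")
print(len(mathvista))          # number of examples in the split
print(mathvista[0].keys())     # fields available for each example
```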
So what's next on the research agenda for helping LLMs get better at math?
Michel, what are the big challenges still remaining in the field?
I mean, you've alluded to many of them already, sort of, but what's next on your research
agenda? Well, I would say what we found so far is these models are very good at processing the textual part of the problems given to the model.
But the equivalent in images is actually harder somehow.
So I think a lot more work needs to be done in terms of vision capabilities, in terms of reasoning over images,
because the capabilities you see in text are actually quite advanced, whereas the equivalent in images doesn't seem that
good.
I mean, a fair disclaimer, my background is more on the text side.
So some of my colleagues on the paper are more on the vision side.
So maybe if a listener maybe ran into some of our co-workers at a conference, they might
want to talk to these vision people, because that's less of my background.
Well, and if you think about Venn diagrams,
you know, you've got people that are doing text,
people that are doing vision,
and then the people that are trying to do both to see how the worlds collide.
Well, Michel Galley, thanks for joining us today.
And to our listeners, thanks for tuning in.
If you want to read this paper,
you can find the link at aka.ms/abstracts, or you can find it on arXiv. You can also read it on the website for the International Conference on Learning Representations, or ICLR. And if you happen to be at the ICLR conference this week, you can hear more about it there. See you next time on Abstracts.