Microsoft Research Podcast - Abstracts: May 6, 2024
Episode Date: May 6, 2024
Researcher Michel Galley explores how he and fellow researchers combined new and existing data to create MathVista, an open-source benchmark for measuring the mathematical reasoning capabilities of foundation models in scenarios that involve text and images.
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Dr. Gretchen Huizinga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast abstract
of their new and noteworthy papers.
My guest today is Dr. Michel Galley, a senior principal researcher at Microsoft Research.
Dr. Galley is the co-author of a paper called MathVista:
Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts.
Michel, thanks for joining us on Abstracts today.
Thank you for having me.
So I like to start with a distillation or sort of an elevator pitch of your research.
Tell us in just a couple sentences what problem or issue your paper addresses and why we should care about it.
So this paper is about these large foundation models in a multimodal setup.
So when the input to the model is actually not just text, but also text and images.
And then an example of a task that such a model would perform
is like input is maybe a mathematical question,
and then there's some visual support to that question,
let's say an image of a graph.
And then the model has to respond to something
related to that.
And why this is important: there has been a lot of work,
of course, on large foundation models,
especially when it comes to reasoning tasks
like mathematical reasoning, but a lot of it has focused more on the written form. So MathVista is one of
the very first datasets that has input that is both images and text.
Yeah, yeah. Well, reading your paper, it seems like this is an area that hasn't been studied
systematically. In fact, you actually say that and say that the field is largely unexplored.
But quickly tell us what has been done in this field,
and then tell us how your research addresses the proverbial gap in the literature.
Well, there has been a lot of work on vision and language in other problems,
like not just about reasoning. Maybe let me just mention why reasoning is important.
So one reason I think it's very interesting to evaluate these large language models in terms of reasoning skills is that we evaluate their capabilities beyond just
memorization. So as many of the listeners probably know, these large foundation models
are trained on large amounts of text, that is, public data from various sources. So when
you ask a question to a large foundation model, it could be the case in many cases that it
just memorizes things it has seen in the data.
So what makes reasoning interesting is that the answer oftentimes is not there in the
data.
So it needs to develop this ability to connect the dots between various pieces of information
to come up with a new answer.
So the focus of our paper is really on mathematical reasoning, but it also goes a bit beyond that
because what is also represented in the data is science questions and so on.
And so this reasoning work had largely focused,
until MathVista, on text-only modalities.
So it's one of the very first ones that combines text and images
in terms of evaluating these large foundation models.
So you asked about what was done before.
So yes, there has been a lot of work, text only, on reasoning. For
example, mathematical questions that are just based on text. And there has been a different
stream of work that was much more focused on vision. A lot of work has been on tasks such
as visual question answering, where basically you have an image and you have to answer a
question about this image. So yes, we're trying to fuse the two lines of research here. And this is one of the first
works that does that. Yeah. Well, let's talk about your methodology for a minute. Tell us how you
went about conducting this research and what methods did you use? Yes, sure. So that's a bit
different from a typical kind of machine learning paper because the focus of this work is really on
benchmarking, on a dataset. So the methodology is more about how we collect the data and process it. So there were two components to doing that. One was to look at
existing data that already combines vision and text. And there are existing datasets
that are actually already fairly big, but that were not focused on reasoning. So we used these existing
datasets and looked for instances in the data that actually include some mathematical or science reasoning.
And so that part is leveraging existing datasets.
But the important part is that we really wanted to carve out the interesting pieces in
terms of reasoning.
And we had different stages of processing the data to identify the subset that was reasoning
based.
So one first step was basically to apply some automatic filter to determine whether or not
a given example, let's say something that is visual and text, actually involves
some mathematical reasoning.
So we have different strategies.
For example, if the answer is numerical, it's likely that it might be something mathematically
related.
But that's just the first stage.
And in the second stage, we actually had human annotators certify that the selected data is actually of high quality.
So for a given example, we do know, oh, this is mathematical, or this is scientific, and so on.
And that's one part of the effort.
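To make that two-stage selection concrete, here is a minimal sketch in Python of the kind of first-stage automatic filter described above: keep an example if its answer looks numerical, then send the survivors on to human annotators. The field names, helper names, and regular expression are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal sketch of a first-stage filter that keeps visual QA examples
# whose answers look numerical, as a rough proxy for mathematical content.
# Field names and the regex are illustrative assumptions, not the paper's
# actual pipeline; kept examples would still go to human annotators next.

import re

def looks_numerical(answer: str) -> bool:
    """Return True if the answer looks like a number, fraction, or percentage."""
    answer = answer.strip().replace(",", "")
    return bool(re.fullmatch(r"-?\d+(\.\d+)?(/\d+)?%?", answer))

def first_stage_filter(examples):
    """Keep candidate examples that may involve mathematical reasoning."""
    return [ex for ex in examples if looks_numerical(str(ex["answer"]))]

# Example usage with toy data:
candidates = [
    {"question": "What is the slope of the line?", "answer": "2.5", "image": "plot.png"},
    {"question": "What color is the bus?", "answer": "red", "image": "bus.png"},
]
print(first_stage_filter(candidates))  # keeps only the slope question
```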
The other part is that we realized, while we collected the data, there are certain types of mathematical reasoning, or reasoning related to mathematics, that were not represented in the data.
So we created three new datasets as part of MathVista.
So when I say dataset, it's more like, think of MathVista
as an aggregate of different types of data.
And we added three of them, three new types of data.
One is what we call PaperQA,
which is basically data that is collected
from scientific papers on arXiv.
And that has questions asking about the paper
that include some visual component of the paper,
typically a plot or a figure.
And then we had IQTest, which is basically,
I mean, it's vaguely related, but basically,
it also kind of tries to test maybe more abstract thinking
about some input that is both text and visual.
And the final one is FunctionQA, which is basically algebraic reasoning over function
plots and so on. The important part was actually to identify, among vast amounts of data, what is
actually very interesting in terms of mathematical reasoning. So that part I think was quite a big
part of doing that work, finding existing data, but also creating new data.
Yeah, yeah. Well, my favorite part of a research paper is where it says,
and what we found was. So talk a little bit about your results. What did you find?
So we evaluated a wide variety of models, including GPT-4, Claude 2, GPT-4V, Multimodal Bard, and LLaVA.
And we categorize them into three categories.
So one is text-only.
So basically, you take a model that is, by default, just text.
And we give it the text part of the question
and ask it to answer the question.
Of course, that's kind of a bit of a difficult task
because oftentimes we crucially build these questions
so that you have to rely on the vision part.
But that's for, you know, scientific investigation to know how well they can do.
And so that's one of the categories of model.
A different category is still text only, but it is given the text detected from the image.
So on the image, we do OCR.
So we convert those words from images to text. It's kind of an extension of the
text-based model, except that what was in images is translated into text, and then the input to the
model is words only. And that's a different category of model. And the third one is basically a truly
multimodal model. And what we found, I mean, not surprisingly, the one that performed
most poorly is the one that is text only. The second is text plus OCR.
And then finally, the one that does best
is the multimodal like GPT-4V.
But while the ordering between these three categories
makes sense, it was a bit surprising
that maybe the gap between multimodal
and text plus OCR was not bigger.
Well, it's big, but maybe not as big as we were expecting.
So, for example, the best model given text detected from the images
achieved like 35% accuracy, while GPT-4V was 50%.
So it's a substantial gap, but not huge.
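As a rough illustration of the three evaluation categories just described, the sketch below shows how inputs could be assembled for each one. The function and parameter names (build_inputs, run_ocr) are hypothetical stand-ins, not the benchmark's actual evaluation code.

```python
# A minimal sketch of assembling inputs for the three model categories
# discussed above: text only, text plus OCR, and multimodal. Illustrative
# only; `run_ocr` is a hypothetical callable, and the returned dicts stand
# in for whatever a real model API expects.

def build_inputs(question: str, image_path: str, category: str, run_ocr=None):
    if category == "text_only":
        # Category 1: the model sees only the textual part of the question.
        return {"prompt": question}
    if category == "text_plus_ocr":
        # Category 2: text detected in the image is appended to the question,
        # so the model input is still words only.
        ocr_text = run_ocr(image_path) if run_ocr else ""
        return {"prompt": f"{question}\n\nText detected in the image:\n{ocr_text}"}
    if category == "multimodal":
        # Category 3: the model receives both the question and the raw image.
        return {"prompt": question, "image": image_path}
    raise ValueError(f"unknown category: {category}")

# Example usage:
print(build_inputs("What is the y-intercept of the plotted line?", "plot.png", "text_only"))
```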
Right.
Just to clarify, you're saying OCR.
What does that stand for?
Optical character recognition.
So basically, it's the task of taking text, sometimes typed,
but sometimes handwritten, and converting it into the actual text like you would have in a text file.
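For listeners who want to try that step themselves, here is a small example using pytesseract and Pillow, which are one off-the-shelf option and not necessarily the OCR tooling used for the benchmark.

```python
# A small illustration of the OCR step described above, using pytesseract
# and Pillow as one off-the-shelf option; not necessarily what the authors used.

from PIL import Image
import pytesseract  # also requires the Tesseract binary to be installed

def run_ocr(image_path: str) -> str:
    """Extract any typed or handwritten text that appears in the image."""
    return pytesseract.image_to_string(Image.open(image_path))

# Example: print(run_ocr("figure_with_axis_labels.png"))
```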
Right. Michel, does any of this have to do with the difficulty of the math
problems that you present these models with? I mean, it seems to me similar to humans,
that the easier the problem,
the easier it would be for the machine. So at what level of math are we talking for these tests?
What's nice about MathVista is there's a continuum of different difficulties. So the spectrum is
quite broad, going from elementary school to more advanced concepts such as calculus. So it's quite
broad. So in the paper, we do have this broken down by level, so the number I gave you, like 50, is an aggregate over all the
difficulties. But the goal there was really to kind of compare different models. But we do have
a fair amount of analysis in the appendix. Actually, we have 100 pages of appendices with plenty of
analysis and so on, so if people are interested... I saw the length of the paper,
and I'm going, what? It's a long paper. Well, research in the lab is one thing I always like
to say, but understanding real-world impacts is important too. So where is this work going to make
the most difference, and who does it help most at this point? Well, I think perhaps that's the main point of this kind of line of work in terms of reasoning,
is that when looking at these difficult problems, mathematical ones actually, it's a way to kind of abstract away maybe more complex capabilities.
And I think while thinking just about mathematics might seem a bit narrow, I don't think it really is. It's more about seeing whether this model has the
ability to do kind of multi-step processing of your input and think maybe somewhat intelligently
about a given problem. So we focus mostly on math. There's some science, but we would be very
interested, especially in future work, to kind of go beyond that. Okay. Well, let me press in a little bit there, because just say I'm a regular person using
a GPT model. Is your work more addressed upstream from that to the research community
to say, how do we get these models to be better so that downstream, people like me can be more
confident of the models? Yes. I would say at the moment, I mean, this line of work is perhaps more geared
towards the research community,
but I think it could be some seed
for researchers to think about
some application perhaps
that also requires some kind of
step-by-step reasoning,
but perhaps not going beyond math.
Yeah.
Michel, if there was one thing
you want our listeners
to take away from this research,
kind of golden nugget, what would it be?
Well, I would say it's the challenging part
of this dataset.
I think that's what makes MathVista stand out
compared to other datasets.
By now, there are a few other vision and language datasets
and, of course, many that are more text-based.
And we've seen, for example,
some recent papers showing
that actually MathVista
remains one of the most challenging ones.
So I think it's probably going to stay around
for a while because of the difficulty
it represents.
So it's an openly available, open-source dataset
that everybody can use,
and we very much encourage people to use it.
Is it on GitHub?
Yes, it's on GitHub.
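If you want to load the benchmark programmatically, one common route is the Hugging Face datasets library; the dataset ID "AI4Math/MathVista" and the "testmini" split shown below are assumptions to verify against the project's GitHub page.

```python
# A quick way to poke at the dataset, assuming it is mirrored on the
# Hugging Face Hub under the ID "AI4Math/MathVista" with a "testmini"
# split; check the project's GitHub page for authoritative instructions.

from datasets import load_dataset

mathvista = load_dataset("AI4Math/MathVista", split="testmini")
print(len(mathvista))          # number of examples in the split
print(mathvista[0].keys())     # fields available for each example
```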
So what's next on the research agenda for helping LLMs get better at math?
Michel, what are the big challenges still remaining in the field?
I mean, you've alluded to many of them already, sort of, but what's next on your research
agenda? Well, I would say what we found so far is these models are very good at processing the textual part of the problems given to the model.
But the equivalent in images is actually harder somehow.
So I think a lot more work needs to be done in terms of vision capabilities, in terms of reasoning over images,
because the capabilities you see in text are actually quite advanced, whereas the equivalent in images doesn't seem that
good.
I mean, a fair disclaimer, my background is more on the text side.
So some of my colleagues on the paper are more on the vision side.
So maybe if a listener maybe ran into some of our co-workers at a conference, they might
want to talk to these vision people, because that's less of my background.
Well, and if you think about Venn diagrams,
you know, you've got people that are doing text,
people that are doing vision,
and then the people that are trying to do both to see how the worlds collide.
Well, Michel Galley, thanks for joining us today.
And to our listeners, thanks for tuning in.
If you want to read this paper,
you can find the link at aka.ms/abstracts, or you can find it on arXiv. You can also read it on the website for the International Conference on Learning Representations, or ICLR. And if you happen to be at the ICLR conference this week, you can hear more about it there. See you next time on Abstracts.