Microsoft Research Podcast - Abstracts: July 18, 2024

Episode Date: July 18, 2024

Senior Researcher Arindam Mitra introduces AgentInstruct. Using raw data sources, the automated multi-agent framework can create diverse, high-quality synthetic data at scale for the post-training of small and large language models.

Transcript
Starting point is 00:00:00 Welcome to Abstracts, a Microsoft Research podcast that puts the spotlight on world-class research in brief. I'm Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers.
Starting point is 00:00:24 I'm here today with Dr. Arindam Mitra, a senior researcher at Microsoft Research and the lead researcher for Microsoft's Orca project. Dr. Mitra is co-author of a paper called AgentInstruct: Toward Generative Teaching with Agentic Flows. Arindam, it's a pleasure to have you on Abstracts today. Thank you, Gretchen. So let's start with a brief overview of your paper. What problem does your research address and why does it matter?
Starting point is 00:00:52 So the post-training phase is very important for language models. You can really improve the model a lot by creating high-quality synthetic data. The problem, though, is that high-quality synthetic data creation requires lots of human effort and expertise. The problem that we're trying to tackle is, how do you reduce that human effort? How can you create high-quality data with a really low amount of human effort? When you have a language model, and let's say you want to apply it somewhere, you might have trained a generic model before, which could be small or big, doesn't matter. After that, you can specialize it on the domain that you're looking for. And when you want to do that, to make it really fast, this particular
Starting point is 00:01:37 process, it's best if you go for synthetic data. If you have a way to actually generate very high-quality synthetic data, you can fast-track this part of the specialization process, and not only for a single model. So this year, you're going to see a lot more multi-agent models. And when you are trying to build these multi-agent models, you may fear that it might increase the cost too much, the latency too much. So it's also very important that, when you have a multi-agent system, you can sort of replace some of those agents with specialized small models. And when you're trying to address these goals, you want this process to be something which you know works fast. So that's where we're trying to make sure we have a very good way to create synthetic
Starting point is 00:02:21 data for your specific need. No research exists in a vacuum, and most of it fills some kind of a gap. So tell us what's already been done in this field and how this work is building on it. So previously, actually, we have seen that in post-training, the more data you have, the better the performance of the model you're training. So what we wanted to test is how much we can scale and what happens if we scale more and more. But we didn't have the tools for it.
Starting point is 00:02:52 So the other approach people previously used was, you had a small set of data, and how do we expand this dataset into a much larger and larger amount of data? That's where people were mostly focusing. But it's not that easy to create those initial seeds; you need to be very expert. The way that we're doing it is rather that you define what you want to create. Like, okay, you want to create tool-use data, so you say, okay, I have a bunch of tools and I am looking for data in these scenarios, where someone can just come give me a description and then maybe that person interacts with the AI
Starting point is 00:03:25 to figure out how to get the job done. It's not a one-step thing. And maybe you also have a setting where it's more like an app developer: you have a bunch of APIs in your phone, and you just want to figure out which one is best for the user request, which came through a voice command. So different scenarios could be there.
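To make that idea of a scenario definition a bit more concrete, here is a rough, hypothetical example of what such a tool-use specification could look like. The field names and structure are illustrative assumptions, not the format used in the paper.

```python
# Hypothetical specification for tool-use data generation.
# Field names and structure are illustrative only, not the paper's actual format.
tool_use_spec = {
    "skill": "tool use",
    "scenarios": [
        {
            "name": "multi-turn assistance",
            "description": "A user describes a goal in natural language and "
                           "interacts with the AI over several turns to get the job done.",
        },
        {
            "name": "API routing from voice commands",
            "description": "Given the APIs available on a phone, pick the one that "
                           "best serves a request that arrived as a voice command.",
        },
    ],
    # The concrete tool/API list is derived automatically from raw code files,
    # as described in the transformation step later in the conversation.
    "raw_sources": ["permissibly licensed code files", "text documents"],
}
```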
Starting point is 00:03:39 So what we're saying is, okay, we are not going through the method where you have to come up with your own initial seed data and then we expand it. It's more like you define what you want to do. It's much more abstract. And then we are sort of automating the effort of data creation. So this setting of synthetic data creation is what we are referring to as generative teaching. And that's where we're sort of differing. Previously it was more like expansion, and now we are going from specification to the data that you need. Gotcha. Well, talk a little bit more about your methodology and how you went about conducting
Starting point is 00:04:11 this research. So first of all, what we are proposing is actually a multi-agent solution. You start by first describing what you really need. So you describe in detail, like, I need data for this specific skill or this specific scenario. Then what we do is, okay, you have some unstructured data or raw data, like text documents or code files, that you gather from the web with a permissible license, or you use something that you own. We don't care much about what the content really is. So it's more like we got some random stuff, some random content, and then we'll guide you on how to convert this random something, which is not meaningful to you,
Starting point is 00:04:50 into something which is meaningful for your data creation. For example, if you are creating data to teach how to use APIs, you might think you need lots of APIs, and how do you get these APIs? So what we are saying is, we can take something like code, and we'll have agents which will convert these raw code files into a list of APIs, which is more like a library. So you create this input automatically, and it is very meaningful for the data creation. And then once we have that, we have basically the seed instruction creation step, based on your specification of what you want to create data for. So you have all these different scenarios, and we have multiple agents creating data for the different scenarios.
Starting point is 00:05:26 And then the last step is actually what we call the refinement step. So it's more like, whatever data you created, we'll go through it and we'll make it better and better: improve the quality, improve the complexity, improve the trickiness, we'll teach when not to answer, et cetera, et cetera. So we'll make sure we cover the whole space.
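To make the flow Mitra is describing a little more concrete, here is a minimal, hypothetical sketch of the three stages (content transformation, seed instruction creation, refinement) in Python. The `call_llm` helper and the function boundaries are assumptions for illustration, not the actual AgentInstruct implementation.

```python
# A minimal, hypothetical sketch of a three-stage AgentInstruct-style flow:
# (1) transform raw content, (2) create seed instructions, (3) refine them.
# None of these helpers come from the paper's code; they are illustrative only.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to whatever LLM backs each agent."""
    raise NotImplementedError("plug in your model client here")

def transform_content(raw_document: str) -> str:
    # Stage 1: content transformation agents turn raw text/code into an
    # intermediate form useful for the target skill (e.g., code -> API list).
    return call_llm(f"Extract a library-style list of APIs from:\n{raw_document}")

def create_seed_instructions(transformed: str, scenario: str) -> list[str]:
    # Stage 2: seed instruction creation agents generate tasks for each
    # scenario named in the specification.
    response = call_llm(
        f"Given this material:\n{transformed}\n"
        f"Write diverse instructions for the scenario: {scenario}"
    )
    return response.splitlines()

def refine(instruction: str) -> str:
    # Stage 3: refinement agents iteratively improve quality, complexity,
    # and trickiness (including variants that should not be answered).
    suggestion = call_llm(f"Suggest how to make this task harder or trickier:\n{instruction}")
    return call_llm(f"Rewrite the task applying this suggestion:\n{suggestion}\n\nTask:\n{instruction}")

def synthetic_data_flow(raw_documents: list[str], scenarios: list[str]) -> list[str]:
    data = []
    for doc in raw_documents:
        transformed = transform_content(doc)
        for scenario in scenarios:
            for seed in create_seed_instructions(transformed, scenario):
                data.append(refine(seed))
    return data
```

In the paper's setup, each of the 17 skills gets its own multi-agent flow, with multiple specialized agents behind each stage; the sketch compresses that into single calls purely for readability.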
Starting point is 00:05:43 So by changing the stochastic seed, we're trying to cover the entire possible data space. So that's the key thing. The way we sort of conducted this research is, we defined 17 skills, skills meaning reading comprehension, tool use, text modification, content creation, RAG, conversation, and so on. We have a list of 17 skills, and then we created one multi-agent flow for each of the skills and we generated data. So one key thing I want to highlight is that this work, compared to other work, was not benchmark driven.
Starting point is 00:06:16 We want to teach a skill. We don't care which benchmark we are going to evaluate it on. So we define the skill like tool use means this to us, reading comprehension means this to us, text modification means this to us, text modification means this to us and then we sort of generate the data to teach everything for that skill. And then what we did, we created 22 million instructions and we had previously in Orca series, we had 3 million around instructions. So the 25 million is what we sort of have at the end and that's where we actually trained
Starting point is 00:06:43 a minstrel model as of now. And we went to measure like how much we improved the menstrual model by this post-training. Moving from methods to findings, I always look forward to the part of the research paper that finishes the sentence. And what we found was, so give us a quick overview of your results.
Starting point is 00:07:01 What did you find? Yes. So the results were actually very exciting for us. So Mistral 7b was our main sort of baseline because that's where we're trying to showcase like how much improvement we're getting. On the other side, we have like frontiers models, chat GPT, GPT-4. We want to also measure how far we are from those frontier models. So that's sort of our evaluation setup. So on average, actually we got like 20% performance gain over the Mistral and we evaluated that across 14 benchmarks that test reasoning,
Starting point is 00:07:31 content creation, instruction following, format following, et cetera. But what was more important to us was to do a skill specific evaluation because we're trying to teach certain skills and we had like 17 skills, as you mentioned earlier. So for example, like if you're focusing on reading comprehension as a skill, we took LSAT, SAT and DROP and many other benchmarks. So we created a collection of reading comprehension specific benchmarks and there we are observing like 20% improvement over Mistral and what it means like we're actually achieving GPT-4 level performance. Similarly if I'm focusing on math skill, there are many data sets which test like elementary math,
Starting point is 00:08:07 high school math, college level math. And we improved actually across all these different levels of math. So we see from 40% to 150% of improvement on different benchmarks of math. So it was more like what we wanted to see. We're not optimizing for a particular benchmark. We wanted to optimize the skill and that's what you're observing. So you're observing improvement in math across all these levels from elementary to high school to college to middle school, etc. Everything. The same goes for RAG as well. We are observing on RAG skill 92%
Starting point is 00:08:41 around improvement over Mistral. The format following numbers are pretty interesting to us. So format following is very important for SLMs. You want to make these models practical. You want to make sure that you follow the format so you can pass the result. And we were able to take Mistral beyond Gemini Pro. So that was a very strong performance from the post training that we did. For summarization actually we were able to reduce the hallucination rate by 31% while acting on the GPT-4 level quality. So overall, all these results were sort of highlighting that the methodology that we have, which we're calling
Starting point is 00:09:14 AgentInstruct, is very promising. I think it's important to get practical and talk about real-world impact. So tell us who you think this research will benefit most and why. Yeah. So again, the model builders will sort of find it most beneficial. The significance of our work actually lies in the way we are trying to revolutionize language model development through scalable, low-effort synthetic data creation. And the scalable and low-effort part is sort of the key thing. We have shown that we can create very high-quality data. That's what the numbers are telling us. We want to mention that this is very scalable and low-effort, and that's what we think might help model builders the most.
Starting point is 00:10:02 So, Arindam, let's borrow a phrase from the machine learning lexicon and go for a little one-shot learning here. If you had to boil down why your work is important, what's the one thing you want our listeners to take away from this research? The key takeaway would be that the AgentInstruct method enables the generation of vast, diverse, and high-quality synthetic data with very minimal human input. So that's the one thing I would like you to remember from this paper. So as we close, talk briefly about the limitations that you encountered in this project and directions for future research. What are the outstanding challenges in this field? And what's on your research agenda to overcome them?
Starting point is 00:10:51 automated and less human involvement needed, we're trying to focus on two other aspects. One is automated model debugging and another is automated model repairing. So now that we have the ability to generate data for a particular skill, let's say math, for model debugging what we need is basically a error handler. Like if something we can plug in which takes the question and the answer coming from a defined model and verifies if the answer is correct or not. So that's the part we are working on right now, figuring out this error handler. And the second aspect is repairing. So once we have working on right now, figuring out this error handler. And second aspect is repairing.
Starting point is 00:11:27 And the second aspect is repairing. So once we have the errors, we've figured out, okay, this is where the model is struggling. How can we give feedback, or how can we get more knowledge, so it can basically correct those errors? So those are the things we're working on right now. Well, Arindam Mitra, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find a preprint on arXiv. See you next time on Abstracts.
