Microsoft Research Podcast - Abstracts: January 25, 2024

Episode Date: January 25, 2024

On “Abstracts,” Jordan Ash & Dipendra Misra discuss the parameter reduction method LASER. Tune in to learn how selective removal of stored data alone can boost LLM performance, then sign up for Microsoft Research Forum for more on LASER & related topics.

Learn more:
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction | Publication, December 2023
LASER code on GitHub

Transcript
[00:00:00] Welcome to Abstracts, a Microsoft Research podcast that puts the spotlight on world-class research in brief. I'm Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot or a podcast abstract of their new and noteworthy papers.
[00:00:24] Today, I'm talking to Dr. Dipendra Misra and Dr. Jordan Ash, both senior researchers at Microsoft Research. Drs. Misra and Ash are co-authors of a paper called "The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction," also known as LASER. This paper's been accepted at the International Conference on Learning Representations, or ICLR, in Vienna this year, and you can read a preprint of it now on arXiv. Dipendra, Jordan, thanks for joining us on Abstracts.
[00:00:54] Thanks for having us. Yeah, thanks for having us. Dipendra, let's start with a general overview of this paper. In a few sentences, describe the issue or problem your work addresses, and perhaps more importantly, why we should care about it. Thanks, Gretchen. So as we know, large language models, also known as LLMs, have revolutionized both business and research in artificial intelligence. They are being used everywhere to solve a wide range of problems. So in our paper, we introduce an intervention which can be applied to any existing pre-trained large language model.
[00:01:27] And our main purpose for introducing this is to see how it affects the performance of the LLM and whether we can gain insight into how an LLM stores information in its parameters and how it uses that information to generate a response. And what our intervention does is that it performs a low-rank approximation of the parameters of the LLM. And the surprising discovery that our paper makes is that if we do this intervention correctly, then we can get significant improvement on various tasks for different LLMs. So that's the first part of the question. Tell me why I should care about it. So if you are a person who uses LLMs for solving any task, then you do care about performance on a given task.
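A minimal sketch of the low-rank approximation being described here, assuming a single PyTorch weight matrix; the function name and sizes below are illustrative, and the choice of which matrices to reduce, and by how much, is exactly what the paper studies empirically.

    import torch

    def low_rank_approx(weight: torch.Tensor, rank_fraction: float) -> torch.Tensor:
        # Best rank-k approximation of `weight` via truncated SVD,
        # keeping only a fraction of its singular values.
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        k = max(1, int(rank_fraction * S.shape[0]))
        return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

    # Illustrative example: keep 10 percent of the rank of a random matrix.
    W = torch.randn(1024, 1024)
    W_reduced = low_rank_approx(W, rank_fraction=0.1)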
[00:02:12] So, for example, you could be using LLMs to generate an email, right, from a given description. Or you could be using an LLM to do question answering. And by applying our intervention, we can gain accuracy on the tasks that we care about. Well, let's stick with you, Dipendra, for a minute and talk about the field writ large. Almost all research owes a debt to some other research that went before. So tell us a bit about the related work in this field and how your work builds on or adds to it.
[00:02:41] So the work that is most closely related to our LASER paper is this growing body of work on understanding how knowledge is stored and edited inside a large language model. These works don't apply the intervention that we do, but they were certainly inspirational for us in arriving at the intervention that we introduced. Another line of work which is very related is adding a small number of parameters to improve the performance of the LLM on a given task. The most relevant work in this space is the LoRA paper, also known as low-rank adaptation of large language models, which came from Microsoft. And what LoRA does is it adds a small number of additional parameters to an LLM and then
[00:03:24] fine-tunes it on a given task. And what our intervention, called LASER, does is that it removes parameters instead of adding them. And another line of work which is also related is the work on model compression. So there are people who focus on bringing down the size of the model as much as possible while still retaining performance more or less compared to the base model. And so these people are also focused on removing parameters, but they're coming at it from a different angle of trying to reduce the memory footprint, while we are less focused on the memory footprint. That's more like a side effect for us. We're more interested in questions like, if I were to fiddle with this parameter of the LLM, then how does it affect the performance? And what can we learn by looking at the comparison? Like, okay, so if I remove this parameter,
[00:04:09] I see the performance drop, then it means that these parameters are storing something about the type of task on which the performance dropped. So I'll ask you one more question, Dipendra, before I pull Jordan into the conversation. And that would be about your methodology. How would you describe your approach to this project, and how did you conduct the research? So we started by analyzing the LASER intervention on a particular LLM called GPT-J and evaluating its performance on a question answering dataset called CounterFact. So our idea was, before trying this thing on a bunch of settings, let's just understand it in one setting deeply and build insights that we can then evaluate in other settings. And the reason we chose this setup was that the GPT-J large language model has its
[00:04:55] training data publicly available. It's called the Pile dataset. And that allows us to do analysis with the training data. For example, is the performance dropping on data points which are rarer or more frequent in the training data? And this is important because training data analysis is frequently omitted in the existing LLM literature. And that's something we wanted to do. And the second reason is that the CounterFact question answering dataset is related to the prior work in this space, so there was a reason for choosing it, but it also has paraphrases of the same question. For example, it might ask, who is the president of the United States of America?
[00:05:32] But it will also have paraphrases like, the president of the United States of America is, or the head of the government of the United States of America is. And so it will have different variations of the same question, and then you can see if the LLM is able to get all of them right, or whether it is not robust to variations of the same question.
[00:05:46] And so we did analysis on GPT-J and the CounterFact dataset, and Jordan will talk more about what the results were. And based on this rigorous analysis, we developed some insights as to what the intervention is doing. And then we evaluated these insights in other settings. So we tried two other different large language models and evaluated on multiple different datasets. And then we saw that the insights actually
[00:06:15] hold more broadly. And finally, we also evaluated this on a non-text-related task, right? Because the intervention could in principle be applied to any neural network. So we went after this reinforcement learning model, which solves a puzzle called Sokoban. And we also saw that if you apply this intervention correctly, then you can get some performance improvement. So it's not related to just large language models, although that was our main motivation. Well, Jordan, let's get your take on the last few questions here. As I've said before, the most interesting section of a research paper for me is the part where it says, "and what we found was …" So as a result of this research, what did you find? Were there outcomes that you expected, or were there any surprises? I would say this paper is full of surprises. So as Dipendra was mentioning earlier,
[00:07:07] the LASER intervention removes information from a model. It doesn't add information to a model. And up until now, there's been a lot of work on pruning model parameters for a variety of reasons. But generally, these papers show that as parameters are removed from the model, performance just does not degrade. You can overall keep performance roughly the same, even with a fairly drastic reduction in model parameters. And those reductions are typically done across layers of the model. What we're showing here is surprising because we're showing that if we do a very targeted intervention, maybe at only one layer of the model, we can actually get a big boost in performance rather than just keep it the same or something like this.
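A rough sketch of the kind of targeted, single-layer intervention Jordan describes: one weight matrix in one block is replaced by its low-rank approximation, and every other parameter is left untouched. The commented-out module path is hypothetical and depends on the model implementation; the actual procedure is in the LASER code on GitHub.

    import torch

    @torch.no_grad()
    def reduce_rank_in_place(linear: torch.nn.Linear, rank_fraction: float) -> None:
        # Overwrite one linear layer's weight with its low-rank approximation,
        # leaving the rest of the model untouched.
        W = linear.weight.data
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        k = max(1, int(rank_fraction * S.shape[0]))
        linear.weight.data = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

    # Hypothetical usage on one MLP projection of one transformer block:
    # reduce_rank_in_place(model.transformer.h[27].mlp.fc_in, rank_fraction=0.05)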
[00:07:49] Hmm. So with those results in mind, Jordan, I'm curious about practical applications. How would you say this research makes an impact in real-world situations? I know that Dipendra alluded to that earlier, but where is this most useful, and who benefits most? I think the short sales pitch for this technique is that you could potentially improve the performance of a language model with no additional training at all, just by applying this intervention, which, again, just removes information from the model. So you don't need to have any extra data on hand to refine the model or to add new
[00:08:26] information into it. The real-world situations where we're seeing a boost right now with LASER are question answering or reasoning-type tasks, where there's a concrete answer that corresponds to what you're asking the LLM, rather than just a broad-purpose generative task. So typically speaking, when you're dealing with LLMs, part of the issue is prompt engineering. And it's like my responsibility to be able to put the right words in it so I'll get the best answer from the model, right? Are you saying that this helps me not have to be that good on the prompt engineering end versus what the model can interpret and do? I think prompt engineering still has a place in sort of eking out a good answer from a language
[00:09:12] model. But given a fixed prompt, this intervention seems to offer improved accuracy over not intervening at all and applying the same prompt. So Jordan, I often think of an abstract as a sort of appetizer for a research paper, but let's distill it even further. If there was one thing, sort of an amuse-bouche, if you will, that you want our listeners to take away from this work, what would it be? For me, I like this idea of how, you know, typically if you want to get a model to perform better, you would take that model off the shelf and you would refine it on data related to the task at hand. And that might take the form of refining all the parameters or doing some low-rank, LoRA-type thing that Dipendra alluded to earlier. Here, we counterintuitively show that sometimes just carefully removing
[00:10:05] information from the model can have a positive effect as well. And this is great news because refining a model requires a lot of new target-domain data to be available, but removing information from the model doesn't necessarily have that same constraint. Well, finally, let's talk a little bit about the future, Jordan, and I'll have you close the show for us. What unanswered questions or ongoing research challenges do you see here? And what's next, maybe, on your research agenda? Yeah, I think there's a lot of exciting future work for this project. I think for one, as a practical matter, there's this question of just what's the best way to find the best LASER intervention. LASER targets a specific layer of the model and then finds the extent by which it should be rank-reduced.
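A sketch of what an exhaustive search over single-matrix interventions might look like, assuming a PyTorch model, a set of candidate linear layers, and a user-supplied evaluate callback that scores the model on held-out data (all placeholders here, not the paper's actual search code); the cost Jordan mentions next comes from repeating this evaluation for every layer and rank fraction.

    import torch

    @torch.no_grad()
    def best_single_intervention(model, candidate_linears, rank_fractions, evaluate):
        # candidate_linears: dict mapping a name to a torch.nn.Linear inside `model`.
        # evaluate: callable returning a validation score for the current model (higher is better).
        best = (None, None, evaluate(model))  # (layer name, fraction, score) with no intervention
        for name, linear in candidate_linears.items():
            original = linear.weight.data.clone()
            for frac in rank_fractions:
                U, S, Vh = torch.linalg.svd(original, full_matrices=False)
                k = max(1, int(frac * S.shape[0]))
                linear.weight.data = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
                score = evaluate(model)
                if score > best[2]:
                    best = (name, frac, score)
            linear.weight.data = original  # restore this layer before trying the next one
        # The model is left unchanged; the caller can re-apply the best (name, fraction) found.
        return best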
[00:10:56] That search procedure is kind of expensive. Right now, we're doing it in a sort of exhaustive way. But also, it seems to be beneficial to apply LASER at multiple layers of the model, and that makes the search procedure sort of combinatorially explode. So finding out the best way to compose these interventions, I think, is an important area of future research. And then, just sort of less on the practical side, I think there are all these questions related to just why does this work at all? Like, why is it helpful to remove information from the model? And, you know, I think there are some rough ideas we have about this.
[00:11:39] For example, when you're training a model on lots and lots of data, you know, it's not all created equally. Some of it might be noisy or low quality, and some of it might be high quality. And maybe it's better to remove those noisy samples at training time to get a better model. So I guess there's this question of, is pruning the model using a LASER-type intervention roughly equivalent to pruning the training data in a way that makes it more favorable for eliciting a high-quality model? And again, like Dipendra alluded to earlier, there's this LoRA procedure, which does something that very much complements LASER and is often used to add information to a model. Is it possible that LoRA is actually not just adding information, but also removing
[00:12:17] information from the model? And perhaps that's one reason why LASER seems to be so effective. So lots of questions. I would say so, yeah. Yeah. Well, Dipendra Misra, Jordan Ash, thanks for joining us today. And to our listeners, thanks for tuning in. Again, you can find a link to this paper at aka.ms/abstracts or on arXiv. And I'll also add that Dipendra will be speaking about this work at the upcoming Microsoft Research Forum,
[00:12:46] and you can register for this series of events at researchforum.microsoft.com. See you next time on Abstracts.
