Microsoft Research Podcast - Abstracts: December 11, 2023
Episode Date: December 11, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Principal Researcher Alessandro Sordoni joins host Gretchen Huizinga to discuss “Joint Prompt Optimization of Stacked LLMs using Variational Inference.” In the paper, which was accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), Sordoni and his coauthors introduce Deep Language Networks, or DLNs, an architecture that treats large language models as layers within a network and natural language prompts as each layer’s learnable parameters.

Read the paper
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Dr. Gretchen Huizinga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast
abstract of their new and noteworthy papers.
Today, I'm talking to Dr. Alessandro Sordoni, a principal researcher from Microsoft Research.
Dr. Sordoni is co-author of a paper titled Joint Prompt Optimization of Stacked LLMs
Using Variational Inference, and this paper, which was accepted for the 2023 Conference
on Neural Information Processing Systems, or NeurIPS, is available now on arXiv.
Alessandro, thanks for joining us on Abstracts.
Hi, Gretchen. Thank you for having me.
So in a few sentences, tell us about the issue or problem that your research addresses and why we should care about it.
So in this paper, our starting point is large language models. And to make large language models solve tasks, one of the ways that is currently used is to prompt them. Prompting them just means giving instructions to them. And hopefully, by joining the instruction and the input of the task, the language model can solve the task, following the rules specified in the instruction. And there have been some approaches already in the literature to actually infer what that instruction is without human intervention. And in this paper, we operate in that space, which is called automatic prompt engineering. And our specific problem is, one, how to actually infer those prompts for a language model. And two, what happens if the output of that language model gets fed into another language model, and both language models need prompts to operate? And so basically, we give an algorithm to solve that joint prompt optimization. That's why it's called joint.
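To make that setup concrete, here is a minimal sketch of two stacked LLM calls, each modulated by its own prompt; the `call_llm` helper is a hypothetical placeholder for any text-completion API, not the interface used in the paper.

```python
# Minimal sketch: two stacked LLM calls, each modulated by its own prompt.
# `call_llm` is a hypothetical placeholder for any text-completion API.

def call_llm(prompt: str, text: str) -> str:
    """Return the model's output for the given instruction and input text."""
    raise NotImplementedError("plug in an actual LLM client here")

def two_layer_network(x: str, prompt_1: str, prompt_2: str) -> str:
    # Layer 1: the first prompt shapes how the input is transformed.
    hidden_text = call_llm(prompt_1, x)
    # Layer 2: the second prompt shapes how that intermediate text
    # becomes the final answer.
    return call_llm(prompt_2, hidden_text)

# Joint prompt optimization means searching over (prompt_1, prompt_2) together
# so that two_layer_network(x, prompt_1, prompt_2) solves the task well.
```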
So what's the underlying issue there that we should care about as potential users of this technology?
There are some problems that cannot be solved by just one instruction or rule, I would say, but they necessitate some sort of higher-level reasoning or some sort of decomposition. And in that sense, it would maybe be useful to actually have multiple calls to the LLM, where each call is modulated by a different instruction. So the first instruction could be something very general, for example, decompose or visualize the problem in a different language than the one it is formulated in. And the second instruction could be: now recompose this visualization that you have produced to solve the problem itself. And so basically, in that context, you can think about this as kind of augmenting the computational power of the language model by splitting the one goal into multiple goals. Well, going a little deeper on the work that this builds on, all research kind of gets a prompt, no pun intended, from previous work. So how does your work build on and/or differ from what's been done previously in this field?
I would say that our work started more with this intuition that LLMs are just kind of black box computation units. Now, this sort of black box accepts language as input, the computation is modulated by an instruction, and the output is language. So you can stack these layers, right? So if the weights of these language layers are now the instructions, and you can stack them together, how can you optimize them?
Right.
And then we started to think, OK, but this is very related to automatic prompt optimization. The overall prompt engineering and prompt optimization approaches right now work by proposing some prompts and accepting some prompts. So we made some modifications with respect to how we propose new prompts to the language model and how we evaluate and accept those that work given some task inputs and outputs. Our goal in the future, I would say in the near future, is going to be to basically integrate optimization for systems that can really express arbitrary graphs of LLM calls. But in our paper, we started with the first step, which is, okay, say that I just have two calls. Can I just optimize prompts for that very simple graph? And we proposed an algorithm to do so. So basically, I guess our main contribution is, one, getting a better prompt optimizer for one layer, and two, devising an algorithm that works for two layers right now, and that can be extended to multiple layers. But that's also an engineering problem that needs to be solved.
Yeah, we've got to get the engineering in there.
Well, listen, let's keep going on this because it sounds like you're talking about methodology and how you conducted this research.
So expand a little bit on what you did actually to experiment in this arena.
Yeah. So I think that really the birth of this paper started from this kind of view of these language models as layers modulated by instructions that can be stacked upon each other.
From there, we said, okay,
what can we do with this basically?
Some of us worked on datasets that could be interesting for
this new methodology, I would say, or architecture.
Basically, one question was,
how do you go forward to actually test if this works
in any way? And so we tried to select some datasets that were more natural language tasks, for example, sentiment classification, and some datasets that were more about reasoning tasks.
And our hunch was that basically stacking multiple layers together would help more in those tasks that would require some sort of decomposition or reasoning.
And for the reasoning tasks, we worked with this BIG-Bench Hard setting. And so, parallel to that, some of us, myself for example, worked on the optimization part, really on the algorithm part. And at first, we tried to do some sort of backpropagation. But I quickly saw that there were some issues with that, mostly empirical issues. And so we tried to actually get a more formal understanding of this optimization algorithm by resorting to variational inference, basically. So basically, we understand the first layer as producing some text and consider this text as a latent variable. When you open that box, it links in your head to a whole bunch of related works in the literature that have studied this problem very, very thoroughly. And so you can use those techniques in this context.
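For readers who want the latent-variable view spelled out, here is the generic variational lower bound for such a two-layer setup, in notation introduced here (x is the task input, y the target output, h the intermediate text from the first layer, p1 and p2 the two prompts, and q a proposal distribution over h); the paper's exact objective may differ in its details.

```latex
\log p(y \mid x, p_1, p_2)
  = \log \sum_{h} p(y \mid h, p_2)\, p(h \mid x, p_1)
  \geq \mathbb{E}_{h \sim q(h)}\big[\log p(y \mid h, p_2) + \log p(h \mid x, p_1) - \log q(h)\big]
```

Maximizing the right-hand side over the prompts, with the first layer's generated text playing the role of the latent variable h, is what connects this setting to the variational inference literature.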
Interesting. So what were the results of this research? What did you find?
So what we found was that indeed the tasks in which this approach seemed to help the most
are the tasks that require this sort of decomposition
and reasoning. The first thing that was really, really cool was that you can go a long way in improving the performance of these language models by accurate prompt optimization. Because in some cases, prompt optimization can be understood as really tweaking the model towards solving the task. But in some other tasks, actually, when humans write prompts, they tend to maybe underspecify the prompt or tend to not be very clear about how to instruct the model. So the model has to do a lot of work to understand what the human really wants to say to it. And so basically, this sort of prompt optimization acts as a sort of translator, where it formulates a prompt that more comprehensively describes the task and more comprehensively contains some rules to solve the task. So it was very interesting to me, that level of abstraction that was required in the prompt to really solve these tasks very, very well. The other finding is that this problem is very hard. It's very tricky to optimize prompts with this type of optimization, because it doesn't really follow a gradient direction, like in deep neural networks. It's basically a sort of trial and error. And this trial and error is very finicky. There are a lot of problems there. But I feel like I'm hopeful, in the sense that this paper allowed us, I think, to home in on some very specific problems that, if we solve them, can make the overall problem much easier.
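As an illustration of that trial-and-error flavor, here is a minimal sketch of a propose-and-accept search loop of the kind described above; the `propose` and `score` functions are hypothetical placeholders, and this is not the paper's actual algorithm.

```python
from typing import Callable, List, Tuple

def propose_and_accept(
    initial_prompt: str,
    propose: Callable[[str], List[str]],                   # generate candidate rewrites of a prompt
    score: Callable[[str, List[Tuple[str, str]]], float],  # how well a prompt solves (input, output) pairs
    train_pairs: List[Tuple[str, str]],
    num_rounds: int = 10,
) -> str:
    best_prompt = initial_prompt
    best_score = score(best_prompt, train_pairs)
    for _ in range(num_rounds):
        # No gradient direction here: just propose candidate prompts...
        for candidate in propose(best_prompt):
            # ...and accept whichever one scores best on the task data.
            candidate_score = score(candidate, train_pairs)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt
```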
Let's talk for a second about real-world impact of this research.
Let's extrapolate out from the lab
and move into life. Who benefits from this most and how do they benefit?
I think that, as I said before, like these automatic prompt optimization methods could
benefit, I think, a large audience or a large number of users, I would say,
because they could be understood as a sort of translator
between the user needs and what the LLM can do.
For example, one effort here in Montreal that was led by my colleagues
was kind of building this sort of interactive agent
that would, through interaction with the user,
form a prompt, but interactively.
So for example, in DLN, like in our paper,
we assume that we have a task
and we do not have input or interaction with the user, right?
But in more realistic scenarios,
you might want to actually instruct your model to
do something by some sort of active learning process where the model actually asks you whether what it did was favorable or desirable or not, and the user can actually interact with that
output, right? For the multilayer case, my hope is that that would be useful to build and optimize these large sort of graphs of LLM calls.
I want to take a second here to spell out some acronyms.
You've referred to DLNs, and our audience might not know what that means.
I'm assuming they know LLM means large language model.
That's sort of in the parlance.
But talk a little bit about what that other acronym is.
Yeah, sorry.
I didn't mention this.
So DLN is basically how we refer to these architectures that are composed of language model layers. DLN is spelled out as deep language network. People are free to use this name or not. I'm not a big fan of imposing acronyms on the world, but that's a shorter version of it. So, yeah, it's really the idea that a language model is a layer in this hierarchy, and the layer accepts a text as input, outputs a text, and really is modulated by an instruction, or prompt, that we want to learn.
And so the DLN is a deep language network,
and it sort of acts as a deep neural network,
but using language models as your layer.
Exactly.
Okay.
Yes.
So this is a question I ask everyone,
and it's sort of like,
how could you boil this down to one little takeaway if you're standing on an elevator with somebody and they say, what do you do, Alessandro?
So if there's one thing I'd want people to take away, it's that these language models can be understood really as a class, I would say, of probability distributions, and that they are modulated by these prompts. And so basically, once you have that, once a language model just defines a p over sentences given some prompt, you can apply a lot of algorithms with those models. You can apply algorithms that resemble EM, expectation maximization, or, I mean, we apply a form of that with variational inference, but maybe it could open the path for other types of usages where these are just very, very powerful probability distributions over sentences that are considered as latent variables.
I hope that our paper can show a more or less practical kind
of implementation of that idea, and that basically,
if you have to optimize, for example,
prompts with one or two layers, you can definitely
try our approach.
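To make "a language model defines a p over sentences given some prompt" concrete, here is a minimal sketch of how an autoregressive model scores a whole sentence by summing per-token log-probabilities; `token_logprob` is a hypothetical helper standing in for whatever API exposes token-level scores.

```python
from typing import List

def token_logprob(context: str, next_token: str) -> float:
    """Hypothetical helper: log p(next_token | context) from a language model."""
    raise NotImplementedError("plug in an API that exposes token-level log-probabilities")

def sentence_logprob(prompt: str, tokens: List[str]) -> float:
    # An autoregressive LM factorizes p(sentence | prompt) as a product of
    # per-token conditionals, so the log-probability is a sum over tokens.
    total, context = 0.0, prompt
    for token in tokens:
        total += token_logprob(context, token)
        context += token
    return total

# With log p(sentence | prompt) available, EM-style or variational algorithms
# can treat generated sentences as latent variables and reweight or resample them.
```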
Well, finally, and we've been talking about this kind of already, but there seem to be some
unresolved problems in the area. What do researchers like you need to be looking at
in order to solve those? Sort of what's next on the research agenda, whether it's you
or other researchers in this field? So let me try to answer with something that really excites me now.
What we are doing is that we are producing text, right, with the language model.
But we are producing this text in such a way that it helps to solve a problem.
And basically, this variational inference method and framework gives us a way of understanding what it means to be a good text. Like, what does it mean to be a good latent variable, or a useful latent variable?
Right.
What does it mean to produce good data?
So for example, these big models kind of are really data creators,
like generative AI, right?
But can we actually teach them to produce data
such that this data can be helpful to solve tasks or to condition those same models to solve a task?
Right.
And what are the objective functions that promote the production of this useful data?
What does useful mean from a mathematical perspective? I think that, apart from the prompt optimization angle, I feel like DLN, to me, kind of opened my mind a little bit to investigating ways of understanding what it means for some generated text to be useful to solve a task, I would say.
Alessandro Sordoni, thanks for joining us today
and thanks to our listeners for tuning in.
If you're interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts, or you can find it on arXiv. See you next time on Abstracts. Thank you.