Microsoft Research Podcast - Abstracts: December 11, 2023
Episode Date: December 11, 2023

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Principal Researcher Alessandro Sordoni joins host Gretchen Huizinga to discuss “Joint Prompt Optimization of Stacked LLMs using Variational Inference.” In the paper, which was accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), Sordoni and his coauthors introduce Deep Language Networks, or DLNs, an architecture that treats large language models as layers within a network and natural language prompts as each layer’s learnable parameters.

Read the paper
Transcript
Welcome to Abstracts,
a Microsoft Research podcast that puts
the spotlight on world-class research in brief.
I'm Dr. Gretchen Huizinga.
In this series,
members of the research community at Microsoft give us
a quick snapshot or a podcast
abstract of their new and noteworthy papers.
Today, I'm talking to Dr. Alessandro Sordoni, a principal researcher from Microsoft Research.
Dr. Sordoni is co-author of a paper titled Joint Prompt Optimization of Stacked LLMs
Using Variational Inference, and this paper, which was accepted for the 2023 Conference
on Neural Information Processing Systems, or NeurIPS, is available now on arXiv.
Alessandro, thanks for joining us on Abstracts.
Hi, Gretchen. Thank you for having me.
So in a few sentences, tell us about the issue or problem that your research addresses and why we should care about it.
So in this paper, our starting point is large language models. And to make large language models solve tasks, one of the ways that is currently used is to prompt them. Prompting them just means giving instructions to them. And hopefully, by joining the instruction and the input of the task, the language model can solve the task, following the rules specified in the instruction. And there have been some approaches already in the literature to actually infer what that instruction is without human intervention. And in this paper, we operate in that space, which is called automatic prompt engineering. And our specific problem is, one, how to actually infer those prompts for a language model. And two, what happens if the output of that language model gets fed into another language model, and both language models need prompts to operate? And so basically, we give an algorithm to solve that joint prompt optimization. That's why it's called joint.
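To make that setup concrete, here is a minimal sketch of two stacked LLM calls, each modulated by its own prompt; the `call_llm` helper is a hypothetical placeholder for any text-completion API, not the interface used in the paper.

```python
# Minimal sketch: two stacked LLM calls, each modulated by its own prompt.
# `call_llm` is a hypothetical placeholder for any text-completion API.

def call_llm(prompt: str, text: str) -> str:
    """Return the model's output for the given instruction and input text."""
    raise NotImplementedError("plug in an actual LLM client here")

def two_layer_network(x: str, prompt_1: str, prompt_2: str) -> str:
    # Layer 1: the first prompt shapes how the input is transformed.
    hidden_text = call_llm(prompt_1, x)
    # Layer 2: the second prompt shapes how that intermediate text
    # becomes the final answer.
    return call_llm(prompt_2, hidden_text)

# Joint prompt optimization means searching over (prompt_1, prompt_2) together
# so that two_layer_network(x, prompt_1, prompt_2) solves the task well.
```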
So what's the underlying issue there that we should care about as potential users of this technology?
There are some problems that cannot be solved by just one instruction or rule, I would say, but they necessitate some sort of higher-level reasoning or some sort of decomposition. And in that sense, it would maybe be useful to actually have multiple calls to the LLM, where each call is modulated by a different instruction. So the first instruction could be something very general, for example, decompose or visualize the problem in a different language than the one it is formulated in. And the second instruction could be: now recompose this visualization that you have produced to solve the problem itself. And so basically, in that context, you can think about this as kind of augmenting the computational power of the language model by splitting the one goal into multiple goals. Well, going a little deeper on the work that this builds on, all research kind of gets a prompt, no pun intended, from previous work. So how does your work build on and/or differ from what's been done previously in this field?
I would say that our work started more with this intuition that LLMs are just kind of black box computation units. Now, this sort of black box accepts language as input, the computation is modulated by an instruction, and the output is language. So you can stack these layers, right? So if the weights of these language layers are now the instructions, and you can stack them together, how can you optimize them?
Right.
And then we started to think, OK, but this is very related to automatic prompt optimization. The overall prompt engineering and prompt optimization approaches right now work by proposing some prompts and accepting some prompts. So we made some modifications with respect to how we propose new prompts to the language model and how we evaluate and accept those that work given some task inputs and outputs. Our goal in the future, I would say in the near future, is going to be to basically integrate optimization for systems that can really express arbitrary graphs of LLM calls. But in our paper, we started with the first step, which is, okay, say that I just have two calls. Can I just optimize prompts for that very simple graph? And we proposed an algorithm to do so. So basically, I guess our main contribution is, one, getting a better prompt optimizer for one layer, and two, devising an algorithm that works for two layers right now, and that can be extended to multiple layers. But that's also an engineering problem that needs to be solved.
Yeah, we've got to get the engineering in there.
Well, listen, let's keep going on this because it sounds like you're talking about methodology and how you conducted this research.
So expand a little bit on what you did actually to experiment in this arena.
Yeah. So I think that really the birth of this paper started from this kind of view of these language models as layers modulated by instructions that can be stacked upon each other.
From there, we said, okay,
what can we do with this basically?
Some of us worked on datasets that could be interesting for
this new methodology, I would say, or architecture.
Basically, one question was,
how do you go forward to actually test if this works
in any way? And so we tried to select some datasets that were more natural language tasks, for example, sentiment classification, and some datasets that were more about reasoning tasks.
And our hunch was that basically stacking multiple layers together would help more in those tasks that would require some sort of decomposition or reasoning.
And for the reasoning tasks, we worked with this BIG-Bench Hard setting. And so, parallel to that, some of us, myself for example, worked on the optimization part, really on the algorithm part. And at first, we tried to do some sort of backpropagation. But I quickly saw that there were some issues with that, mostly empirical issues. And so we tried to actually get a more formal understanding of this optimization algorithm by resorting to variational inference, basically. So basically, we understand the first layer as producing some text and consider this text as a latent variable. When you open that box, it links in your head to a whole bunch of related works in the literature that have studied this problem very, very thoroughly. And so you can use those techniques in this context.
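For readers who want the latent-variable view spelled out, here is the generic variational lower bound for such a two-layer setup, in notation introduced here (x is the task input, y the target output, h the intermediate text from the first layer, p1 and p2 the two prompts, and q a proposal distribution over h); the paper's exact objective may differ in its details.

```latex
\log p(y \mid x, p_1, p_2)
  = \log \sum_{h} p(y \mid h, p_2)\, p(h \mid x, p_1)
  \geq \mathbb{E}_{h \sim q(h)}\big[\log p(y \mid h, p_2) + \log p(h \mid x, p_1) - \log q(h)\big]
```

Maximizing the right-hand side over the prompts, with the first layer's generated text playing the role of the latent variable h, is what connects this setting to the variational inference literature.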
Interesting. So what were the results of this research? What did you find?
So what we found was that indeed the tasks in which this approach seemed to help the most
are the tasks that require this sort of decomposition
and reasoning. The first thing that was really, really cool was that you can go a long way in improving the performance of these language models by accurate prompt optimization. Because in some cases, prompt optimization can be understood as really tweaking the model towards solving the task. But in some other tasks, actually, when humans write prompts, they tend to maybe underspecify the prompt or tend to not be very clear about how to instruct the model. So the model has to do a lot of work to understand what the human really wants to say to it. And so basically, this sort of prompt optimization acts as a sort of translator, where it formulates a prompt that more comprehensively describes the task and more comprehensively contains some rules to solve the task. So it was very interesting to me, that level of abstraction that was required in the prompt to really solve these tasks very, very well. The other finding is that this problem is very hard. It's very tricky to optimize prompts with this type of optimization, because it doesn't really follow a gradient direction, like in deep neural networks. It's basically a sort of trial and error. And this trial and error is very finicky. There are a lot of problems there. But I feel like I'm hopeful, in the sense that this paper allowed us, I think, to home in on some very specific problems that, if we solve them, can make the overall problem much easier.
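As an illustration of that trial-and-error flavor, here is a minimal sketch of a propose-and-accept search loop of the kind described above; the `propose` and `score` functions are hypothetical placeholders, and this is not the paper's actual algorithm.

```python
from typing import Callable, List, Tuple

def propose_and_accept(
    initial_prompt: str,
    propose: Callable[[str], List[str]],                   # generate candidate rewrites of a prompt
    score: Callable[[str, List[Tuple[str, str]]], float],  # how well a prompt solves (input, output) pairs
    train_pairs: List[Tuple[str, str]],
    num_rounds: int = 10,
) -> str:
    best_prompt = initial_prompt
    best_score = score(best_prompt, train_pairs)
    for _ in range(num_rounds):
        # No gradient direction here: just propose candidate prompts...
        for candidate in propose(best_prompt):
            # ...and accept whichever one scores best on the task data.
            candidate_score = score(candidate, train_pairs)
            if candidate_score > best_score:
                best_prompt, best_score = candidate, candidate_score
    return best_prompt
```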
Let's talk for a second about real-world impact of this research.
Let's extrapolate out from the lab
and move into life. Who benefits from this most and how do they benefit?
I think that, as I said before, like these automatic prompt optimization methods could
benefit, I think, a large audience or a large number of users, I would say,
because they could be understood as a sort of translator
between the user needs and what the LLM can do.
For example, one effort here in Montreal that was led by my colleagues
was kind of building this sort of interactive agent
that would, through interaction with the user,
form a prompt, but interactively.
So for example, in DLN, like in our paper,
we assume that we have a task
and we do not have input or interaction with the user, right?
But in more realistic scenarios,
you might want to actually instruct your model to
do something by some sort of active learning process where the model actually asks you whether what it did was favorable or desirable or not, and the user can actually interact with that
output, right? For the multilayer case, my hope is that that would be useful to build and optimize these large sort of graphs of LLM calls.
I want to take a second here to spell out some acronyms.
You've referred to DLNs, and our audience might not know what that means.
I'm assuming they know LLM means large language model.
That's sort of in the parlance.
But talk a little bit about what that other acronym is.
Yeah, sorry.
I didn't mention this.
So DLN is basically how we refer to these architectures that are composed of language model layers. DLN is spelled out as deep language network. People are free to use this name or not. I'm not a big fan of imposing acronyms on the world, but that's a shorter version of it. So, yeah, it's really the idea that a language model is a layer in this hierarchy, and the layer accepts a text as input, outputs a text, and really is modulated by an instruction, or prompt, that we want to learn.
And so the DLN is a deep language network,
and it sort of acts as a deep neural network,
but using language models as your layer.
Exactly.
Okay.
Yes.
So this is a question I ask everyone,
and it's sort of like,
how could you boil this down to one little takeaway if you're standing on an elevator with somebody and they say, what do you do, Alessandro?
So if there's one thing I'd want people to take away, it's that these language models can be understood really as a class, I would say, of probability distributions, and that they are modulated by these prompts. And so basically, once you have that, once a language model just defines a p over sentences given some prompt, you can apply a lot of algorithms with those models. You can apply algorithms that resemble EM, expectation maximization, or, I mean, we apply a form of that with variational inference, but maybe it could open the path for other types of usages where these are just very, very powerful probability distributions over sentences that are considered as latent variables.
I hope that our paper can show a more or less practical kind
of implementation of that idea, and that basically,
if you have to optimize, for example,
prompts with one or two layers, you can definitely
try our approach.
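To make "a language model defines a p over sentences given some prompt" concrete, here is a minimal sketch of how an autoregressive model scores a whole sentence by summing per-token log-probabilities; `token_logprob` is a hypothetical helper standing in for whatever API exposes token-level scores.

```python
from typing import List

def token_logprob(context: str, next_token: str) -> float:
    """Hypothetical helper: log p(next_token | context) from a language model."""
    raise NotImplementedError("plug in an API that exposes token-level log-probabilities")

def sentence_logprob(prompt: str, tokens: List[str]) -> float:
    # An autoregressive LM factorizes p(sentence | prompt) as a product of
    # per-token conditionals, so the log-probability is a sum over tokens.
    total, context = 0.0, prompt
    for token in tokens:
        total += token_logprob(context, token)
        context += token
    return total

# With log p(sentence | prompt) available, EM-style or variational algorithms
# can treat generated sentences as latent variables and reweight or resample them.
```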
Well, finally, and we've been talking about this kind of already, but there seem to be some
unresolved problems in the area. What do researchers like you need to be looking at
in order to solve those? Sort of what's next on the research agenda, whether it's you
or other researchers in this field? So let me try to answer with something that really excites me now.
What we are doing is that we are producing text, right, with the language model.
But we are producing this text in such a way that it helps to solve a problem.
And basically, this variational inference method and framework gives us a way of understanding what it means to be a good text. Like, what does it mean to be a good latent variable, or a useful latent variable?
Right.
What does it mean to produce good data?
So for example, these big models kind of are really data creators,
like generative AI, right?
But can we actually teach them to produce data
such that this data can be helpful to solve tasks or to condition those same models to solve a task?
Right.
And what are the objective functions that promote the production of this useful data?
What does useful mean from a mathematical perspective? I think that, apart from the prompt optimization angle, I feel like DLN, to me, kind of opened my mind a little bit to investigating ways of understanding what it means for some generated text to be useful to solve a task, I would say.
Alessandro Sordoni, thanks for joining us today
and thanks to our listeners for tuning in.
If you're interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts, or you can find it on arXiv. See you next time on Abstracts. Thank you.