The Infra Pod - Let's stop prompting and start programming... Chat with Omar about DSPy
Episode Date: November 6, 2023
Ian and Tim sat down with Omar Khattab to talk about DSPy, a research project that is changing how users interact with LLMs: instead of handcrafting prompts, you use a programming framework that searches for and optimizes the best prompts for the task. Come listen to the thought process behind DSPy and how it can fundamentally change how we interact with LLMs and AI models as a whole!
Transcript
All right, welcome back to the pod.
Yet another infra deep dive.
This is Tim from SMBC.
Well, Ian, take it away, sir.
Hi, I'm Ian, doing some angel investing,
helping Snyk turn into a platform,
lover and builder of infrastructure and dev tools.
And I'm super excited today to be joined by Omar, one of the authors of the DSP paper.
Omar, can you introduce yourself and tell us what DSP stands for and why we all care?
Sure. Thanks a lot, Ian and Tim, for hosting me.
So I'm a PhD candidate at Stanford.
In general, I build stuff around retrieval.
You might know me from the ColBERT model and follow-up work there.
I'm also an Apple PhD scholar.
So DSPy is the second version of the Demonstrate, Search, Predict, or DSP project.
And that's a project we started in the first half of 2022.
And then we open sourced it and released the paper kind of late 2022, January 2023.
And DSPy is kind of the evolution of that.
DSPy is basically this programming model where you can program and not prompt foundation
models.
And this is basically unlike anything else that exists in this space, which has become
really crowded.
There's lots of frameworks for working with language models. But we have this unique emphasis on working with not prompting techniques,
but modules that resemble kind of the architectural layers
when you're working with neural networks.
So when you have a problem that you want to solve,
especially when you're trying to build something for a new task
that is kind of unique to your use case,
you don't start by writing string prompts
and thinking about how to chain them and connect them.
And you don't also look for a predefined thing
that someone else built necessarily.
Instead, you think about what are the stages
of the pipeline in my system
and how do I map them into a control flow in Python
that's going to use the language model
through composable modules.
So these composable modules are going to be things like, hey, I want a chain of thought component that is going to take a particular signature.
So maybe it should take questions and give you answers, or maybe it'll take documents and give
you their summaries or whatever it is. And the idea is that you will express these signatures
at a high level in a Pythonic sort of control flow. And given a metric, you can then ask a
compiler, which is the DSPy compiler,
to take whatever program you've written and optimize all of its steps towards that metric.
So what that means in practice is that these high-level modules that you describe in one or
two lines and these signatures that you've assigned to them will be mapped internally
to these long, complex, high-quality prompts that you would otherwise have to sort of maintain as
messy strings that are very brittle. And because these things are so general,
the same program that you write could be mapped to a really high-quality prompt for GPT models,
could be mapped to really high-quality prompts for Llama or other local models,
and can actually be mapped into fine-tunes that are automatically constructed for whatever
sequence of steps exist in your task
that achieve really high quality. So a lot of people are thinking, should I do retrieval
augmentation with language models? Should I do chain of thought prompting? Or should I do tree
of thought prompting and other fancy things? And what we're saying is, these are not decisions you
want to be making at the level of fancy string manipulation tricks. These are actually high
level strategies that can be composed as actual Pythonic modules.
And the same program can be compiled
to many of these different things automatically
because the transformations that are there
are actually entirely automatable.
And what we see is that we can get really high quality,
in many cases better than you get
by writing the prompt by hand,
but it's in this form that is super maintainable,
really extensible, and really clean.
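To make that concrete, here is a minimal sketch of the kind of program Omar is describing, assuming DSPy's standard signature and module APIs and a placeholder model name:

```python
import dspy

# Configure the language model the program will run against.
# The model name is a placeholder; any supported LM could be swapped in.
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

# A signature declares the input/output behavior in natural language,
# not the prompt text itself.
class GenerateAnswer(dspy.Signature):
    """Answer the question with a short factoid answer."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often just a few words")

# A module applies a prompting technique (here, Chain of Thought) to that signature.
qa = dspy.ChainOfThought(GenerateAnswer)

# Calling it feels like calling a layer; the underlying prompt is generated
# (and later optimized by the compiler) rather than written by hand.
prediction = qa(question="What does the DSP in DSPy stand for?")
print(prediction.answer)
```

The prompt that ultimately hits the model is constructed, and later optimized, by DSPy rather than maintained as a handwritten string.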
This is so amazing. I think this actually has the potential to completely change the whole industry to some degree. Just to summarize, what everybody does today
is they have to pick a model, they have to pick a prompt, they have to figure out a bunch of
configurations, and if you even want to do any optimizations, you basically have to do everything by hand, iteratively, knowing all
the details of everything. So DSPy, like you said, creates a framework for you to express
the steps, and then the compiler goes and tries to figure out the rest. There's so many things we want to
talk about, but I definitely want to maybe start at the high level. When it comes to the approach, what we're pretty curious about is that there are many different kinds of ways you can express something, right?
You can express into a sequence of steps.
You can express maybe even a higher level intent.
Like how did you come up with the expression sort of framework or language?
Like what are the sort of trade-offs when it comes to deciding that?
And who is it meant to be for?
Who do you imagine will be able to grasp it?
And maybe talk about
some of the trade-offs,
even at the high level.
Yeah, for sure.
So I started my PhD
with Matei Zaharia,
and we built this
ColBERT retrieval model.
And then I also joined
Chris Potts' lab at Stanford
who does NLP.
And the idea was,
look, this retrieval stuff is going to be real big.
And this was like late 2019, way before all the recent stuff that's happening now.
And so the intuition is we can use retrieval to really improve and change the way language models tackle tasks.
So we started by building pipelines that improve the factuality and efficiency of language models.
So we can answer questions based on retrieval.
This is now very common and mainstream.
And quickly, we realized that there are two types of challenges that emerge.
The first is when you're building these pipelines,
it's often the case that you don't just want to retrieve something
and give it to the model in its prompt or in its context window.
You actually need a lot of back and forth.
So we built a system called Baleen that does multi-hop reasoning.
And what that means is that you have the language model
with the retrieval in a loop,
sort of generating queries, searching for things,
summarizing the context that's found,
adding it back into the queries,
and sort of iterating until the language model,
potentially after several hops,
finds all of the information that it needs
in order to factually answer a complex question or fact-check stuff. And what emerged from there was that there is a second challenge,
which is, okay, even if you know how to get the right pipeline and you connect all the pieces,
which is a big challenge, how do you train or supervise these steps?
Back then, prompting wasn't a thing yet. And so you had to fine-tune all of these components.
And the problem is, there are a lot of datasets that are like, here's this question, and here's the answer. Please
answer the question. But there isn't actually a lot of data that's like, here's the question,
here is the set of paragraphs you need to retrieve, here's the next query you need to generate,
here is what that should retrieve as well. And then those intermediate steps,
you do not have data for them. And that's actually a fundamental thing.
The reason it's fundamental is that
I believe pipelines should be easy to change and tweak
the way we write programs in general.
And every time you come up with a new pipeline,
you need new data because you might create new steps
that didn't exist before in this context.
So DSPy is basically our solution for that.
And it took us a long time of trying a lot of different things
to arrive at the final iteration that we have now.
We've released lots of this from very early on as open source.
Obviously, the latest stuff is also open source.
And what we arrived at is that we really want to empower people
to build their own system architectures or their own pipelines
where they can specify the signatures of the steps.
And signature is kind of a really key abstraction in DSPy
where normally function signatures, like you just specify the input-output patterns.
So some people might, for example, use Pydantic or just a Python specification,
where you say, I have a string and I need the following set of integers and whatnot.
But in DSPy, these are actually going to be natural language specifications.
So you're going to say, I have some context and a question and I want a search query.
And DSPy is going to look at these keywords in natural language and will basically work to infer what you meant in the context of your pipeline.
So you have these signatures, and you connect them in your code through nothing but basic PyTorch-style dynamic calling.
Basically, if people are familiar with PyTorch, when you're defining a network in PyTorch, you just call the steps inside any loops or if statements or any dynamic control flow you want,
and it's the same in DSPy. So it's a very natural abstraction. So once you've sort of created your pipeline that
way, the real question is, how does each step get implemented? So I have the step that's supposed
to generate search queries, let's say, or I have the step that's supposed to generate answers based
on context that's retrieved from my search model. So that was the challenge about
the data. Turns out that the answer is, if you can have a metric towards which you want to optimize,
you can sort of treat these intermediate stages as essentially a latent variable or something
you want to learn as part of the bigger pipeline. And then you can see, roughly speaking, which
intermediate outputs and
inputs both adhere to the signature that you have, so they respect the constraints of your
semantic signatures, as well as lead to high quality outputs for the pipeline. And then you
can sort of go back and optimize the whole system so that these examples could serve as parts of
prompts, or they could serve as fine-tuning data for each of the stages of your pipeline. And from that abstraction, what we arrived on after a lot of different attempts
is that this actually looks like a neural network, where you have a lot of layers.
There's an input at the beginning that you have, maybe a small number of in this case.
And maybe you have some finite labels at the end, or maybe it's self-supervised, where you don't
need a lot of labels or any labels. That depends on your metric. And so there's essentially a loss function of some kind at the end. And there are these layers,
which are actually kind of representing prompting techniques. Like one layer could be,
I want a chain of thought component here, or I want a program of thought component here. So
that's like asking the model to generate code in order to solve a task. And you just connect them
in any pipeline that you want, kind of like a feed-forward network.
And then you have these optimizers, which are sort of where the compiler comes in,
that sort of looks at whatever context your program is executing in. So whatever data you
have, for example, and looks at what metric you're trying to optimize, and then treats all of these
intermediate things as variables to optimize or as parameters to optimize.
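As a rough sketch of that neural-network-style composition, here is roughly what a multi-hop pipeline along the lines of the Baleen example might look like as a DSPy module; the signature strings, hop count, and field names are illustrative, and it assumes an LM and a retrieval model have already been configured in dspy.settings:

```python
import dspy

class SimplifiedMultiHop(dspy.Module):
    """A Baleen-style pipeline: generate a query, retrieve, fold the results back
    into the context, and repeat for a fixed number of hops."""

    def __init__(self, num_hops=2, passages_per_hop=3):
        super().__init__()
        self.num_hops = num_hops
        # Assumes dspy.settings is configured with a retrieval model (rm=...).
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        # One query-generation "layer" per hop, each declared by a signature string.
        self.generate_query = [
            dspy.ChainOfThought("context, question -> search_query")
            for _ in range(num_hops)
        ]
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for hop in range(self.num_hops):
            # Plain Python control flow ("define-by-run"), like a PyTorch forward pass.
            query = self.generate_query[hop](context=context, question=question).search_query
            context += self.retrieve(query).passages
        answer = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=answer.answer)
```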
That's incredible. And I mean, it's really interesting, right? And in many ways,
you're building a neural network
on top of all these smaller little modules.
Your system helps construct that network
and train the weights.
For me, it's very intuitive
the way that you describe it in that manner.
Help me understand how drift works.
How does the framework relearn?
How does it learn new interactions
as the underlying models change,
and also as the inputs
that I'm giving the network
change? How is it learning over time? Understanding drift and how the framework learns is, for me, the thing that
I'm really quite interested in, and also the way it can adapt to new scenarios.
Yes. So what we're doing here is that we are adding a new level of modularity that just doesn't exist
in existing things. And that level of modularity is saying what you want to express is not a prompt.
Because a prompt is something that's highly specific
to one particular language model.
It's like when you're trying to describe your program,
I don't want you to think in terms of the binary,
the final exe file that you're going to produce
or the final binary that you're going to run on your Linux.
Because that's very specific to the operating system
that you have, the hardware as well.
I want you to think of the language model
as this sort of really kind of device that
you can instruct, but you do not want to start instructing a particular model.
So now that we've separated the program structure and the steps and the signatures from the
specifics of the language model, you can then say, take my program and take my favorite
language model or two or three, and then compile things for it in a particular way.
This strategy of compiling is called the teleprompter.
DSPy has a significant number of teleprompters.
It's easy to add more because it's super modular.
Teleprompters are basically these optimizers
that can take any program and a metric and some data.
So data basically here just means inputs.
So maybe some questions you want your system to be able to answer
or some documents you want it to be able to summarize
or whatever it is.
Potentially some labels. Generally, we need labels for the final output so that your metric knows what to evaluate. But in many cases, you don't even need that.
Certainly, you don't need labels for the steps of your pipeline because that would break modularity.
So you don't need labels for that. And we generally don't assume labels for that.
By the way, a teleprompter stands for prompting at a distance. A typical teleprompter, although
this is completely open-ended,
what it will do is it will basically take your program,
find the modules that are in there.
So it'll be like, oh, there's a chain of thought module here
that takes questions and generates search queries.
And there is this other module here that is maybe a React agent
that takes the question and maybe one suggested answer
and then calls a couple of tools and then revises the answer, for example.
Now, these modules themselves
have internal parameters in DSPy.
And these parameters are things like,
although we're adding a bunch of things there,
primarily demonstrations.
So basically, when you have an agent,
it's basically a prompt or a bunch of prompts
that can call the language model multiple times
in order to teach it how to interact with tools.
So the key component there is really examples of using different tools properly. And when you're
starting with your pipeline, that set is empty, right? Because no one is going to write that
prompt for each particular set of tools in each particular pipeline that you're going to work with
in the way that works for each possible language model. That's obviously extremely unscalable and
very brittle. Instead,
a simple teleprompter might say, I'll take your program, and I'll take your language model,
and I'll build a basic zero-shot prompt that I can use for each of these modules. So that's before I have any demonstrations. Now I can take your input questions or whatever input your pipeline
expects, run it through the pipeline, potentially multiple times, potentially with high temperature,
until I can get examples that actually work with your metric, maximize the particular metric that
you have. And once I have a few of them, I can bootstrap this process. This is something I can
self-iterate. I could basically take those, put them in the prompt, see if they actually help on
new examples within the data that I have, which might be very small. And at the end of this
process, what we see is that pretty complex pipelines,
so you could have several steps of calling language models,
retrieval models,
actually work really well with pretty small models.
So it could take a Llama 2 13-billion-parameter model,
and basically it can teach itself
how to simulate a React agent,
which has fairly complex interactions with tools.
Or it could teach itself
how to do multi-hop question answering,
or how to solve math questions, etc. And what we see
is that this works a lot better than zero-shot
prompting, and in many cases works a lot better
than writing the prompt yourself by hand.
You know, if you take prompts that other people
wrote in the literature or elsewhere.
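A minimal sketch of that compilation step, reusing the hypothetical multi-hop module from the earlier sketch and assuming DSPy's BootstrapFewShot teleprompter, might look like this; the example data and metric are illustrative:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# A handful of examples; only the final answer is labeled, never the intermediate steps.
trainset = [
    dspy.Example(
        question="What year was the Python programming language first released?",
        answer="1991",
    ).with_inputs("question"),
    # ... a few more examples ...
]

# A metric over final outputs only.
def answer_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# The teleprompter runs the program on the inputs, keeps the traces that pass the
# metric, and turns them into demonstrations inside each module's prompt.
teleprompter = BootstrapFewShot(metric=answer_match)
compiled_program = teleprompter.compile(SimplifiedMultiHop(), trainset=trainset)
```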
And this is something that you can contrast with existing
frameworks, where they can give you,
hey, here's the React agent architecture,
you give it the tools that you have, and what it will do is it will map to this fixed zero-shot
prompt that just tells the model, just do your best, and it usually doesn't perform too well.
In DSPy, that's a program that you can actually compile and get a highly optimized system for your task.
I mean, this is really interesting, especially at the beginning, if we scroll back
to last year, OpenAI GPT-3 comes out.
That was a moment of notice for me.
We had a lot of this discussion around prompt engineering.
It was this whole, we're going to have this huge innovation.
We're going to have this whole new class of people.
The AI analogy to the analytics engineer is going to be the prompt engineer.
They're going to specialize in the ability to prompt specific models.
And I think what's interesting about DSPy,
it sounds to me like, one, we're lowering the barrier to entry.
So you don't actually have to have a human
that particularly understands the specifics of this model.
It's a layer of abstraction above.
So how are you making it easier for people to build
deep-learning, LLM-based applications?
Are you really actually making this simpler
and therefore democratizing access to
these systems? What do you think this is doing and how is this going to make it easy for all of us?
Or is it going to make it easy for us to build LLM-based apps?
Great question. I think what we're doing is a bit more general than democratizing this. I think we're
just fundamentally changing the way you approach it. And the reason for that is, you made this good
point about you do not need deep
expertise in writing prompts for particular models when working in DSPy. And that's true.
But I'm actually going to go farther and say, it's not that you don't need it. It's just that
that's their own question to ask. Because language models are great at a lot of things. But what they
reliably convince us of, and we see this all the time,
is that they just can't get consistent reliability on bigger complex tasks.
They can do small hops of reasoning. You can give them a document and ask them to extract something.
And with good models, you can get really good quality there. But if you're just going to ask
them open-ended questions about really niche things in the world, that's not going to cut it.
And so the issue isn't really prompting, although that's kind of an annoying thing that we have to get over.
The real question is, what is the structure of my problem and how should I build a system architecture?
How should I do good software engineering such that I can take some user input or whatever it is that I want to process and get to the final destination of the output
that I want to produce.
And our sort of premise here is that
invariably for real complex tasks,
that's not one language model call
where you want to get the best prompt.
That's a lot of back and forth
between different components.
Maybe I'll call the language model once or twice,
maybe more,
but there's definitely more processing going on there.
And once you sort of frame it like this, it's not a question of like, can we make prompt engineering easier or harder?
It's just infeasible to do good prompt engineering at that complex scale for a pipeline. You know,
you could pick your favorite other framework that exists now and then open your file structure. And
what you'll see is like tens of files called prompt.py. And what you see is like, here's the
SQL prompt for SQLite
or for MySQL or for other things.
And for each database,
there's like a slightly different prompt that does that.
Now for math, there's a different thing.
But what if I have a pipeline
that needs to do a little bit of math
and then some SQL?
Do you just connect these two?
Like, is it just going to work for all models?
Yeah, it's not going to do that.
Not reliably, at least.
So what you're going to have to start doing
is like tweak these two prompts together
and then the provider, maybe OpenAI, changes their model
a little bit and everything breaks. But if you start thinking of like,
well, I just have a pipeline that does this math thing and then does this
call to the model to generate SQL, and then I can
sort of give that to the tool, which might complain
if the SQL is bad, right? And you sort of build that as a pipeline like this.
The model changes, the data changes, the SQL database changes.
Cool, I'll just recompile.
You know, internally during compilation,
it'll discover all the issues, it'll iron them out.
And then you'll get this new compiled sort of artifact that you can save, right, for reproducibility
and just for good software engineering.
And then you can load it and it works with, you know,
so long as your system components are fixed.
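A small sketch of that recompile-and-save loop, assuming the save/load helpers on DSPy modules and reusing the hypothetical names from the earlier sketches:

```python
import dspy

# Persist the compiled artifact (bootstrapped demonstrations, instructions, etc.).
compiled_program.save("multihop_compiled.json")

# Later, load it back into a fresh instance of the same program structure...
program = SimplifiedMultiHop()
program.load("multihop_compiled.json")

# ...or, when the model, the data, or the SQL database changes, just recompile
# against the new setup and save a fresh artifact.
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-4"))  # placeholder model name
recompiled = teleprompter.compile(SimplifiedMultiHop(), trainset=trainset)
recompiled.save("multihop_compiled_v2.json")
```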
So does this make things easier?
There's a learning curve.
You know, you kind of got to learn all these new components, which are like three things, right?
There's the signatures, there's the modules that we have, which can, you know, take these signatures and learn how to prompt.
And then there's these optimizers or these teleprompters that you can use.
So, you know, it might take you a day or two if you're already familiar with like some of the general prompting things here.
But once you've done that, it just completely changes the way you think about building these
things. Now, obviously, a lot of folks are starting to realize, you've got to be AI engineers,
not prompt engineers. I like that framing. And a lot of this, I think, in the near future is
going to start to be about, hey, how do we get the right pipelines, not just for solving tasks,
but for supervising tasks, which is the kind of problem we've been working on
for the last three or four years
for these kind of multi-stage systems.
And we have a lot of rich stuff in DSPy
that could basically do that.
So hopefully this answers the question.
Yeah, I think that's probably a perfect segue.
We're going to talk about our next spicy future section here.
Spicy futures.
Obviously, there are all these new primitives
you're introducing, and it should mean
way better output and quality for users.
But you have a unique point of view:
you've done a ton of work in this space
and you probably interact with a lot of people
already talking to you about, okay, how to
leverage this cleanly, what future work there is,
and stuff like that.
Maybe we'll ask you, what do you think will happen in this AI engineer space in the next three to five years, right?
Do you see prompting just goes away?
Do you see sort of framework approach, declarative approach to overtake the world?
What do you think will happen in this AI engineer space?
And what things, in sequence, do you think will need breakthroughs to make that happen as well?
Okay, these are three great questions.
So help me if I forget any of them.
The first one is like my hot take on the future of this.
And I'm convinced that the right approach for all this stuff, the research especially,
but certainly a lot of the products that are going to stand out are going to be about this transition to thinking of these pipelines as very similar
things, as very similar artifacts to neural networks. So it's going to be a lot less about
the model. It's not going to be GPT-4 versus Llama. It's going to be what is kind of the right
sequence of solving these problems. And we're going to start thinking about general purpose ones of these.
So agents is a great case there.
But agents are jumping way too far ahead
because we can't even get a lot simpler structures
that are more deterministic to work.
So starting with agents is kind of too soon
to get the general purpose agent working.
But this sort of is vaguely in the right direction.
So I think what's going to happen
is we are going to see a lot less focus on like prompting tricks and a lot more focus on general
purpose reusable modules that can be sort of reused together as building blocks in solving
these tasks and new optimizers for full pipelines. Basically much richer versions of what we already
have in DSPy, which hopefully can sort of already be integrated
in the existing abstraction that we have.
That's on one hand.
On the other hand, what I think about prompt engineering,
working with strings and language models directly,
I don't think that's going to go anywhere,
but it will be a lot less mainstream
in the sense that right now there's a sense,
and it might be true if you don't use DSPy,
that everyone who's building things with language models
needs to have a pretty good grasp on how to make sure
you get around how brittle they are with prompts.
And I don't think that will sort of reach the place
where everyone actually needs to know it really well,
in the same sense that not everyone needs to know assembly.
But I actually bet that some people need to know assembly
so that when we build compilers, we kind of get it right.
You still have to talk to the machine at some point.
You can bootstrap, you can use your compiler to compile your compiler.
But at some point, somebody got to understand
and use this to either discover new things.
We're still discovering new capabilities for language models.
And I don't think that should stop.
As well as just get the best possible compilers,
like what we have in DSPy.
That requires some level of meta prompt engineering, if you will, like across all tasks and across models potentially.
So that's in terms of prompt engineering.
But what I think will happen is not everyone needs to do prompt engineering, that's for sure.
But actually not even everyone needs to learn something like DSPy per se,
because what will happen is this stuff will be more democratized
to the extent where higher level abstractions
above things like DSPy,
what they will mean is
there is going to be this general purpose program,
not a bunch of prompts that I wrote
or someone else wrote,
but it's a program that could learn its prompts.
And then these higher level abstractions could say,
just give me your data.
Internally, I'll compile the DSPy program.
It will get to learn its own prompts for your data
without you knowing about how that works
or what's going on internally.
And you'll get this pipeline that's highly optimized
for efficiency and for quality on your use case.
And for 90% of the use cases, that's probably good enough.
Now, a fixed prompt that's trying to do everything
for everyone is not good enough.
But a program structure that can be compiled on your data for typical problems, like chat with your PDF or these things that are recurring, I think that will just be really common. And then there are many of us who are building new things and sort of thinking about solving problems with more sophisticated and
more interesting pipelines, because obviously, this is very, very early stages.
One way to think of this is we're sort of like approaching the ResNet or even AlexNet,
if you will, moment of neural networks.
We're like, oh, we can actually put these things in layers and get really better quality
through the depth, you know, if you think of
deep learning. And what we're saying is like, oh, here's PyTorch or TensorFlow or whatever. Here's
a framework that can actually allow you to think about these things in that general way, as opposed
to hacking together some C++ or some MATLAB or something. And a lot of people don't need to build
BERT from scratch or GPT-3 from scratch. They could just download the architecture, but many
people need to actually code it.
And so that's the abstraction stack
that's going to emerge.
I'm super interested to get your take on
when you go to build a piece of software today,
you're a Node developer, JavaScript developer,
Python developer, you're NPM installing,
you're pip installing a module.
Do you envision a world where the future
of AI engineer is
I'm going to pip install DSPy, I'm going to pip install these DSPy modules,
and then I might write some of my own teleprompters,
some of my own modules that wrap around specific things,
bring my own model, if you will.
I'm curious, how do you see this ecosystem evolving?
And then also, how do you think commercialization,
so we have big compute providers charging lots of money
to give you access to their trained models
and to the private data they've used to create those models.
I'm kind of curious to get your view of how you see the industry evolving
and how that impacts what a future AI engineer may use
to actually build and solve a problem.
After a long time, and this was not quick,
I've come to realize that it's really uncommon
to have a good reason to break away
from this neural network slash PyTorch
in particular way of thinking of things.
So I think something like DSPy,
I think DSPy in particular,
might become a dependency in a lot of these things.
A lot of infrastructure can still exist around that.
So you still need good models
that could serve as the starting point
for sort of bootstrapping these processes
and compiling as well as deploying them.
But what's going to happen is that
these models will be a lot more replaceable.
What exists right now is that there's this huge lock-in.
If you have these prompts optimized for GPT-4
and it's kind of like quite capable and whatnot,
you're kind of stuck with that.
But what we're saying is
when there's an automatic optimizer
that could try a lot more things than you as a person can, and that could go through automated processes,
it's not like just a language model is doing magic. It's like there is a sort of systematic
optimizer that is going through an algorithm there. You can sort of automate a lot of the
hill climbing such that you could compile a T5 model that you could run on your CPU.
And for many sorts of pipelines, it'll basically do as
well as a really large, expensive model. So what's going to happen, I think, is just a broader scope of
being able to offload calls of language models of various sizes and kind of much higher variety
as parts of programs, parts of these chains that people are building. And that comes with all things around monitoring and tracing and whatnot.
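DSPy also ships fine-tuning-oriented teleprompters alongside the prompting ones; a very rough sketch of the idea Omar describes, again reusing the hypothetical program and metric from the earlier sketches, might look like this (the exact compile arguments are an assumption and may differ between versions):

```python
from dspy.teleprompt import BootstrapFinetune

# Rough idea: bootstrap good traces with the current (large) setup, then fine-tune
# a small local model to run the same pipeline steps cheaply.
# NOTE: the `target` argument name is an assumption; check the DSPy source/docs.
finetuner = BootstrapFinetune(metric=answer_match)
small_program = finetuner.compile(
    SimplifiedMultiHop(),              # hypothetical program from the earlier sketch
    trainset=trainset,                 # the same small set of inputs/labels
    target="google/flan-t5-large",     # assumed way to name the model to fine-tune
)
```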
And for the people building programs with these language models,
increasingly, I think, especially with the abstraction of DSPy,
where I guess it's what's called define-by-run,
meaning you just write your code,
and then it can actually look at which places you call the language model in
and sort of under the hood figure out how to do the fine tuning and prompting. What we're going to see is, I think,
increasingly the gap between just general programming and programming with
models is going to be blurred. So people will just start to use these modules
in their normal code to do general verification,
validate user input, and do other things. But it's not going to be by writing
a prompt. It's going to be through programming.
So that's the whole message of programming foundation models
and not prompting them.
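A hypothetical sketch of what that blurring looks like, with a DSPy module dropped into ordinary application code instead of a handwritten prompt; the signature and field names are illustrative:

```python
import dspy

# Hypothetical sketch: a DSPy module used like an ordinary function inside
# application code, instead of a handwritten moderation prompt.
check_message = dspy.Predict("user_message -> is_abusive")

def handle_message(user_message: str) -> str:
    verdict = check_message(user_message=user_message)
    if verdict.is_abusive.strip().lower().startswith("yes"):
        return "Sorry, I can't help with that."
    return f"Processing request: {user_message}"  # stand-in for the rest of the app
```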
I'm very curious how you see DSPy, and maybe even what's related, evolving.
Because first DSP happened, and then DSPy,
which is almost like an abstraction on top of it.
Based on even your description,
we're assuming there's going to be even more abstraction,
maybe on top of this, right?
Yes.
To make it even easier. So what do you think will be the progression
here? Because looking at DSPy today, obviously it's early.
Looking at notebooks, you have to kind of pick your teleprompter, pick a lot of things.
You have to know intuitively everything that's there to be able to kind of piece together
the structure of how you want to be able to use the framework. But, you know,
eventually it seems like we want to move forward
with a much simpler model,
hopefully for the majority use case, like 90% of that.
How, I guess?
Where do you see the next level of work will happen?
Do we need to make DSPy have a lot more
sort of like general optimizations
and then we have something to express it over even more?
Where do you see that happen, maybe in the short term?
Yeah, I mean, in the short term,
I think DSPy is just expanding the capabilities
of what we can do.
And it's kind of targeting the large number of people,
although it's not the majority,
who are upset with the way prompting looks
and want to build powerful things
in a more systematic way.
And I have to recognize
that a lot of application builders,
they're not trying to build
the best language model pipeline.
They're trying to build an application that's facing a user.
And the details of how to build the best pipeline is not their main focus.
So for that, I'd point to the stack that we have in neural networks,
where at the bottom there's something like TensorFlow or PyTorch, right?
And above that there's something like Hugging Face Transformers.
And Hugging Face Transformers doesn't tell you like,
here is how you build your whole new
transformer from scratch and
give it new layers that are completely
novel. No, it just tells you here's BERT
if you want to use BERT, here's GPT-2 if you
want to use GPT-2. If you want to do some lightweight
quick fine-tuning on top of BERT, sure,
it's easy to do it this way. But at the end
of the day, it's this kind of higher-level wrapper
around PyTorch, mainly, that
gives you sort of off-the-shelf common models as something you could simply reuse as part of your application
or kind of higher level research. And I think that's exactly what we're going to see here.
I should add that the most popular set of frameworks that have existed over the past several
months or so are trying to act at the Hugging Face level.
But the problem is there is no PyTorch beneath them.
It's kind of like implementing Hugging Face
transformers, if folks are familiar with that,
using normal
Python. So under the hood
it's extremely inextensible.
It's like trying to do loops
instead of calling neural network layers
or hard coding the weights of your
neural network instead of just optimizing with
your favorite optimizer,
like Adam or SGD or something.
What I think will happen, and we're talking with people,
is like, hey, you already have this kind of cool high-level abstraction.
Do you want to build things in it using the abstractions we have in DSPy
as a way to make development itself a lot more general and faster,
but also as a way to make it a lot more adaptable
so people could take your high-level pipeline and just do this high-level recompilation
and get a better program without relearning sort of all of the internals of DSPy.
Although I should add that for someone who's thought about these pipelines, I think DSPy
comes very naturally because there's basically like six or seven modules that you could pick
from at the moment, which might sound like a small number,
but if you're thinking about layers of neural networks,
there's not too many.
Like, you know, you could have dropout,
linear layers, convolution layers,
you know, recurrent layers.
There's not too many things,
especially the key ones, you know,
that you can reuse as building blocks.
And there's a few optimizers or teleprompters
that you could use from.
And then you do your thing.
You can structure them in any general Python code
that you want.
Use your loops, exceptions, whatever it is. Just call the right modules in the right places.
It's define-by-run. And so I think that stack is going to really change
the way we think about these programs.
That's really interesting. I'm curious, if you had to suggest for someone who's a complete newbie
to LLMs and AI and what could be built today, where should they start?
Where is the place to dive in?
And then what would you tell them they should go build and try?
What are they ultimately trying to do?
Is it just like learn and get a good...
Trying to learn, understand what impact all of this could have on the business they're building.
Maybe they work at Cisco or work at some large enterprise.
And they said, go figure out AI.
And they stumble across this podcast, they stumble across your work,
and they're really interested. Where should they get started?
I think it's useful to understand this potential emerging stack.
So I'd say DSPy is exactly in the middle, where we're sort of
giving you these high-level primitives that you can compose.
And the whole abstraction is like, I mean, if you look at our code, it might be around
like 3,000 lines of Python that we can refactor to be even less.
It's a very small framework that can do a lot of things.
So I think it's a good place to start to sort of understand that this is not about language
models.
This is about building these high-level pipelines and programs and understand what kinds of
moving parts are important.
Then you want to probably go up a level and understand,
well, what are people actually building with language models
at the application layer, which is probably where someone
just looking into this stuff might want to jump in.
And so look at the agents in LangChain,
look at the chains that they have built in.
And then you probably want to go a step beneath all of these
and look at like, okay, what if I wanted to write the prompts
myself manually?
And I think with the understanding of these three layers of the stack,
the device itself being the language model,
the programming model for sort of automating interactions with them,
and then like kind of prepackaged chains around all of that,
I think you'll be in a place to actually select
the layer you want to solve problems at.
So if your problem is already solved by a high-quality chain that someone else built
and you validate that the cost is fine with you,
the quality is fine with you,
maybe that's all you need.
And I think most of the time it will be the case that
it turns out you need to customize this,
make it more cost-effective,
make sure the quality is better,
be able to iterate over time to fix issues
and adapt it more.
I think you basically only have DSPy
and rolling your own thing right now if you want that level of control
and iterative development. And so
just understanding those three layers of
the stack gives you options.
Cool. So one last question.
It's super exciting that you are
really changing
a lot of the paradigm. It was great
that you also in the readme have a lot of sections
comparing why you should use DSPy
versus, you know,
just general prompting
and LangChain
or some equivalent frameworks.
And really one section
that really kind of like
stood out to me
is like, hey,
instead of the generic
hard-coded prompts,
DSPy doesn't contain
any specific prompts for you,
right?
It learns how to optimize.
It learns how to generate it.
I think to be able to optimize anything well, you should be able to evaluate it well, right? I need to be able to know
I'm making progress, because otherwise I could be optimizing toward the wrong thing. But today, even this
evaluation step is very hard to pinpoint. I wonder how you think about that part of it,
because I feel like that's a very crucial part. How do you make the evaluation
part work? So this is a difficult problem inherently. What DSPy offers is the right
framework to think about it as an iterative process that will keep getting better over time.
So we have this notion of metrics. And one thing we've been doing for a while is like
building metrics that are themselves optimized DSPy programs. So you want
to check, for example, you're trying to generate questions that have a particular quality,
you'd have a program that checks that your questions have that particular quality.
And optimizing a program of that kind is actually a lot easier than optimizing the original program,
because it's kind of a binary label for the metric itself. So what we give you here can
start with a small data and a simple program and a basic metric.
And as you sort of start collecting data
and eyeballing examples,
what you will see is where your metric is not perfect
and where your data is lacking
and where your program falters.
And you could basically iteratively improve
each of these three components over time,
not as like seeking the perfect thing right away, but as like, here's the basic program.
I'll compile it with a simple metric.
Actually, turns out I needed to maybe have a little bit more input questions here.
So I'll get some from putting this out in a demo or something.
Turns out now I can improve the metric itself to optimize the program towards a more aligned
view that I have of what the
program should be doing.
And you can sort of isolate the thinking between these three stages, as opposed to the much
more common approach of like, let me tweak this prompt a little bit and see if things
look better now.
That is the framework we want people to think in.
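As a hedged sketch of what a metric-that-is-itself-a-program can look like, following DSPy's metric convention of a function taking an example, a prediction, and an optional trace; the assessment signature and wording are illustrative:

```python
import dspy

# Sketch: a metric that is itself a small DSPy program. The language model is asked
# for one binary judgment about the output, which is much easier to get right (and
# to optimize) than hand-scoring the whole pipeline.
assess_question = dspy.Predict("generated_question, criterion -> meets_criterion")

def question_quality(example, pred, trace=None):
    # DSPy metrics take an example, a prediction, and an optional trace.
    verdict = assess_question(
        generated_question=pred.question,
        criterion="Is the question specific, well-formed, and answerable?",
    )
    return verdict.meets_criterion.strip().lower().startswith("yes")
```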
Awesome.
Well, thank you so much, Omar, for coming on our podcast.
I know we have a ton of stuff we can even go for, but this is exciting and there's so much work to do.
Where can people find you and where can people find more about the project?
You can find DSPy on GitHub and you can just Google my name.
You'll find my website, email me or open an issue on GitHub.
Just search for DSPy or DSPy Stanford, and you should easily find it.
Thank you so much, Omar.
Yeah, thanks, guys.
It's a pleasure.