The Infra Pod - Let's stop prompting and start programming... Chat with Omar about DSPy

Episode Date: November 6, 2023

Ian and Tim sat down with Omar Khattab to talk about DSPy, a research project that's changing how users interact with LLMs: instead of hand-crafting prompts, you write programs, and the framework searches for and optimizes the best prompts for the task. Come listen to the thought process behind DSPy and how it can fundamentally change how we interact with LLMs and AI models as a whole!

Transcript
Starting point is 00:00:00 All right, welcome back to the pod. Yet another infra deep dive. This is Tim from SMBC. Well, Ian, take it away, sir. Hi, I'm Ian, doing some angel investing, helping Snyk turn into a platform, lover and builder of infrastructure and dev tools. And I'm super excited today to be joined by Omar, one of the authors of the DSP paper.
Starting point is 00:00:30 Omar, can you introduce yourself and tell us what DSP stands for and why we all care? Sure. Thanks a lot, Ian and Tim, for hosting me. So I'm a PhD candidate at Stanford. In general, I build stuff around retrieval. You might know me from the ColBERT model and follow-up work there. I'm also an Apple PhD scholar. So DSPy is the second version of the Demonstrate, Search, Predict, or DSP, project. And that's a project we started in the first half of 2022.
Starting point is 00:01:00 And then we open sourced it and released the paper kind of late 2022, January 2023. And DSPy is kind of the evolution of that. DSPy is basically this programming model where you can program and not prompt foundation models. And this is basically unlike anything else that exists in this space, which has become really crowded. There's lots of frameworks for working with language models. But we have this unique emphasis on working with not prompting techniques, but modules that resemble kind of the architectural layers
Starting point is 00:01:32 when you're working with neural networks. So when you have a problem that you want to solve, especially when you're trying to build something for a new task that is kind of unique to your use case, you don't start by writing string prompts and thinking about how to chain them and connect them. And you don't also look for a predefined thing that someone else built necessarily.
Starting point is 00:01:50 Instead, you think about what are the stages of the pipeline in my system and how do I map them into a control flow in Python that's going to use the language model through composable modules. So these composable modules are going to be things like, hey, I want a chain of thought component that is going to take a particular signature. So maybe it should take questions and give you answers, or maybe it'll take documents and give you their summaries or whatever it is. And the idea is that you will express these signatures
Starting point is 00:02:18 at a high level in a Pythonic sort of control flow. And given a metric, you can then ask a compiler, which is the DSPy compiler, to take whatever program you've written and optimize all of its steps towards that metric. So what that means in practice is that these high-level modules that you describe in one or two lines and these signatures that you've assigned to them will be mapped internally to these long, complex, high-quality prompts that you would otherwise have to sort of maintain as messy strings that are very brittle. And because these things are so general, the same program that you write could be mapped to a really high-quality prompt for GPT models,
Starting point is 00:02:55 could be mapped to really high-quality prompts for Llama or other local models, and can actually be mapped into fine-tunes that are automatically constructed for whatever sequence of steps exists in your task that achieve really high quality. So a lot of people are thinking, should I do retrieval augmentation with language models? Should I do chain of thought prompting? Or should I do tree of thought prompting and other fancy things? And what we're saying is, these are not decisions you want to be making at the level of fancy string manipulation tricks. These are actually high level strategies that can be composed as actual Pythonic modules.
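For readers who want to see what this looks like concretely, here is a minimal sketch of such a program, written against the DSPy APIs as I understand them around the time of this episode (dspy.OpenAI, dspy.ColBERTv2, dspy.Retrieve, dspy.ChainOfThought); the model choice and retriever URL are placeholders, not recommendations from the episode.

```python
import dspy

# Configure a language model and a retrieval model once, globally.
# The endpoint URL below is a hypothetical placeholder.
lm = dspy.OpenAI(model="gpt-3.5-turbo")
rm = dspy.ColBERTv2(url="http://localhost:8893/api/search")
dspy.settings.configure(lm=lm, rm=rm)


class SimpleRAG(dspy.Module):
    """A two-stage pipeline: retrieve passages, then answer with chain of thought."""

    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        # The signature is a natural-language spec of inputs -> outputs.
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)


# No prompt strings anywhere: the modules and their signatures get mapped
# to concrete prompts (or fine-tunes) when the program is compiled.
prediction = SimpleRAG()(question="Who maintains the DSPy project?")
print(prediction.answer)
```

The point in what follows is that this same program, unchanged, is what gets compiled against GPT models, local models, or fine-tunes.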
Starting point is 00:03:27 And the same program can be compiled to many of these different things automatically because the transformations that are there are actually entirely automatable. And what we see is that we can get really high quality, in many cases better than you get by writing the prompt by hand, but it's in this form that is super maintainable,
Starting point is 00:03:43 really extensible, and really clean. This is so amazing. I think this actually has the potential to really change the whole industry to some degree. Just to kind of summarize, what everybody does today is they have to pick a model, they have to pick a prompt, they have to figure out a bunch of configurations, and if you even want to do any optimizations, you basically have to do everything by hand, iteratively knowing all the details of everything. So DSPy, like you said, creates a framework for you to express some steps, and then the compiler goes and tries to figure out the rest. There's so many things we want to talk about, but I definitely want to maybe talk about sort of the high level. Like when it comes to the approach, what we are pretty curious about is that there are many different kinds of ways you can express something, right?
Starting point is 00:04:32 You can express it as a sequence of steps. You can express maybe even a higher level intent. Like how did you come up with this sort of expression framework or language? What are the sort of trade-offs when it comes to deciding that? And who is it meant to be for? Who do you imagine will be able to grasp it? And maybe talk about
Starting point is 00:04:50 some of the trade-offs even on the high level. Yeah, for sure. So I started my PhD with Matei Zaharia and we built this ColBERT retrieval model. And then I also joined
Starting point is 00:05:00 Chris Potts' lab at Stanford who does NLP. And the idea was, look, this retrieval stuff is going to be real big. And this was like late 2019, way before all the recent stuff that's happening now. And so the intuition is we can use retrieval to really improve and change the way language models tackle tasks. So we started by building pipelines that improve the factuality and efficiency of language models. So we can answer questions based on retrieval.
Starting point is 00:05:25 This is now very common and mainstream. And quickly, we realized that there are two types of challenges that emerge. The first is when you're building these pipelines, it's often the case that you don't just want to retrieve something and give it to the model in its prompt or in its context window. You actually need a lot of back and forth. So we built a system called Baleen that does multi-hop reasoning. And what that means is that you have the language model
Starting point is 00:05:48 with the retrieval in a loop, sort of generating queries, searching for things, summarizing the context that's found, adding it back into the queries, and sort of iterating until the language model, potentially after several hops, finds all of the information that it needs in order to factually answer a complex question or fact-check stuff. And what emerged from there was that there is a second challenge,
Starting point is 00:06:10 which is, okay, even if you know how to get the right pipeline and you connect all the pieces, which is a big challenge, how do you train or supervise these steps? Back then, prompting wasn't a thing yet. And so you had to fine-tune all of these components. And the problem is, there are a lot of datasets that are like, here's this question, and here's the answer. Please answer the question. But there isn't actually a lot of data that's like, here's the question, here is the set of paragraphs you need to retrieve, here's the next query you need to generate, here is what that should retrieve as well. And then those intermediate steps, you do not have data for them. And that's actually a fundamental thing.
Starting point is 00:06:46 The reason it's fundamental is that I believe pipelines should be easy to change and tweak the way we write programs in general. And every time you come up with a new pipeline, you need new data because you might create new steps that didn't exist before in this context. So DSPy is basically our solution for that. And it took us a long time of trying a lot of different things
Starting point is 00:07:04 to arrive at the final iteration that we have now. We've released lots of this from very early on as open source. Obviously, the latest stuff is also open source. And what we arrived at is that we really want to empower people to build their own system architectures or their own pipelines where they can specify the signatures of the steps. And signature is kind of a really key abstraction in DSPy where normally function signatures, like you just specify the input-output patterns.
Starting point is 00:07:28 So some people might, for example, use Pydantic or just a Python specification, where you say, I have a string and I need the following set of integers and whatnot. But in DSPy, these are actually going to be natural language specifications. So you're going to say, I have some context and a question and I want a search query. And DSPy is going to look at these keywords in natural language and will basically work to infer what you meant in the context of your pipeline. So once you have these signatures, you connect them in your code through nothing but basic PyTorch-style dynamic calling. So basically, if people are familiar with PyTorch, when you're defining a network in PyTorch, you just call the steps in any loops or if statements or any dynamic control flow you want, the same as in DSPy. So very natural abstraction. So once you've sort of created your pipeline that
Starting point is 00:08:14 way, the real question is, how does each step get implemented? So I have the step that's supposed to generate search queries, let's say, or I have the step that's supposed to generate answers based on context that's retrieved from my search model. So that was the challenge about the data. Turns out that the answer is, if you can have a metric towards which you want to optimize, you can sort of treat these intermediate stages as essentially a latent variable or something you want to learn as part of the bigger pipeline. And then you can see, roughly speaking, which intermediate outputs and inputs both conform to the signature that you have, so they respect the constraints of your
Starting point is 00:08:49 semantic signatures, as well as lead to high quality outputs for the pipeline. And then you can sort of go back and optimize the whole system so that these examples could serve as parts of prompts, or they could serve as fine-tuning data for each of the stages of your pipeline. And from that abstraction, what we arrived on after a lot of different attempts is that this actually looks like a neural network, where you have a lot of layers. There's an input at the beginning that you have, maybe a small number of inputs in this case. And maybe you have some final labels at the end, or maybe it's self-supervised, where you don't need a lot of labels or any labels. That depends on your metric. And so there's essentially a loss function of some kind at the end. And there are these layers, which are actually kind of representing prompting techniques. Like one layer could be,
Starting point is 00:09:33 I want a chain of thought component here, or I want a program of thought component here. So that's like asking the model to generate code in order to solve a task. And you just connect them in any pipeline that you want, kind of like a feed-forward network. And then you have these optimizers, which are sort of where the compiler comes in, that sort of looks at whatever context your program is executing in. So whatever data you have, for example, and looks at what metric you're trying to optimize, and then treats all of these intermediate things as variables to optimize or as parameters to optimize. That's incredible. And I mean, it's really interesting, right? And in many ways,
Starting point is 00:10:05 you're building a neural network on top of all these smaller little modules. Your system helps construct that network and train the weights. For me, it's very intuitive the way that you describe it in that manner. Help me understand how drift works. How does the framework relearn?
Starting point is 00:10:18 How does it learn new interactions as the underlying models change? And also as the inputs that I'm giving the network change. How is it learning over time? Understanding drift and how it learns is, for me, the thing that I'm really quite interested in, and also the way it can adapt to new scenarios. Yes. So what we're doing here is that we are adding a new level of modularity that just doesn't exist in existing things. And that level of modularity is saying what you want to express is not a prompt.
Starting point is 00:10:46 Because a prompt is something that's highly specific to one particular language model. It's like when you're trying to describe your program, I don't want you to think in terms of the binary, the final exe file that you're going to produce or the final binary that you're going to run on your Linux. Because that's very specific to the operating system that you have, the hardware as well.
Starting point is 00:11:02 I want you to think of the language model as this sort of really kind of device that you can instruct, but you do not want to start instructing a particular model. So now that we've separated the program structure and the steps and the signatures from the specifics of the language model, you can then say, take my program and take my favorite language model or two or three, and then compile things for it in a particular way. This strategy of compiling is called the teleprompter. DSPy has a significant number of teleprompters.
Starting point is 00:11:30 It's easy to add more because it's super modular. Teleprompters are basically these optimizers that can take any program and a metric and some data. So data basically here just means inputs. So maybe some questions you want your system to be able to answer or some documents you want it to be able to summarize or whatever it is. Potentially some labels. Generally, we need labels for the final output so that your metric knows what to evaluate. But in many cases, you don't even need that.
Starting point is 00:11:52 Certainly, you don't need labels for the steps of your pipeline because that would break modularity. So you don't need labels for that. And we generally don't assume labels for that. By the way, a teleprompter stands for prompting at a distance. A typical teleprompter, although this is completely open-ended, what it will do is it will basically take your program, find the modules that are in there. So it'll be like, oh, there's a chain of thought module here that takes questions and generates search queries.
Starting point is 00:12:15 And there is this other module here that is maybe a React agent that takes the question and maybe one suggested answer and then calls a couple of tools and then revises the answer, for example. Now, these modules themselves have internal parameters in DSPy. And these parameters are things like, although we're adding a bunch of things there, primarily demonstrations.
Starting point is 00:12:35 So basically, when you have an agent, it's basically a prompt or a bunch of prompts that can call the language model multiple times in order to teach it how to interact with tools. So the key component there is really examples of using different tools properly. And when you're starting with your pipeline, that set is empty, right? Because no one is going to write that prompt for each particular set of tools in each particular pipeline that you're going to work with in the way that works for each possible language model. That's obviously extremely unscalable and
Starting point is 00:13:03 very brittle. Instead, a simple teleprompter might say, I'll take your program, and I'll take your language model, and I'll build a basic zero-shot prompt that I can use for each of these modules. So that's before I have any demonstrations. Now I can take your input questions or whatever input your pipeline expects, run it through the pipeline, potentially multiple times, potentially with high temperature, until I can get examples that actually work with your metric, that maximize the particular metric that you have. And once I have a few of them, I can bootstrap this process. This is something I can self-iterate. I could basically take those, put them in the prompt, see if they actually help on new examples within the data that I have, which might be very small. And at the end of this
Starting point is 00:13:42 process, what we see is that pretty complex pipelines, so you could have several steps of calling language models, retrieval models, actually work really well with pretty small models. So it could take a Llama 2 13-billion-parameter model, and basically it can teach itself how to simulate a React agent, which has fairly complex interactions with tools.
Starting point is 00:14:02 Or it could teach itself how to do multi-hop question answering, or how to solve math questions, etc. And what we see is that this works a lot better than zero-shot prompting, and in many cases works a lot better than writing the prompt yourself by hand. You know, if you take prompts that other people wrote in the literature or elsewhere.
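As a rough sketch of that bootstrapping step, continuing the hypothetical SimpleRAG program from earlier: BootstrapFewShot is the DSPy teleprompter I believe Omar is describing here, but treat the exact class, argument, and method names (including save) as assumptions rather than gospel.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# A handful of examples with labels only for the final answer;
# no labels for any intermediate step of the pipeline.
trainset = [
    dspy.Example(question="What castle did David Gregory inherit?",
                 answer="Kinnairdy Castle").with_inputs("question"),
    dspy.Example(question="Which award did Gary Zukav's first book receive?",
                 answer="National Book Award").with_inputs("question"),
]

# The metric only judges the pipeline's final output.
def answer_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# Run the uncompiled program on the inputs, keep traces that pass the metric,
# and bake those demonstrations into each module's prompt.
teleprompter = BootstrapFewShot(metric=answer_match, max_bootstrapped_demos=4)
compiled_rag = teleprompter.compile(SimpleRAG(), trainset=trainset)

# Persist the optimized artifact; recompile when the model or data changes.
compiled_rag.save("compiled_rag.json")
```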
Starting point is 00:14:18 And this is something that you can contrast with existing frameworks, where they can give you, hey, here's the React agent architecture, you give it the tools that you have, and what it will do is it will map to this fixed zero-shot prompt that just tells the model, just do your best, and it usually doesn't perform too well. In DSPy, that's a program that you can actually compile and get a highly optimized system for your task. I mean, this is really interesting, especially at the beginning, if we scroll back to last year, OpenAI GPT-3 comes out.
Starting point is 00:14:46 That was a notable moment for me. We had a lot of this discussion around prompt engineering. It was this whole, we're going to have this huge innovation. We're going to have this whole new class of people. The AI analogy to the analytics engineer was going to be the prompt engineer. They're going to specialize in the ability to prompt specific models. And I think what's interesting about DSPy, it sounds to me like, one, we're lowering the barrier to entry.
Starting point is 00:15:08 So you don't actually have to have a human that particularly understands the specifics of this model. It's a layer of abstraction above. So how are you making it easier for people to build deep learning and LLM-based applications? Are you really actually making this simpler and therefore democratizing access to these systems? What do you think this is doing and how is this going to make it easy for all of us?
Starting point is 00:15:29 Or is it going to make it easy for us to build LLM-based apps? Great question. I think what we're doing is a bit more general than democratizing this. I think we're just fundamentally changing the way you approach it. And the reason for that is, you made this good point about how you do not need deep expertise in writing prompts for particular models when working in DSPy. And that's true. But I'm actually going to go farther and say, it's not that you don't need it. It's just that that's the wrong question to ask. Because language models are great at a lot of things. But what they reliably convince us, we see this all the time,
Starting point is 00:16:06 is that they just can't get consistent reliability on bigger complex tasks. They can do small hops of reasoning. You can give them a document and ask them to extract something. And with good models, you can get really good quality there. But if you're just going to ask them open-ended questions about really niche things in the world, that's not going to cut it. And so the issue isn't really prompting, although that's kind of an annoying thing that we have to get over. The real question is, what is the structure of my problem and how should I build a system architecture? How should I do good software engineering such that I can take some user input or whatever it is that I want to process and get to the final destination of the output that I want to produce.
Starting point is 00:16:48 And our sort of premise here is that invariably for real complex tasks, that's not one language model call where you want to get the best prompt. That's a lot of back and forth between different components. Maybe I'll call the language model once or twice, maybe more,
Starting point is 00:17:01 but there's definitely more processing going on there. And once you sort of frame it like this, it's not a question of like, can we make prompt engineering easier or harder? It's just infeasible to do good prompt engineering at that complex scale for a pipeline. You know, you could pick your favorite other framework that exists now and then open your file structure. And what you'll see is like tens of files called prompt.py. And what you see is like, here's the SQL prompt for SQLite or for MySQL or for other things. And for each database,
Starting point is 00:17:29 there's like a slightly different prompt that does that. Now for math, there's a different thing. But what if I have a pipeline that needs to do a little bit of math and then some SQL? Do you just connect these two? Like, is it just going to work for all models? Yeah, it's not going to do that.
Starting point is 00:17:41 Not reliably, at least. So what you're going to have to start doing is like tweak these two prompts together and then the provider, maybe OpenAI, changes their model a little bit and everything breaks. But if you start thinking of like, well, I just have a pipeline that does this math thing and then does this call to the model to generate SQL, and then I can sort of give that to the tool, which might complain
Starting point is 00:18:01 if the SQL is bad, and you sort of build that as a pipeline like this. The model changes, the data changes, the SQL database changes. Cool, I'll just recompile. You know, internally during compilation, it'll discover all the issues, it'll iron them out. And then you'll get this new compiled sort of artifact that you can save, right, for reproducibility and just for good software engineering.
Starting point is 00:18:20 And then you can load it and it works with, you know, so long as your system components are fixed. So does this make things easier? There's a learning curve. You know, you kind of got to learn all these new components, which are like three things, right? There's the signatures, there's the modules that we have, which can, you know, take these signatures and learn how to prompt. And then there's these optimizers or these teleprompters that you can use. So, you know, it might take you a day or two if you're already familiar with like some of the general prompting things here.
Starting point is 00:18:44 But once you've done that, it just completely changes the way you think about building these things. Now, obviously, a lot of folks are starting to realize, you've got to be AI engineers, not prompt engineers. I like that framing. And a lot of this, I think, in the near future is going to start to be about, hey, how do we get the right pipelines, not just for solving tasks, but for supervising tasks, which is the kind of problem we've been working on for the last three or four years for these kind of multi-stage systems. And we have a lot of rich stuff in DSPy
Starting point is 00:19:11 that could basically do that. So hopefully this answers the question. Yeah, I think that's probably a perfect segue. We're going to talk about our next spicy future section here. Spicy futures. Obviously, there's all these new primitives you're introducing, and it should mean way better
Starting point is 00:19:31 output and quality for users. But given your point of view, and given that you've done a ton of work in this space and probably interact with a lot of people already talking to you about, okay, how can we leverage this, what's the future work, and stuff like that. Maybe we'll ask you, what do you think will happen in this AI engineer space in the next three to five years, right?
Starting point is 00:19:52 Do you see prompting just going away? Do you see the framework approach, the declarative approach, overtaking the world? What do you think will happen in this AI engineer space? And what are the things, in sequence, that you think will need some breakthroughs to make that happen as well? Okay, these are three great questions. So help me if I forget any of them. The first one is like my hot take on the future of this. And I'm convinced that the right approach for all this stuff, the research especially,
Starting point is 00:20:21 but certainly a lot of the products that are going to stand out are going to be about this transition to thinking of these pipelines as very similar things, as very similar artifacts to neural networks. So it's going to be a lot less about the model. It's not going to be GPT-4 versus Llama. It's going to be what is kind of the right sequence of solving these problems. And we're going to start thinking about general purpose ones of these. So agents are a great case there. But agents are jumping way too far ahead because we can't even get a lot simpler structures that are more deterministic to work.
Starting point is 00:20:56 So starting with agents is kind of too soon to get the general purpose agent working. But this sort of is vaguely in the right direction. So I think what's going to happen is we are going to see a lot less focus on like prompting tricks and a lot more focus on general purpose reusable modules that can be sort of reused together as building blocks in solving these tasks and new optimizers for full pipelines. Basically much richer versions of what we already have in DSPy, which hopefully can sort of already be integrated
Starting point is 00:21:26 in the existing abstraction that we have. That's on one hand. On the other hand, what I think about prompt engineering, working with strings and language models directly, I don't think that's going to go anywhere, but it will be a lot less mainstream in the sense that right now there's a sense, and it might be true if you don't use DSPy,
Starting point is 00:21:44 that everyone who's building things with language models needs to have a pretty good grasp on how to make sure you get around how brittle they are with prompts. And I don't think that will sort of reach the place where everyone actually needs to know it really well, in the same sense that not everyone needs to know assembly. But I actually bet that some people need to know assembly so that when we build compilers, we kind of get it right.
Starting point is 00:22:08 You still have to talk to the machine at some point. You can bootstrap, you can use your compiler to compile your compiler. But at some point, somebody got to understand and use this to either discover new things. We're still discovering new capabilities for language models. And I don't think that should stop. As well as just get the best possible compilers, like what we have in DSPy.
Starting point is 00:22:25 That requires some level of meta prompt engineering, if you will, like across all tasks and across models potentially. So that's in terms of prompt engineering. But what I think will happen is not everyone needs to do prompt engineering, that's for sure. But actually not even everyone needs to learn something like DSPy per se, because what will happen is this stuff will be more democratized to the extent where higher level abstractions above things like DSPy, what they will mean is
Starting point is 00:22:51 there is going to be this general purpose program, not a bunch of prompts that I wrote or someone else wrote, but it's a program that could learn its prompts. And then these higher level abstractions could say, just give me your data. Internally, I'll compile the DSPy program. It will get to learn its own prompts for your data
Starting point is 00:23:07 without you knowing about how that works or what's going on internally. And you'll get this pipeline that's highly optimized for efficiency and for quality on your use case. And for 90% of the use cases, that's probably good enough. Now, a fixed prompt that's trying to do everything for everyone is not good enough. But a program structure that can be compiled on your data for typical problems like chat with your PDF or these things that are recurring, I think that will just be really common. But there are many of us who are building new things and sort of like thinking about solving problems with more sophisticated and
Starting point is 00:23:49 more interesting pipelines, because obviously, this is very, very early stages. One way to think of this is we're sort of like approaching the ResNet or even AlexNet, if you will, moment of neural networks. We're like, oh, we can actually put these things in layers and get much better quality through the depth, if you think of deep learning. And what we're saying is like, oh, here's PyTorch or TensorFlow or whatever. Here's a framework that can actually allow you to think about these things in that general way, as opposed to hacking together some C++ or some MATLAB or something. And a lot of people don't need to build
Starting point is 00:24:19 BERT from scratch or GPT-3 from scratch. They could just download the architecture, but somebody still needs to actually code it. And so that's the abstraction stack that's going to emerge. I'm super interested to get your take on when you go to build a piece of software today, you're a Node developer, JavaScript developer, Python developer, you're NPM installing,
Starting point is 00:24:38 you're pip installing a module. Do you envision a world where the future of AI engineer is I'm going to pip install DSPy, I'm going to pip install these DSPy modules, and then I might write some of my own teleprompters, some of my own modules that wrap around specific things, bring my own model, if you will. I'm curious, how do you see this ecosystem evolving?
Starting point is 00:24:59 And then also, how do you think commercialization, so we have big compute providers charging lots of money to give you access to their trained models and to the private data they've used to create those models. I'm kind of curious to get your view of how you see the industry evolving and how that impacts what a future AI engineer may use to actually build and solve a problem. After a long time, this was not quick.
Starting point is 00:25:22 I've come to realize that it's really uncommon to have a good reason to break away from this neural network slash PyTorch in particular way of thinking of things. So I think something like DSPy, I think DSPy in particular, might become a dependency in a lot of these things. A lot of infrastructure can still exist around that.
Starting point is 00:25:42 So you still need good models that could serve as the starting point for sort of bootstrapping these processes and compiling as well as deploying them. But what's going to happen is that these models will be a lot more replaceable. What exists right now is that there's this huge lock-in. If you have these prompts optimized for GPT-4
Starting point is 00:25:58 and it's kind of like quite capable and whatnot, you're kind of stuck with that. But what we're saying is when there's an automatic optimizer that could try a lot more things than you as a person can, and that could go through automated processes, it's not like just a language model is doing magic. It's like there is a sort of systematic optimizer that is going through an algorithm there. You can sort of automate a lot of the hill climbing such that you could compile a T5 model that you could run on your CPU.
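Sketching that retargeting, again with the hypothetical program and metric from earlier; dspy.HFModel as the local-model client is my assumption of the relevant hook, and the model name is just an example.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Point the same program at a small local model instead of a hosted one.
small_lm = dspy.HFModel(model="google/flan-t5-large")  # assumed client and model name
dspy.settings.configure(lm=small_lm, rm=rm)  # rm: the retriever configured earlier

# Recompile: the optimizer redoes its hill climbing against the new model,
# so the program's prompts (or fine-tunes) are rebuilt rather than reused.
teleprompter = BootstrapFewShot(metric=answer_match, max_bootstrapped_demos=4)
compiled_small = teleprompter.compile(SimpleRAG(), trainset=trainset)
```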
Starting point is 00:26:22 And for many sorts of pipelines, it basically does as well as a really large, expensive model. So what's going to happen, I think, is just a broader scope of being able to offload calls to language models of various sizes and much higher variety as parts of programs, parts of these chains that people are building. And that comes with all things around monitoring and tracing and whatnot. And for the people building programs with these language models, increasingly, I think, especially with the abstraction of DSPy, where I guess it's what's called defined by run, meaning you just write your code,
Starting point is 00:26:59 and then it can actually look at which places you call the language model in and sort of under the hood figure out how to do the fine tuning and prompting. What we're going to see is, I think, increasingly the gap between just general programming and programming with models is going to be blurred. So people will just start to use these modules in their normal code to do general verification, validate user input, and do other things. But it's not going to be by writing a prompt. It's going to be through programming. So that's the whole message of programming foundation models
Starting point is 00:27:29 and not prompting them. I'm very curious, how do you see DSPy and maybe even related work evolving? Because I think DSP happened first, then DSPy, almost like an abstraction on top of it. Based on even your description, we're assuming there's going to be even more abstraction maybe on top of this, right? Yes.
Starting point is 00:27:44 To make it even easier. So what do you think will be the progression here? Because looking at DSPy today, obviously it's early. Looking at the notebooks, you have to kind of pick your teleprompter, pick a lot of things. You have to know intuitively everything that's there to be able to kind of piece together the structure of how you want to use the framework. But, you know, eventually it seems like we want to move toward a much simpler model, hopefully for the majority use case, like 90% of it.
Starting point is 00:28:11 How, I guess? Where do you see the next level of work will happen? Do we need to make DSPy have a lot more sort of like general optimizations and then we have something to express it over even more? Where do you see that happen, maybe in the short term? Yeah, I mean, in the short term, I think DSPy is just expanding the capabilities
Starting point is 00:28:28 of what we can do. And it's kind of targeting the large number of people, although it's not the majority, who are upset with the way prompting looks and want to build powerful things in a more systematic way. And I have to recognize that a lot of application builders,
Starting point is 00:28:44 they're not trying to build the best language model pipeline. They're trying to build an application that's facing a user. And the details of how to build the best pipeline are not their main focus. So for that, I'd point to the stack that we have in neural networks, where there's something like TensorFlow or PyTorch, right? There's, above that, something like Hugging Face Transformers. And Hugging Face Transformers doesn't tell you like,
Starting point is 00:29:04 here is how you build your whole new transformer from scratch and give it new layers that are completely novel. No, it just tells you here's BERT if you want to use BERT, here's GPT-2 if you want to use GPT-2. If you want to do some lightweight quick fine-tuning on top of BERT, sure, it's easy to do it this way. But at the end
Starting point is 00:29:19 of the day, it's this kind of higher-level wrapper around PyTorch, mainly, that gives you sort of off-the-shelf common models as something you could simply reuse as part of your application or kind of higher level research. And I think that's exactly what we're going to see here. I should add that the most popular set of frameworks that have existed over the past several months or so are trying to act at the Hugging Face level. But the problem is there is no PyTorch beneath them. It's kind of like implementing Hugging Face
Starting point is 00:29:47 transformers, if folks are familiar with that, using normal Python. So under the hood it's extremely inextensible. It's like trying to do loops instead of calling neural network layers or hard coding the weights of your neural network instead of just optimizing with
Starting point is 00:30:03 your favorite optimizer, like Adam or SGD or something. What I think will happen, and we're talking with people, is like, hey, you already have this kind of cool high-level abstraction. Do you want to build things in it using the abstractions we have in DSPy as a way to make development itself a lot more general and faster, but also as a way to make it a lot more adaptable so people could take your high-level pipeline and just do this high-level recompilation
Starting point is 00:30:28 and get a better program without relearning sort of all of the internals of DSPy. Although I should add that for someone who's thought about these pipelines, I think DSPy comes very naturally because there's basically like six or seven modules that you could pick from at the moment, which might sound like a small number, but if you're thinking about layers of neural networks, there's not too many. Like, you know, you could have dropout, linear layers, convolution layers,
Starting point is 00:30:51 you know, recurrent layers. There's not too many things, especially the key ones, you know, that you can reuse as building blocks. And there's a few optimizers or teleprompters that you could use from. And then you do your thing. You can structure them in any general Python code
Starting point is 00:31:03 that you want. Use your loops, exceptions, whatever it is. Just call the right modules in the right places. It's defined by run. And so I think that stack is going to really change the way we think about these programs. That's really interesting. I'm curious, if you had to suggest for someone who's a complete newbie to LLMs and AI and what could be built today, where should they start? Where is the place to dive in? And then what would you tell them they should go build and try?
Starting point is 00:31:29 What are they ultimately trying to do? Is it just like learn and get a good... Trying to learn, understand what impact all of this could have on the business they're building. Maybe they work at Cisco or work at some large enterprise. And they said, go figure out AI. And they stumble across this podcast, they stumble across your work, and they're really interested. Where should they get started? I think it's useful to understand this potential emerging stack. So I'd say DSPy is exactly in the middle, where we're sort of
Starting point is 00:31:59 giving you these high-level primitives that you can compose. And the whole abstraction is like, I mean, if you look at our code, it might be around like 3,000 lines of Python that we can refactor to be even less. It's a very small framework that can do a lot of things. So I think it's a good place to start to sort of understand that this is not about language models. This is about building these high-level pipelines and programs and understand what kinds of moving parts are important.
Starting point is 00:32:22 Then you want to probably go up a level and understand, well, what are people actually building with language models at the application layer, which is probably where someone just looking into this stuff might want to jump in. And so look at the agents in LangChain, look at the chains that they have built in. And then you probably want to go a step beneath all of these and look at like, okay, what if I wanted to write the prompts
Starting point is 00:32:43 myself manually? And I think with the understanding of these three layers of the stack, the device itself being the language model, the programming model for sort of automating interactions with them, and then like kind of prepackaged chains around all of that, I think you'll be in a place to actually select the layer you want to solve problems at. So if your problem is already solved by a high-quality chain that someone else built
Starting point is 00:33:07 and you validate that the cost is fine with you, the quality is fine with you, maybe that's all you need. And I think most of the time it will be the case that it turns out you need to customize this, make it more cost-effective, make sure the quality is better, be able to iterate over time to fix issues
Starting point is 00:33:20 and adapt it more. I think you basically only have DSPy, or rolling your own thing, right now if you want that level of control and iterative development. And so just understanding those three layers of the stack gives you options. Cool. So one last question.
Starting point is 00:33:36 It's super exciting that you are really changing a lot of the paradigm. It was great that you also, in the readme, have a lot of sections comparing why you should use DSPy versus, you know, just general prompting and LangChain
Starting point is 00:33:49 or some equivalent frameworks. And really one section that really kind of like stood out to me is like, hey, instead of the generic hard-coded prompts, DSPy doesn't contain
Starting point is 00:33:58 any specific prompts for you, right? It learns how to optimize. It learns how to generate them. I think to be able to optimize anything well, you have to be able to evaluate it well, right? I need to be able to know I'm making progress, because otherwise I could be optimizing toward the wrong thing. But today, even this evaluation step is very hard to pin down. I wonder how you think about that part of it,
Starting point is 00:34:20 because I feel like that's a very crucial part. How do you make the evaluation part work? So this is a difficult problem inherently. What DSPy offers is the right framework to think about it as an iterative process that will keep getting better over time. So we have this notion of metrics. And one thing we've been doing for a while is like building metrics that are themselves optimized DSPy programs. So you want to check, for example, you're trying to generate questions that have a particular quality, you'd have a program that checks that your questions have that particular quality. And optimizing a program of that kind is actually a lot easier than optimizing the original program,
Starting point is 00:34:58 because it's kind of a binary label for the metric itself. So what we give you here lets you start with a small amount of data and a simple program and a basic metric. And as you sort of start collecting data and eyeballing examples, what you will see is where your metric is not perfect and where your data is lacking and where your program falters. And you could basically iteratively improve
Starting point is 00:35:22 each of these three components over time, not as like seeking the perfect thing right away, but as like, here's the basic program. I'll compile it with a simple metric. Actually, turns out I needed to maybe have a little bit more input questions here. So I'll get some from putting this out in a demo or something. Turns out now I can improve the metric itself to optimize the program towards a more aligned view that I have of what the program should be doing.
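A tiny sketch of that "metric as its own DSPy program" idea, with illustrative names (the signature, criterion, and field names are assumptions, not from the episode):

```python
import dspy

class AssessQuestion(dspy.Signature):
    """Judge whether a generated question satisfies a stated quality criterion."""

    question = dspy.InputField()
    criterion = dspy.InputField(desc="the quality we care about, in plain language")
    verdict = dspy.OutputField(desc="Yes or No")

# The checker is itself a DSPy module, so it can be compiled and improved too.
assess = dspy.ChainOfThought(AssessQuestion)

def question_quality_metric(example, pred, trace=None):
    result = assess(question=pred.question,
                    criterion="The question is specific and answerable from one paragraph.")
    # A binary judgment is much easier to optimize against than free-form quality.
    return result.verdict.strip().lower().startswith("yes")
```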
Starting point is 00:35:47 And you can sort of isolate the thinking between these three stages, as opposed to the much more common approach of like, let me tweak this prompt a little bit and see if things look better now. That is the framework we want people to think in. Awesome. Well, thank you so much, Omar, for coming on our podcast. I know we have a ton of stuff we could still go into, but this is exciting and there's so much work to do. Where can people find you and where can people find more about the project?
Starting point is 00:36:11 You can find DSPy on GitHub and you can just Google my name. You'll find my website, email me or open an issue on GitHub. Just write DSPy or DSPy Stanford, you should easily find it. Thank you so much, Omar. Yeah, thanks, guys. It's a pleasure.
