No Priors: Artificial Intelligence | Technology | Startups - Open sourcing AI app development with Harrison Chase from LangChain
Episode Date: March 28, 2024

Companies are employing AI agents and co-pilots to help their teams increase efficiency and accuracy, but developing apps that are trained properly can require a skill set many enterprise teams don't have. This week on No Priors, Sarah and Elad are joined by Harrison Chase, the CEO and co-founder of LangChain, an open-source framework and developer toolkit that helps developers build LLM applications. In this conversation they talk about the gaps in open source app development, what it will take to keep up with private companies, the importance of creating prompts that can be compatible with many API models, and why memory is so undeveloped in this space.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @hwchase17

Show Notes:
(0:00) Introduction to LangChain
(1:45) Managing an open source environment
(4:30) Developing useful AI agents
(10:03) Sophistication and limitations of AI app development
(14:17) Switching between model APIs
(17:10) Context windows, fine-tuning and functionality
(21:37) Evolution of AI open source environment
(23:53) The next big breakthroughs
Transcript
Hi, listeners, and welcome to another episode of No Priors.
Today, we're talking to Harrison Chase, the CEO and co-founder of LangChain, a popular open-source
framework and developer toolkit that helps people build LLM applications.
We're excited to talk to Harrison about the state of AI application development, the open source
ecosystem, and its open questions. Welcome, Harrison.
Thanks for having me. I'm excited to be here.
LangChain's a really unique story, and it started actually as a personal project for you.
Can you talk a little bit about what LangChain is and what it was originally?
Yeah, absolutely.
So how I would answer the question, what LangChain is, has kind of evolved over time, as has the entire landscape.
LangChain, the open source package, started, yeah, as a side project.
So my background's in ML and MLOps.
I was at my previous company.
I knew I was going to leave.
I didn't know what I was going to do.
So this was in September, October of 2022.
And so went to a bunch of hackathons, a bunch of meetups, chatted with folks.
They were playing around with LLMs and saw some common abstractions, put it in a Python project as a just fun side project.
Turned out to strike a chord, with fantastic timing, you know, ChatGPT came out like a month later.
And it's kind of evolved from there.
So right now, LangChain, the company, there's really two main products that we have.
One is the LangChain open source packages, and happy to dive into that more.
And then the other is LangSmith, a platform for testing, evaluation, monitoring, and all of those
types of things.
And so, you know, what LangChain is has evolved over time as the company's grown.
One thing that we talked about the last time we saw each other in person was just how quickly
like the AI ecosystem and research field is evolving and what it means to manage an open source
project through that. Can you talk a little bit about what you decide to keep stable and change when you
both have a big ecosystem of users now and, like, a very rapidly changing environment of applications
and technology? That's been a fun exercise. So I mean, if we go back to the original version of
LangChain, what it was when it came out was essentially three kind of like high level
implementations. Two were based on research papers. And then one was based on Nat Friedman's like
NatBot type of agent web crawler thing. And so there were some high-level
kind of like abstractions.
And then there was a few like integrations.
So we had integrations with, I think, like, OpenAI,
Cohere, and Hugging Face to start, or something like that.
And those two layers have kind of like remained.
So we have, you know, 700 different integrations.
We have a bunch of kind of like higher level chains and agents for doing particular things.
I think the thing that we've put a lot of emphasis in, to your point around kind of like
what's remained constant and what's changed is like a lower level kind of like
abstraction and runtime for joining these things together.
One of the things that we pretty quickly saw was that as people wanted to improve the performance, go from prototype to production, they wanted to customize a lot of these bits.
And so we've invested a lot in a lower level kind of like chaining protocol, the LangChain Expression Language.
And then in a different protocol, LangGraph, which is something we're really excited about.
And that's more aimed at basically graphs that are not DAGs.
So, you know, all these agents are basically running an LLM in a loop.
You need cycles. And so LangGraph helps with that. And so I think what we've kind of seen
is the underlying bits of, there's all these different integrations. And like, you know,
there's LLMs, vector stores, and sometimes they change, right? When chat models came out,
like that was a, that's a very big change in the API interface. And so we had to add a new
abstraction for that. But those have, especially over the past few months, remained relatively
stable. We've invested a lot in this underlying runtime, which emphasizes a few things,
streaming, structured outputs, and the importance of those has remained relatively stable.
But then the way that you put things together and the kind of like patterns for building
things has definitely evolved over time from like simple chains to complex chains to then
these kind of like autonomous agents to now something maybe in the middle of like complex
state machines or graphs or something. And so it's really that
upper layer, which is like the common ways to put things together that I think we've seen the most
rapid kind of like churn.
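For readers unfamiliar with the pipe-style composition he's describing, here is a minimal sketch of the LangChain Expression Language, assuming the langchain_core and langchain_openai packages; the prompt and model name are illustrative placeholders, not anything from the conversation:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Compose prompt -> model -> parser with the LCEL pipe operator.
prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser()

# One-shot invocation...
print(chain.invoke({"text": "LangChain started as a side project in late 2022."}))

# ...and streaming, which the runtime supports for any composed chain.
for chunk in chain.stream({"text": "LangChain started as a side project in late 2022."}):
    print(chunk, end="", flush=True)
```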
What do you think is still missing from really getting to performant agents?
There's a number of companies that have been started recently that are really focused on sort
of the agentic world and pushing that whole thread in certain types of automation forward.
What do you view as the big components that you all don't have, or that maybe the industry more
generally doesn't have, that still need to come into place to help drive those things ahead?
Yeah, that's a really good question.
I think there's a few things.
One, I think, like, figuring out the right UX for a lot of these things is still an open question in my mind.
And, you know, that's not necessarily something we can help with. I think there's a lot of
exploration that applications need to do to figure out how to, you know, communicate what these agents
are good at and bad at to end users, and expose ways to maybe let them course correct and see what's
going on. And so, you know, I think we try to emphasize a lot of this observability of intermediate steps
and even correcting intermediate steps,
but there's a lot of experimentation around UX
that I think needs to happen.
Another big part, I think, is basically the planning ability
of the underlying LLMs.
I think that's probably the biggest,
I think when we see people building agents that work right now,
it's often breaking it down into a bunch of smaller components
and kind of like imparting their domain knowledge
about how information should flow through these components.
because I think the LLMs by themselves still aren't able to reason fully about how that should happen.
And I think a lot of research is actually around this, I would say, in the academic space.
Specifically, I think there are two different types of research papers around agents that we see.
We see some around like planning for agents.
So there's a bunch of papers that do kind of like an explicit planning step up front.
And then there are other research papers that do a bunch around reflection.
So, like, after an agent does something, is this actually right? How can I kind of, like, you know, improve upon that? And I think both of those are basically trying to get around the shortcomings of LLMs, in that, in theory, they should do that automatically, right? Like, you shouldn't have to ask an LLM to plan or to think about whether what it's done is correct. It should know to do that, and then it can kind of like run in a cycle. But we see a lot of shortcomings there. And so I think the planning ability of LLMs is a big one. And that will get better over time.
The last one is maybe a little bit more vague, but I think even just as builders, we're still figuring out the right ways to make all these things work. What's the right information flow between all the different nodes in order to get those nodes, which are typically an LLM call, to work? Do you want to do few-shot prompting? Do you want to fine-tune models? Do you want to just work on improving the instructions and the prompt? And so I think there's a lot of, how do you test those nodes? That's a big thing as well. How do you get confidence in your LLM systems and LLM agents?
And so I think there's a lot of workflow around that to kind of like be discovered and
figured out.
One thing that has sort of come up repeatedly relative to agents has just been memory.
And so I wasn't sure how you think about memory and implementing that, and what that
should look like,
because it seems like there's a few different notions that people have been putting
forward.
And I think it's super interesting.
So I was just curious about your thinking on that.
I also think it's super interesting.
I have a few thoughts here.
So I think there's maybe two types of memory, and they're related, but I'll draw some distinction between kind of like
system-level procedural memory and then like personalization type memory. So system-level memory,
I mean more like, what's the right way to use a tool? What's the right way to accomplish this
objective, independent of who exactly the person is and how I'm different than Sarah or something
like that? And then for the personalization bit, I think it's like, okay, you know, Harrison likes
soccer and he likes basketball, and I should remember that when he asks questions.
And so I think there's maybe slightly different ways that we see teams thinking about both of these.
So on the procedural side, I think the main thing that we see people doing and that we think is
pretty effective is few shot prompting and maybe fine-tuning for how to use tools,
because that's basically what it comes down to.
What's the right way to use tools?
What's the right way to plan?
And we see few-shot examples being really, really impactful for that.
And so there's just a really interesting data flywheel
of, like, monitoring your application, gathering good examples, and then plugging those back
into your application in the form of few-shot examples, that we're pushing really heavily with
LangSmith right now.
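As a hedged sketch of that flywheel, here is roughly what plugging curated examples back in as few-shot demonstrations can look like with LangChain's FewShotChatMessagePromptTemplate; the example data here is hypothetical, standing in for traces curated from a tool like LangSmith:

```python
from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)

# Hypothetical tool-use examples, curated from production traces.
examples = [
    {"input": "Refund order 1234", "output": "issue_refund(order_id='1234')"},
    {"input": "Where's my package?", "output": "track_shipment(order_id='9999')"},
]

example_prompt = ChatPromptTemplate.from_messages(
    [("human", "{input}"), ("ai", "{output}")]
)
few_shot = FewShotChatMessagePromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
)
final_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an agent that calls support tools."),
    few_shot,               # demonstrations of correct tool use
    ("human", "{input}"),   # the live question
])
```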
And then the other side of it is this, like, personalization-level memory.
And I think there's a few different ways to do this. Like, I think OpenAI implemented it in
ChatGPT, where, the way I think it does it under the hood,
is it basically has functions that it can call to say, like, remember this fact or delete this
fact. And so that's a really interesting, like, active loop that the agent is engaging in, where it
explicitly decides what it wants to remember and what it doesn't want to remember. I also think
one thing that I'm bullish on is a more kind of, like, passive background process that kind of
looks at conversations and almost like extracts insights. And then you can use those insights in kind of
like future conversations. And I think there's pros and cons to each. And I think it speaks to how
memory in general, I feel, is a field that's just, like, super, super nascent. Like, I actually
am underwhelmed at the amount of like really interesting stuff that's going on there. And so I
think it's, you know, a bunch of different approaches, no kind of like overwhelmingly best
solution.
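Harrison is speculating about ChatGPT's internals here, so the following is only a guess at that pattern: exposing explicit remember/forget functions as tools the model can choose to call, sketched with the OpenAI Python client. The tool names and storage are hypothetical.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical memory tools the model can decide to call.
tools = [{
    "type": "function",
    "function": {
        "name": "remember_fact",
        "description": "Store a durable fact about the user.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
}, {
    "type": "function",
    "function": {
        "name": "delete_fact",
        "description": "Remove a stored fact that is no longer true.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "I like soccer and basketball."}],
    tools=tools,
)

# If the model decided something is worth remembering, persist it yourself.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```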
Has the sophistication, shape, or type of applications that you see people building with LangChain, or just generally in the ecosystem, dramatically changed over the last few months?
I do think that there are more examples, kind of as Elad mentioned, of agentic applications
that are much more productive and more sophisticated, like multi-step RAG systems with much more
useful ranking.
Like, does that match with the patterns you're seeing?
Or, like, what are you seeing that excites you the most that you think is most useful?
That does generally match.
I think LangChain from the beginning has always been focused on those types of applications,
and not only the open source, but also LangSmith, the platform.
So I think a lot of the emphasis that we put into the testing and the observability is really
focused on these multi-step things.
We've always been focused on those.
Probably it's generally true in the market that there's been more of a trend towards
those.
But from our perspective, we've always been focused on those.
And so I think that hasn't been as dramatic.
I think there have been, like, interesting things within that that have emerged, just calling
out like a few things. Within RAG, I think we've seen really interesting and advanced query
analysis start to come into play. So, you know, you're not just passing the user question
directly to an embedding model. You're maybe doing some analysis on it to figure out which
retriever should I send it to or like, what is the bit that I should search? Is there kind of like
an explicit metadata filter? And so now retrieval is, like, a multi-step process, and more
of it is explicitly around query analysis.
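A hedged sketch of that kind of query analysis, assuming LangChain's with_structured_output helper (available on some chat models; older versions may want LangChain's pydantic_v1 shim for the schema). The index names and fields are made up for illustration:

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class SearchQuery(BaseModel):
    """Structured rewrite of the raw user question."""
    query: str = Field(description="Cleaned-up query to embed and search with")
    index: Literal["docs", "tickets", "code"] = Field(
        description="Which retriever to route the query to"
    )
    author: Optional[str] = Field(
        default=None, description="Optional metadata filter on author"
    )

llm = ChatOpenAI(model="gpt-3.5-turbo").with_structured_output(SearchQuery)
parsed = llm.invoke("Any open tickets from alice about login errors?")
# parsed.index should come back as "tickets" and parsed.author as "alice",
# which then drive the retriever choice and a metadata filter.
```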
Few-shot prompting and that whole data flywheel, I think we're starting to see come into play more on the agent side.
I kind of alluded to this earlier, but I think, you know, the way that we've kind of thought about
things is there's kind of like chains, which are sequential steps.
You're going to do this, and then you're going to do this, and then you're going to do this,
and you're always going to do those in the exact sequence.
And then, you know, last March or April or whenever AutoGPT came out, it was like,
we're literally just going to run this in a for loop, and it's going to be, you know, this autonomous agent.
And I think the thing that we see making it into production, and that informed a lot of the development of LangGraph, is something in the middle, where it's like this controlled state machine type thing.
And so we've seen a lot of that come out recently.
And so I'd maybe call out that as like one thing that we've really updated a lot of our beliefs on over the past few months.
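A minimal sketch of that controlled, state-machine-style middle ground using LangGraph; the node logic here is stubbed out and hypothetical, but the loop structure, a graph with a cycle rather than a DAG, is the point:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    task: str
    draft: str
    approved: bool

# Nodes return partial state updates; real nodes would call an LLM.
def plan(state: AgentState) -> dict:
    return {"draft": f"plan: {state['task']}"}

def act(state: AgentState) -> dict:
    return {"draft": state["draft"] + " -> did a step"}

def review(state: AgentState) -> dict:
    return {"approved": len(state["draft"]) > 40}

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("act", act)
graph.add_node("review", review)
graph.set_entry_point("plan")
graph.add_edge("plan", "act")
graph.add_edge("act", "review")
# The cycle: keep looping back into "act" until the review node approves.
graph.add_conditional_edges("review", lambda s: END if s["approved"] else "act")

app = graph.compile()
print(app.invoke({"task": "summarize a repo", "draft": "", "approved": False}))
```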
Yeah, I think a combination of that, and tree search, and just, like, trying to be efficient with your sampling at every step, has shown a lot of
really interesting, effective applications recently.
And I think, with, like, Cognition as one example of a surprisingly amazing agent that has come
out, like, where else do you think agentic applications will begin to work, or where have you
already seen them?
I think on the customer support side, that's a pretty obvious use case.
I think Sierra, you know, has emerged there and is doing quite well there.
I think, yeah, the Cognition demo was very impressive.
I think they did a lot of things right.
I think they really nailed a really interesting UX,
and that was maybe one of the things that I was most excited about.
And then obviously it seems to work very well,
and so I don't know exactly what they're doing under the hood.
But those types, like coding problems in general,
we see a lot of people working on.
I think there's a really nice feedback loop that you can get
by just executing the code and seeing if it works,
and as well as the fact that people building it are developers,
and so they can test it.
Coding, customer support.
There's some interesting stuff around like recommendation chatbots almost.
So I draw a distinction between that and customer support.
With customer support, you're maybe trying to explicitly kind of like resolve a ticket or something like that.
And the recommendation bit is a bit more focused on like a user's preferences and what they like.
And I think we've seen a few, I think we've seen a few things emerge there.
But I'd say customer support and coding are the two.
Klarna as well, you know, they came out and had a pretty good release.
One pattern that I think is very popular, and I can't tell if it is real or transient, is whether
or not companies will be able to switch between different LLMs, right?
whether it's a, you know, self-hosted, like, dedicated inference, you know, instance for
them or if it's an actual API provider.
But for any given application, take your prompts and go from, you know, Anthropic to
Mistral to OpenAI to something else.
In reality, it feels like, you know, the way an application responds is probably going
to be sensitive to the fact that these LLMs are actually going
to predict differently. Like, what do you think about this? Can you switch? Is that a real pattern?
It's not as easy as it seems like it should be. And I think the main thing is that the prompts still
need to be different for each model. I do think the prompts will probably start to converge in the
sense that if you think the models are getting more and more intelligent, then like hopefully
these small idiosyncrasies don't matter as much. And as more and more model providers start
supporting the same things, then that will make it easier. And what I mean by that is, you know,
so many prompts for OpenAI, which is, you know, probably the most used one, use function calling.
And, you know, up until some period ago, like, no other models did. And so you just, like,
couldn't use those prompts at all. But now, like, Mistral has function calling and Google has
function calling. And so I think they're a little bit more transferable there.
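A sketch of what that transferability can look like in LangChain, assuming the bind_tools interface on provider-specific chat models (and that the langchain-openai and langchain-mistralai integration packages are installed); the tool itself is a toy:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langchain_mistralai import ChatMistralAI

@tool
def get_weather(city: str) -> str:
    """Look up the current weather for a city."""
    return f"Sunny in {city}"  # stub; a real tool would call a weather API

# The same tool definition bound to two providers' function-calling APIs.
# Prompts may still need per-model tweaks, as discussed above.
openai_model = ChatOpenAI(model="gpt-4-turbo").bind_tools([get_weather])
mistral_model = ChatMistralAI(model="mistral-large-latest").bind_tools([get_weather])

print(openai_model.invoke("What's the weather in Paris?"))
```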
What else is on that list? There's function calling. There's visual input. Like, what else is going to
differentiate these model APIs? Context windows is one as well. So I think this gets to, like,
yeah, what's the right context that you can be passing? If it's longer, you know, if that changes,
then that changes the whole architecture of your application.
Modalities is one. Prompt injection, for safety. Yeah, I think that's interesting. I think that's a real
enterprise concern. I think a lot of the agent builders are still just figuring out how to make
agents work. This is a different axis, almost. But to the point around, like, switching models,
I do think we see a desire for this, especially when you start going to scale. So I think it's like
make something work with GPT-4, but then, okay, you're rolling it out. Are you really going to eat
that much cost with GPT-4? Can you use GPT-3.5? Do you want to fine-tune?
And so I think that transition is where we really start to see people thinking
about switching models. There's definitely some switching models at the beginning, like if you just
want to play around with different models and see their capabilities. But I think the most pressing
need to switch models happens when you go from prototype to scale. Cost and latency would be
differentiators there as well. One thing you mentioned I thought was really interesting is just
context windows. And obviously, Gemini launched with a million token context window. And I was just
curious how you think about context window versus RAG versus other aspects of the model and how all
those things tie together. And, you know, once we get to very long context windows and the tens
of millions of tokens, like, does that really shift things radically or how does that change
functionality? And so I was just curious, since you've thought about how all these things piece
together, I was just curious how you think about those different factors and what they mean.
Very good question that a lot of people are thinking about who are a lot smarter than me.
I think, I mean a few thoughts. I think, like, longer context windows definitely make, like,
single-shot things much more realistic.
Like extraction of elements in a long PDF.
You can do that one-shot.
RAG over a single long PDF, or, like, five long PDFs.
Okay, cool.
You can do that.
You can do that one-shot.
I think there are definitely things at scale that don't fit, you know, into a single context
window.
There are also things where it requires iterations.
You need to like decide what to do, interact with the
environment, get that back. So this whole idea of chaining and agents, I don't, like, that's less
around context windows and more around interacting with the environment and getting feedback. And so I don't
think that's going anywhere. I think with respect to RAG in particular, because I think that's
where it often comes up, like, you know, did this kill RAG? I think there's a few things. Actually,
just today, one of our team members, Lance Martin, released something on this. Everyone's doing the
needle-in-the-haystack thing, and now all these models are, like, green across the board for whatever reason.
They've all figured it out.
But I think, like, that actually really doesn't reflect a lot of RAG use cases, in my opinion,
because, like, the needle in the haystack is, okay, given this long context,
can I find a single information point?
But oftentimes, RAG is about seeing multiple information points and then reasoning over them.
And so I think, well, the benchmark he released is exactly that.
Like, as you increase the number of needles, you know, performance goes down, as you might expect.
And then also when you ask it to reason rather than just retrieve,
performance drops as well. And so I think there's more work to be done there. And then I think
another thing is just around the ingestion for RAG in the indexing process. Like a lot of attention
has been paid to like text splitting and chunking and all of that. And I don't know exactly
how that will change. Like will you still do that but you now just retrieve the whole document?
Like, we have a concept in LangChain of, like, a parent document retriever, which basically
creates multiple vectors for each document. So maybe you just do that.
Maybe you still, maybe you chunk it up into larger chunks and just retrieve those larger chunks.
Maybe you use a traditional search engine, like Elasticsearch or something.
I'm not, I'm not sure.
That's probably the place I have the least confidence in.
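For reference, a hedged sketch of the parent document retriever concept mentioned above, using LangChain's ParentDocumentRetriever: small child chunks are embedded and searched, but whole parent documents are returned. The in-memory store and Chroma vector store are just illustrative backend choices, and the document content is made up.

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = [Document(page_content="A long report about retrieval strategies...",
                 metadata={"source": "report.pdf"})]

retriever = ParentDocumentRetriever(
    # Small child chunks get embedded and searched here...
    vectorstore=Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings()),
    # ...but the full parent documents live here and are what gets returned.
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),
)
retriever.add_documents(docs)
hits = retriever.get_relevant_documents("what does the report say about chunking?")
```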
The one other area that I see a lot of people talking about, and I see fewer people actually doing, is fine-tuning.
And to some extent, I think that's because with fine tunes, you lose generalizability.
And so people just start focusing on prompt engineering or other ways to effectively get the same performance without the actual fine tune.
But it's something that people talk about a lot and people talk about doing a lot.
You probably have a great perspective since you see so many different types of customers.
Are you seeing a lot of fine-tuning happening in the wild?
And if so, are there specific common applications or use cases for it?
We see people experimenting with it.
I think the only real place where they're doing it is when they've reached really critical scale,
which I still don't think is that many applications to date.
I think there's a lot of difficulties with it.
One's like gathering the data set for it.
And so I think a lot of the things we have in LangSmith tackle a lot of these issues,
but like gathering the data set for it.
So like having that data visibility and starting to curate that data set,
evaluating the fine-tuned model.
So like evaluation and testing is a huge pain point there that we're trying to tackle in a few ways.
The third is just like, yeah, back to this point of people are still just like experimenting so rapidly.
It's much harder to change a fine-tuned model than it is to change a prompt,
or even change few-shot examples. And so I think we're seeing more and more people use few-shot examples,
but not a ton graduating to fine-tuning, just because, yeah, I think it's much harder to just
iterate quickly on. In terms of other major changes in the landscape, it's been a big year. The first
commit to LangChain, I think, was in October of '22, which is, like, when I launched
Conviction as a fund as well. At that time, we didn't have Llama 2. We didn't have Mistral at
all. There were not nearly as many open source models with what people would consider to be a more
useful reasoning ability. Has that changed in terms of, like, what you see application developers do
with LangChain? Gemini too. Oh, and Gemini, yeah. Fun story about that. The original models that
we launched with, OpenAI actually deprecated, like, a month ago. So the, like, actual original LangChain,
you can't run, because the models don't exist anymore. But
Yeah. Like, I think we see increasing interest in open source, but the reasoning
abilities are still just, like, lagging behind Claude 3 or GPT-4. And I think it
kind of probably depends on the types of applications that you're
building, but a lot of the applications that LangChain is focused on have this kind of, like,
reasoning aspect, and those are just so crucial. And I still don't think we see
super compelling reasoning abilities in the open source
models. And maybe that's one of my hot takes, but I think for a lot of the LangChain apps, the
open source models maybe don't live up to a lot of, kind of, like, the Twitter hype or Twitter
excitement, at least not yet. Zooming out, you have a really broad view. What do you feel
like no one is working on, that should be, that's going to enable better applications?
I think the most exciting stuff is at the application and UX layer right now.
I think that's where the most exciting stuff is there.
One of the, I don't know if this is maybe more the capabilities side-ish,
but like memory I think is super interesting, especially like personalized long-term memory.
I don't know if, I don't know if it's necessarily tooling so much that needs to be built there
as it's just like an application in a UX that's really focused on that.
And, you know, if I wasn't doing LangChain, if I was starting a company right now, I'd probably start something at the application layer, and it would probably be something that really takes advantage of, like, long-term memory.
I guess at the high level similarly, is there anything that you view as, like, a major prediction or things that'll change over the next year that nobody's really paying as much attention to?
Memory is a big interest of ours, and so I hope that we'll have some kind of like breakthroughs there.
I think a lot of the, specifically around, yeah, learning from interactions, incorporating that back in at a user level.
In a similar vein, also this type of more like system level memory I think is really interesting and building up, building towards this idea of almost like continual learning.
So there's, you know, like can you learn from your interactions and you can do that in a variety of different ways.
This may just be where we sit in the ecosystem, but one exciting and probably under-talked-about way is just the idea of
building up few-shot example data sets and really using those.
I think it's much faster and cheaper than fine-tuning models.
It's easier to do than trying to like programmatically change the prompt in some way.
Like, that's still kind of like a bit of an art.
And so yeah, towards continual learning with few-shot examples is maybe one like really interesting area that we're excited about.
Can you help our listeners just imagine, a little bit more viscerally, what
type of application experience that would enable, like, you know, a consumer application or an
application, and what that type of continuous learning would allow you to do? Yeah, absolutely. I think at a
high level, it would basically allow the application to automatically get better over time. And it could
get better in the sense that it's just more accurate. So, you know, maybe
it first makes a mistake. You then, like, tell it that it made a mistake, and it automatically kind
of, like, incorporates that as a few-shot example or an update to a prompt. But it starts learning from
its mistakes and its successes as well, right?
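As a rough, hypothetical sketch of that loop, with no particular library assumed: a thumbs-down plus a correction gets captured as a few-shot example that the next prompt picks up.

```python
from typing import Optional

# In-memory stand-in for a curated few-shot example store.
few_shot_store: list = []

def record_feedback(question: str, answer: str, thumbs_up: bool,
                    correction: Optional[str] = None) -> None:
    """On a thumbs-down with a correction, save the corrected pair."""
    if not thumbs_up and correction:
        few_shot_store.append({"input": question, "output": correction})

def build_prompt(question: str, k: int = 3) -> str:
    """Prepend the most recent corrected examples to the next prompt."""
    shots = few_shot_store[-k:]
    lines = [f"Q: {s['input']}\nA: {s['output']}" for s in shots]
    return "\n\n".join(lines + [f"Q: {question}\nA:"])

# The app answers, the user corrects, and the next prompt improves.
record_feedback("Capital of Australia?", "Sydney", thumbs_up=False,
                correction="Canberra")
print(build_prompt("Capital of Australia?"))
```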
There's a really cool project called DSPY or DSPI.
I don't know how to pronounce it, but it's out of Stanford.
I say Disby.
Oh, no, there's three ways now.
I say Diaspa.
No, I'm just kidding.
So, and I think that actually tackles, like, I actually see a lot of similarities
between that and LangChain, LangSmith in some ways.
And I think it's all towards this idea of, like, so DSPY,
Dispee, or whatever. It's basically this idea of, like, optimization. You have kind of like
inputs, outputs. You then have your application, which they similarly think of as, like, multiple
steps. And you basically optimize your application through a variety of different ways.
The main one of which I would say is probably few shot examples, although we'll probably do a
webinar with Omar and he can correct me if I'm wrong. And I think the idea of like continual
learning is basically doing that optimization, but in an online manner, where you don't have
like ground truth necessarily, but you get feedback from the environment, thumbs up, thumbs down
if things are good. And so I think, yeah, that kind of like optimization loop, whether offline or
online is really, really exciting. And I think a similar thing could maybe, I think you can think
of like personalization also as like what this would look like to end users and maybe like consumer
facing apps. So you start with like a generic application that does the same thing for everyone.
but then it maybe learns to search the web differently for me and Elad, or something like that.
And so I think that's, like, concretely how it could manifest.
Cool. Thanks so much for doing this. It's obviously a pleasure to have you on.
No, thank you guys. Good to see you.
Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces.
Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week.
And sign up for emails or find transcripts for every episode at no-priors.com.