The Infra Pod - What happens when LLMs have a API App store? Let's chat Gorilla and beyond!
Episode Date: October 9, 2023. Ian and Tim are back to interview another YAIGer, Shishir G. Patil, a PhD student at Berkeley who worked on the popular LLM research project Gorilla, which allows LLMs to call APIs automatically. We dive into how Gorilla works and the future implications it brings to the infra / developer space.
Transcript
All right, welcome to our yet another Infra Deep Dive podcast.
As usual, Tim from Essence VC and Ian, let's take it away.
Awesome. I'm Ian, currently helping Snyk turn into a platform.
And I am super excited today to be joined by one of the authors of the Gorilla GPT paper,
Shishir Patil.
Can you please tell us a little about yourself?
Yeah. Hello, Tim. Hello, Ian.
I'm Shishir, and I'm a fifth-year PhD at Berkeley.
So I'm part of the Sky Lab,
which was previously the Rice Lab,
and also the Berkeley AI Research efforts.
And right now, I've done a bunch of work on systems
for ML. And for now, I think the focus has been on how do you teach LLMs to invoke APIs. And this
is what I'll be talking about today. Before this, I spent a couple of years as a research fellow at
Microsoft Research. And even before that, I finished my undergrad back in India.
Amazing. So you've been in this space, in the research for a long time. And now you're hyper-focused on this very interesting overlap,
which is how we take natural language and turn it into an API call.
So a human can program effectively with a sentence, which is pretty incredible.
Can you explain to us what is your Gorilla Paper specifically?
And what does it enable us to do?
So yeah, the high-level idea is, you know,
sometime around November, December last year,
when we started playing with LLMs, like almost everybody else, we realized that
chatting is a great demonstration of the technology of LLMs,
but it can get in the way if you want to get something done.
Like you don't want things to be chatting unnecessarily.
Like, you know, if you're having dinner or lunch, you like chatting.
But if you want to get something done, then you want to get it done ASAP. You don't want to sit and chat,
right? So this was the idea. And once you realize this, it's like, okay, so LLMs are a powerful tool.
And then the utility of a tool increases when more people use the tool. And when tools can
talk to other tools. When you connect tools is when the utility of the tool increases.
So this was the idea, or at least the genesis of the idea.
And we were like, okay, so tools need to talk to each other.
And computer systems, the way tools talk to each other is through APIs.
So that's like the well-defined request-response kind of interface
that you have where different tools talk to each other.
So this was the idea.
Then we were like, okay, so can we now train an LLM
to actually go ahead and invoke API calls? So now the LLM can go ahead and then use different tools and talk to the rest of
the world through this API interface. So Gorilla is an LLM where if a user asks a question in
natural language or defines a task in natural language, it'll pick the right API to call,
which will get the job done for you. So this means not just regurgitating information that it knows, and not even generating creative content,
but actually,
how do you A, read the state of the world?
By world, I mean different services, products,
and then go ahead and do an action
that's going to bring about a change
in the status of the world.
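To make that concrete, here is a minimal, hedged sketch of the kind of interaction being described: a natural-language task goes to a Gorilla-style model served behind an OpenAI-compatible chat completion endpoint (the project exposes one of these, as discussed later in the episode), and the reply is expected to be the API call to run. The base URL and model name below are placeholders, not the project's actual values.

```python
# Hedged sketch: ask a Gorilla-style model, served behind an OpenAI-compatible
# chat completion endpoint, to turn a natural-language task into an API call.
# The base_url and model name are placeholders, not the project's real values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder for a hosted Gorilla-style endpoint
    api_key="EMPTY",                      # many self-hosted endpoints ignore the key
)

task = "Translate this English sentence to German using a pretrained model."

response = client.chat.completions.create(
    model="gorilla-placeholder",          # placeholder model name
    messages=[{"role": "user", "content": task}],
)

# The reply is expected to contain the API call to make (e.g. a Hugging Face
# pipeline invocation) rather than a chatty answer.
print(response.choices[0].message.content)
```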
Interesting.
As I said at the beginning,
it really does allow us to open up programming
as well as to string tools together,
which if you kind of sit back and think about it, a lot of what programming is
is stringing these different little tools,
different little coroutines together.
How does your system
learn about the different tools
in the world? For example, I looked at your demo
and you have this demo on the CLI
that says, hey, I want to get stuff from this
bucket and put it in this other place and it generates
the perfect aws s3 sync
command to basically take stuff from one bucket to another.
How are you teaching the system?
Where does that corpus come from?
And how is it auto-correcting to give you the right thing?
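A hedged sketch of roughly what that CLI flow could look like under the hood: take a natural-language request, get a suggested shell command back from the model, show it to the user, and only run it on confirmation. The suggest_command helper below is a hypothetical stand-in for the model call; the aws s3 sync command mirrors the demo described above.

```python
# Hedged sketch of a confirm-then-execute CLI loop like the demo described above.
# suggest_command() is a hypothetical helper standing in for the model call.
import subprocess

def suggest_command(request: str) -> str:
    # In the real system this would query the LLM; hard-coded here for illustration.
    return "aws s3 sync s3://source-bucket s3://destination-bucket"

request = "copy everything from my source bucket into my destination bucket"
command = suggest_command(request)

print(f"Suggested command:\n  {command}")
if input("Run it? [y/N] ").strip().lower() == "y":
    # A split argv (no shell=True) avoids accidental shell injection.
    subprocess.run(command.split(), check=True)
```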
The short answer is blood, sweat, and tears.
I mean, there's no magic bullet.
You just go ahead and get all the tools' information.
But the important question here is not how do you get it,
but what do you want to focus on?
What APIs do you want to support?
Because they're just a dime a dozen.
And the thing is, getting these APIs is not challenging.
And this is actually pretty unique to what we're doing
compared to what the Midjourney folks or the text-to-image
or the text-to-text people might be doing
is mostly because today there's a lot of lawsuits on,
oh, you use this copyrighted book content,
and hence it's not kosher.
But in the case of APIs,
the incentives are very well aligned, right?
The more people use an API,
the more money you make as AWS.
So AWS itself has an incentive to now
wrap its APIs in new tutorials,
in different modalities: there's, like,
you know, REST APIs,
there's obviously the CLI, Python, so on and so forth.
And not just that, but also it's like,
they give out OpenAPI specs,
like a documentation, tutorials, et cetera,
to make people use it.
So getting this information is actually pretty easy.
The implementation might be tricky,
but in terms of incentives and getting access to it,
it's pretty straightforward.
And then people appreciate you doing it.
So that's straightforward.
The only concern, if I may,
is that there are different modalities; some of the websites may be JavaScript, etc. So then
how do you get this data? If they give you neat JSON Swagger files, that's great. But not
everyone has that neat documentation. So collecting this takes some time. But
once you have this, then you can go ahead and implement it. The answer to your second part of
your question is determining which ones to pick is in some sense a function of what you want to get done. The first
set of APIs we picked, we were trying to write a paper. So we picked HuggingFace, TensorFlow,
TorchHub sort of APIs for the ML community. It's also interesting because these APIs are
pretty challenging. Between a Stable Diffusion V2 API and a Stripe API, it's very diverse.
It's relatively easy for you to
distinguish and figure out what's going on. But say between a Hugging Face Stable Diffusion API
versus a TorchHub Stable Diffusion API, it's pretty challenging. They're not free-form; they have the same
templates. The Hugging Face one may have additional pipelines and some pre-training boilerplate
template code as well. So it's quite challenging. So we picked these as the sort of APIs we support.
But then later on,
once we realized
nobody uses these APIs,
we moved on to like
Kubernetes, AWS, GCP,
all the hyperscalers,
basically Azure
and a lot of like
Linux man pages
and Linux commands.
Sorry.
Yeah.
Amazing.
And I'm curious,
what do you think we need to do
over the next year
or so or six months even to see the advancement to go from the paper you have today and the GitHub repository and the CLI to actually putting this in the hands of people building products?
Where are we at today and how far does this need to go before we can start embedding this sort of task-oriented NLP-to-API stuff in more products?
Right. That's a good question.
Probably the way I would think of this is
into two different branches, right? One
is, can we make this into a product?
And the way I see that today is, for now,
it exists. And I think it's like
we're at least fairly confident by
looking at the usage that there is value today
to use it. So I think now the question
is, how focused can you be
in adding more new APIs? Like right now,
I think a lot of people are asking us for like Salesforce
and ServiceNow and Datadog sort of APIs.
So can we include this for like a broader community
who can use it?
So that's one part of it,
which is, you know, X needs to be done.
Can you get it done?
But there's also a lot of interesting research questions
if you want to grow this.
This is along the lines of, okay,
what if you have three or four different APIs
and you're trying to compose all of them together to get something done?
This could be like, oh, you're
doing a podcast, you ping someone and say, hey, would you
be interested? And if they respond in the positive,
can you send them a Calendly link? Make sure it's
set up, and then send them an invite to make sure
they join on time. And probably also remind them
not to shut their laptop down, like how I just did.
So, yeah, it's like, you know,
these things where can you automate this
sort of pipeline? And this gets tricky.
Like, it's easy if you have the same framework, but if you have different frameworks, right?
Like, oh, here's a JavaScript API to do X, but then you need to change back to like a RESTful API to do Y.
And then, by the way, can you execute this Python command to get C done?
So it's like, if you have this sort of like multi-modality APIs, then how do you do that?
So I think that's pretty challenging and an open problem. And two is,
and this is probably quite relevant as
more and more people start using it, is how do you make it
robust? So what do I mean by that, right?
Suppose you're trying to download a file from the internet.
It fails. You try again. Great.
But maybe sometimes that's not always the right
scenario. So when something fails,
instead of trying again, you might want to
try a different mirror, right? So it's like,
oh, you know, this particular bucket on GCP failed.
Can I use this Oracle Cloud bucket, which seems to have less traffic and so I can get
good network?
Or maybe for some APIs that may not be true.
Like if it's Stripe, if something fails, you have to investigate.
You can't do it again and double charge the user.
That may not be ideal.
So determining what the failure mechanism is in the real world is actually pretty interesting.
And I think it's still an open research question.
These are the three fronts which I think are critical to take this into your daily workflow.
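One way to picture that robustness point: the right reaction to a failure depends on the API. Below is a hedged sketch, with made-up policy names and API identifiers, of how a per-API failure policy might be encoded: retry a plain download, fail over to a different mirror for object storage, and never blindly retry a payment call.

```python
# Hedged sketch: per-API failure policies, since "retry" is not always the right
# reaction. The policy names and API identifiers are illustrative, not from the paper.
from enum import Enum

class OnFailure(Enum):
    RETRY = "retry"              # safe to call again (idempotent reads, downloads)
    FAILOVER = "failover"        # try an alternate mirror or region instead
    INVESTIGATE = "investigate"  # e.g. payments: never blindly retry and double-charge

FAILURE_POLICY = {
    "http.download": OnFailure.RETRY,
    "gcs.get_object": OnFailure.FAILOVER,      # fall back to another bucket/mirror
    "stripe.charges.create": OnFailure.INVESTIGATE,
}

def handle_failure(api_name: str) -> OnFailure:
    # Default to the conservative option when the API is unknown.
    return FAILURE_POLICY.get(api_name, OnFailure.INVESTIGATE)

print(handle_failure("stripe.charges.create"))  # OnFailure.INVESTIGATE
```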
Maybe one thing I'd like to clarify, because when you hear the word API,
the first thing I think of is all these web online APIs you can call,
any Google web service, you know, like the ones you have in your examples, but this could be any SaaS APIs or open APIs. But when you're
looking at the Gorilla paper, I think you pick the ML models you can actually choose with the
right parameters, what we call Hugging Face. And that's a particular API. And to make
that work, I guess, as a paper, so you can have the accuracy and everything, you had to pick almost like a lane to focus on.
I think in the paper,
you kind of already explained why you focus on it,
but maybe just talk about why did you pick that?
And from a research or maybe even for a personal interest,
what's next for you to go to explore?
Are you going to keep going down the ML models
and doing more of that,
or are you going to go to other domains as well?
Yeah, so the answer is very practical.
There's not much science to it.
We were trying to write a paper for a particular machine learning community.
This is not the community that understands Stripe and admin APIs.
Let's pick a set of APIs that people would understand and appreciate.
So that's the single focus that we had is to write a good academic paper for the community.
But in terms of focus,
I feel like we have tried to be less married
to the idea of, like you mentioned,
like the Python Hugging Face or TensorFlow APIs
to actually expand it to also include RESTful APIs
or like, you know, even SQL
or any of the other APIs that exist.
And today we have some collaborations
where we also launched
GraphQL APIs with
Gorilla. So it's like, yeah,
the goal is, can you expand the wide
set of APIs? And we want to do this
slowly to make sure that our techniques and recipes
still hold, right? So one thing
that we do in the paper is, how do you measure hallucination
using abstract syntax tree
subtree matching? Now, you know, it is
clear to us how to do this for machine learning APIs,
but now also the idea
is how do you expand this
beyond machine learning
to like RESTful API calls?
You know,
you just match the header
or just the website
or even the body
and, you know,
all the different parameters.
So that requires some thought,
but yeah,
our plan is to like expand this
to most of the APIs.
But what we don't want to do
is coding.
And this is actually a very subtle difference.
Between coding and APIs,
it's easier and harder in different scenarios.
Like in coding, there's branching,
there's decision variables,
there's also looping.
APIs, there's none of that.
More often than not, there's none of that.
But at the same time, APIs are very brittle.
If you make small mistakes in coding,
as long as it's not syntactic,
your program still runs.
But you have multiple chances to correct it.
API call, you do a wrong call,
you get 404, that call is done.
Then you may go to the next call,
but at least that call is buggy.
Our focus at least has been purely to do APIs
and not do coding.
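The AST subtree-matching idea mentioned above can be pictured with Python's own ast module: parse the generated code, pull out the function it calls, and count the output as hallucinated if that call target does not appear in a reference set of known APIs. This is a simplified, hedged illustration, not the paper's actual evaluation code, and the KNOWN_APIS set here is invented.

```python
# Hedged, simplified illustration of AST-based hallucination checking:
# parse the generated code, extract the dotted name it calls, and see whether
# that call appears in a reference set of known APIs. Not the paper's actual code.
import ast

KNOWN_APIS = {
    "transformers.pipeline",
    "torch.hub.load",
}

def called_name(node: ast.Call) -> str:
    # Flatten a call target like transformers.pipeline into a dotted name.
    parts = []
    target = node.func
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if isinstance(target, ast.Name):
        parts.append(target.id)
    return ".".join(reversed(parts))

def is_hallucinated(generated_code: str) -> bool:
    tree = ast.parse(generated_code)
    calls = [n for n in ast.walk(tree) if isinstance(n, ast.Call)]
    return not any(called_name(c) in KNOWN_APIS for c in calls)

print(is_hallucinated("model = transformers.pipeline('translation_en_to_de')"))  # False
print(is_hallucinated("model = made_up_lib.translate('en', 'de')"))              # True
```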
One of the things this makes me think about is like,
and I heard this quote from several others,
and I've seen this in my own experience,
is when we go to adopt any form
of machine learning, the model tends to be
just this core nugget
of a much broader system needed to
make it all possible. One of the
things I can think of is, on the offline
side, is building up a larger training corpus,
enabling more modalities.
There's a lot of research questions, but
what systems need to exist
for us to truly scale to something like a real rollout?
Both in terms of the API surface area, but also in terms of solving some of the problems you just mentioned.
Which is like, oh, well, the API 404'd, we shouldn't go to the next step.
We need to go back, and maybe there's a learning loop that needs to feed back into the model so we can have the right chain.
So that ultimately, from a customer's or user's perspective, and I use
customer because I sit in the business of selling
software, but they get a
successful outcome. So they actually get the productivity
improvement that we're all trying to sell them, which
is this idea that you mentioned, which is like, hey, we need
to schedule an event and we
need to invite Tim and Ian over to Berkeley
and it just does it.
Obviously, under the hood, there's the lossy
network that's involved.
So mistakes happen,
API calls aren't correct, blah, blah, blah.
So yeah, I'd love to get your perspective.
When you think about,
I know you're focused heavily
on this core nugget of the system,
what else needs to exist?
And where are we at
in terms of that systems engineering today?
Right.
So I think on the LLM-to-API side, at least today, there are a few more things to be done, but it's a matter of time before they get done.
But the one piece of tooling that I think is still a pretty open question is who's going to execute this?
It's like, great, you want to get something done.
Here's a Stripe API to call, but who's going to call it for you?
And as much as you poke around, beyond the exec function from Python, or beyond writing a small bash script that's
running as a microservice and executes it for you,
there's really no platform that executes
the APIs for you. This is tricky
because you might have environment variables.
For example, you're calling the OpenAI chat completion
API. Well, you need to have the OpenAI key
in your environment variable. You're calling AWS,
you need to have the AWS config file set,
so on and so forth.
Who manages all of these
quote unquote secrets?
At least that's what GitHub calls them
in its CI/CD pipeline,
and so on and so forth.
So yeah, how do you take care
of environment variables?
And two, where are you going to execute it?
And these are like unknown questions
even today.
And it's not clear,
especially if you imagine the scenario
where there's one provider
who's executing it for you.
What would that look like? That's one.
And the second one is, what about state? Who's going to maintain state?
So if you say something like, can you get me data from this S3 bucket?
And then if you say, can you please delete that bucket?
What does that refer to? Especially if it's like a long running conversation.
Or if you were to say, hey, can you give me access to this bucket
where you specify exactly what this is,
then who's going to go and check
that do you have access to that bucket or not, right?
So are you going to read the state of the world
by querying AWS?
At which point do you maintain that state yourself locally?
And if so, how do you maintain state?
So fundamentally, along execution:
who's going to execute,
and what's the best modality to execute? And second,
how do you maintain state? And who's going to
maintain state? You can do
naive things and get away with it, but it's
not very scalable. For AWS, you might
maintain state by reading state. For the RESTful
API calls, well, if you know that an
endpoint is failing, you don't call that again.
So then who's going to maintain the
fact that that's failing?
Stuff like that.
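A hedged sketch of the two gaps described above: an execution step that checks the environment variables (the "secrets") a call needs before running it, and a tiny state store so that a follow-up like "delete that bucket" has something to resolve against. The names, structure, and bucket are invented for illustration; the AWS CLI commands are standard ones.

```python
# Hedged sketch of an execution harness: check required secrets before running a
# generated command, and keep minimal conversation state so references like
# "that bucket" can be resolved. Names and structure are invented for illustration.
import os
import subprocess

REQUIRED_SECRETS = {
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "openai": ["OPENAI_API_KEY"],
}

state = {"last_bucket": None}  # minimal state for follow-up references

def execute(provider: str, argv: list[str]) -> None:
    missing = [v for v in REQUIRED_SECRETS.get(provider, []) if v not in os.environ]
    if missing:
        raise RuntimeError(f"Missing secrets for {provider}: {missing}")
    subprocess.run(argv, check=True)

# First turn: "get me data from this S3 bucket"
state["last_bucket"] = "s3://example-bucket"
execute("aws", ["aws", "s3", "ls", state["last_bucket"]])

# Later turn: "delete that bucket" -- resolve "that" from the stored state.
execute("aws", ["aws", "s3", "rb", state["last_bucket"], "--force"])
```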
Right, that makes total sense and fits in with
a broader narrative that Tim and I have been discussing
on the pod, which is
there's all these other systems infrastructure
work going on, which are truly, in many ways,
if we look at Gorilla
as a place we want to get to, the thing
we want to enable, there's all this work that we have to
do still as an industry
around stateful workflow management, around secure code execution,
around blank, blank, blank, blank, blank.
And so AI is really that accelerator.
It really answers the question, well, why?
The why is we want to get over here to this North Star,
this experience that enables a lot of people. In order to get there,
we have all these problems that have to be unbundled.
Amazing. I'm curious, for your perspective, how much does model evaluation time fit into the equation?
And how much does some of the work around edge compute play into the future of experiences like Gorilla?
Is it really important as a key enabler?
Or is it something you say, yeah, it's nice, but we've got bigger fish to fry first?
That's a good question.
And the way I think about this is,
I'll take the edge one first,
is that that's going to be critical, right?
This is not even like,
oh, edge computing is privacy-preserving and hence it's better, or you're in full control,
which is all true.
But even if you were to say,
look, I'm ready to sell my data to the highest bidder.
I don't care about privacy.
How do I get it done?
Even in these scenarios, right?
There'll be some tasks which require freshness of data
and the efficiency and the high reasoning capabilities,
which will still be the realm of the cloud providers.
But there are a few things for which using the cloud
may or may not make sense.
One example is, you know,
if you have like the latest Pixel or even an iPhone,
if you take an image of you at a beach
and there's some people behind you,
there's a magic eraser that can erase people
from your background, right? Well, so that's a value-added service. Nobody makes any
money on it. It's not latency critical. Suppose you're using Instagram and you want to use a
filter that's using some big multimodal model. Whether it takes 2 milliseconds or 20 milliseconds or 200
milliseconds, if you're using an Instagram filter or a TikTok filter, it doesn't really matter much.
So in these scenarios where the value-add is not very critical, and most people are using
it as a free service, would the large providers continue to bankroll it? Unclear, right? So I feel
like in those scenarios, you might say, hey, look, your devices are getting more and more powerful.
Let me just push this model locally for you. And then you can run it. There's also like,
my family is in India, if I want to take a picture of all of us skiing,
that doesn't exist.
So this would require some amount of fine tuning.
Can I probably run some of the small fine tuning steps
locally on my device, keep the model on my device,
and then generate this picture?
In these realms, I think probably Edge is the way to go.
But in domains like, oh, can I book a flight ticket
to Hawaii, and also, by the way, recommend hotels to me
near this convention center, then this requires freshness
of data hitting a bunch of APIs.
This may be more realm of cloud, right?
So I feel like, yeah, that's probably
what I think is a cloud-edge
divide. Since you have
the paper out, you created a Discord
channel, and I'm in that Discord channel, and you've
got hundreds, I think thousands,
of people there just constantly
having a bunch of discussions and asking questions.
And when you have a community, it's always fun.
What are some of the biggest highlights, I guess, for you working with the larger community?
What are some interesting ideas or things that you learn from a community that you think are actually great?
Either use cases or future research ideas?
Yeah, at least all of my projects so far have been open source projects that are out there that people use.
I've been doing this for a while now, and I think this is really helpful.
So for example, when we had Gorilla, one thing we realized was
that not very many people could actually download the models and use them.
And we got this feedback within the first few hours.
So immediately, we put up the
Colab notebook where you're like, oh look, you can just hit our chat completion APIs.
You don't have to give any of your keys, et cetera. So a lot of people use it.
So then we realized that a lot of people were using it, but now they were interested in
integrating this into their workflows. So then we were like, can we expose this OpenAI-equivalent
chat completion API that people can use within their workflows, with
their LangChain or some of the agents that they may be building? And then we realized, oh look, a lot of people were like, I understand this, but I don't want to use APIs a lot. It's
unclear to me how Gorilla will be useful to me. And we're like, that's a good question.
So then we're like, okay, what if we have the CLI tool, which is a complete end-to-end tool.
So you ask a question on the CLI, it shows you a bunch of suggestions of commands,
you execute it,
you get a response.
And then you're like,
ah, I see how this works.
You know what,
now I'm going to do X.
So I feel like community
is great to give you direction
on where people are getting confused
and how best to help that.
But it also gives you
a high-level direction
what to focus on.
Probably if it was just us,
we added TensorFlow,
TorchHub, HuggingFace,
we might have thought of
a couple of extra APIs to add.
But then it was the community
who told us,
hey, can you give us Kubernetes?
That's the bane of every developer.
We all use it.
It's super hard.
Can you help us with that?
So then we added this
and there's been like
a bunch of traction.
And also it's from the Discord,
we learned that,
oh, like a lot of people
want to learn
the stock price of Meta
on April 21st
or, you know,
something like that. So then, okay,
I mean, now support the finance
set of APIs in this different form factor.
So yeah, it's been actually
quite helpful. In terms of research,
most of the community won't tell you exactly, oh, here's
a research question, go solve it.
But if you look at the trends and if you realize
that, okay, so a lot of people do not
have their own LLMs, right? Today, it's
like everybody wants to train their own LoRA adapters to have their own
LLMs, but they want to connect that to some execution engine that can expose APIs.
And you're like, oh, okay, so this is where the research question is.
All right, so we're going to go into the most fun part of this pod, which we call the
spicy future.
Spicy future.
You know what it is, hot takes, spicy takes, whatever it is.
So tell us what you believe will happen,
especially in this space where the LLM is integrating with APIs or tools, right?
Like what's the next few years of the state of the art, right?
And maybe describe maybe a bit of the end state.
And it will be also helpful to tell us
what are the key things you think will unlock this.
Okay, Tim.
So my take on this is that today the way it works
is you as a user want to get something done.
You talk to the LLM.
LLM tells you to do X and then you go and do something.
As an example, if you want to install CUDA,
you say, hey, I have this Ubuntu 20.04,
this particular GPU, can you tell me the CUDA,
the CUDAN and the TensorFlow torch, whatever version, and then you go and install it, right? But you are the bottleneck in this process.
So that shouldn't be true. What I think is going to happen is you as a user are going to ask an
LLM to do X, the LLM is going to execute and then show you the results and you either accept or
reject it. Humans are good discriminators, but not good generators. So you let the generator do its
job, do not get in the way, do not be the slow link. And once it does,
you either do the next step or you figure
out what's going on. Instead of you at the center,
you're now going to have the LLM at the center performing
things and you are just talking to the LLM
and not to the rest of the tools.
And so how do we get away
from that? How do we get to the point
where the LLM is also a good discriminator?
That's the fundamental question I
keep thinking to myself. And maybe the answer is that the LLM is the wrong tool, but I'm
kind of curious to get your thought on that. How do you make a computer an amazing discriminator?
Yeah, this is slightly tricky, right? Because you may not really want it.
A lot of interesting things happen because we have different preferences and different opinions.
And the question might be, how do you train your LLM to do that? Well, then it's just second-guessing you.
If you were to do that, then the obvious answer is,
everybody should have their own LLM.
And that's going to happen.
You already see that happening.
So in different modalities, people are going to do that either end-to-end
or through small changes.
People do that already through prompting.
But you're basically going to have multiple LLMs per person.
Back in the day, people were like,
there's only going to be five computers in the whole world.
Today, it's just happening where people are thinking, oh, there's going to be one LLM
for everybody or a few big providers.
I think that's going to happen. You're going to have multiple LLMs
for you working in tandem
or probably against each other.
But that's a phenomenon where you go ahead and you
train in your beliefs, and
to be a discriminator, you need to understand
what you care about and then drive the
decisions from that. So the question is, how do you do that?
And the biggest challenge will not be, is it possible? But the biggest challenge will be, can you express your
core tenets precisely? Which I think is tricky.
And when you express a core tenet, it might be like the value system for making a decision,
in a way that the LLM can say, oh, okay, I understand the value system
you're imparting to me. And I'm going to use that as a part of my decision-making framework.
And I also love the idea of the future where you have all these multiple LLMs.
I'm curious to get your take on who owns those LLMs.
Would I own them? Would there be an LLM in Calendly?
Would there be an LLM in my email?
I'm also curious, when you think about it, what are the different LLMs?
Are these different companies?
Am I owning them?
How does this work?
I'm curious to get your thought on,
if we're all going to have different LLMs,
what is it going to look like and why?
Okay, so that's a good question.
In my mind, if you think of it from a data perspective,
like a Snowflake DB,
then I see where you're coming from,
and that's a question to ask.
But in my mind, your LLM is going to be like an email.
Every email is secure.
It's private.
People have different levels of trust with their email clients and providers.
You might use Outlook to access a Gmail that's being provided to your G Suite, or you might
be very pedantic and say, I'm going to use ProtonMail, or you may have your own mail server with a key that you need to send your email, which regenerates
every six hours or so. But most people agree that
there's some amount of spam that's in the email, there's a fair amount of security that you give up
anyways, but you're still okay with it because it's mostly a communication medium.
You're going to use your email to log into Chase or J.P. Morgan, but you're not going to
let your email talk to them. Your email is not the one that's holding your money. That's going to be the same
with LLMs. LLMs are now a modality of communication. You're still going to have different tools.
Those tools may use LLMs themselves. That's fine. Your banks may use some, which is fine.
But you basically are going to talk to your tools through an LLM. So the privacy is important,
but at the same time,
it's more or less like access control privacy and less about data at rest sort of privacy.
Again, this is speculative. Well, we love speculative. One thing pops in my mind is,
I think we're talking about a general outcome state, but what's not clear to me is probably,
we do have a bunch of LLMs, which are the Hugging Faces, the TensorFlow Hubs.
We have the models, but those are not the same models maybe that you're describing.
Like what are abstractions you think are great or important here for us to even consider doing more fundamental work or research on?
Do we need a chat interface, almost an email interface, to understand a bunch of LLMs?
I don't know, do you have any particular
thoughts what that might even look like?
I said this before, but I
think chatting is not a great modality
for LLMs. It is for some
things, right? Like, can you plan me
a vacation in X?
That's pretty good, right?
But you already see that GitHub Copilot
is much more preferred as a
means of integrating into your code base than individual LLMs, right? And similarly, a lot of
people who use some of these LLMs as backends for, like, companion apps, you know, prefer to use the
apps themselves, even though you can do everything that you can just using the chat completion API. So
you already see that happening, even for the newest crop of apps that are coming out.
So yeah, I feel like your backend is
almost not going to matter. So think of your LLM as
some sort of computing power. You don't
care if it's an ARM core. I mean,
for performance, you do, but traditionally, you make
it work across your ARM cores,
your x86 cores, etc. And then you have different
modalities that you're using to interact with them.
So I think of LLMs as compute.
You're going to have some LLMs
that are going to be general-purpose compute,
like your x86.
You're going to have accelerators
like your TPUs, GPUs.
There are many more, right?
Like your phone today has an audio encoder,
a video encoder,
and a bunch of different accelerators.
So you're going to have multiple
of these small LLMs.
You don't even talk to them.
Every time you take a photo,
your image processor is at play.
At no point do you benchmark it
or compare it or even complain
about it. So you're going to have a bunch of these small
LLMs that are being orchestrated by
a few big
LLMs, all working in tandem.
That's how I think of it. I like to take the analogy
to computing. I know we're in the spicy
future, but I think this fits.
It's spicy. I'm curious to get your take,
especially after you spent so much time building
Gorilla and building things in the space specifically.
What's the asymptote of the transformer?
How far can we actually scale this architecture?
We can talk about LLMs in broad text,
but when you talk about LLMs,
is this LLMs with transformers,
or is it under the assumption that we have new architectures
that come out to help us solve multimodality,
help us solve the
multitask flow?
Do we need new architectures
for that to happen, or is the transformer going to take us there?
Yeah, and if I
knew the answer, I would be winning a best paper.
You start thinking of alternatives if you feel like
you're hitting a wall, if you're plateauing.
And today, that's not
true. We do hear some murmurs that, oh, look, you're not getting enough high-quality tokens, or, you know, transformers may
not scale. But for all the people who are running the experiments and playing with the models, that has
not been true. You know, scaling laws still hold to the extent that people have tried it. As long as,
right now, the trend is: the more data and compute that you're throwing at it, the more performance you're getting, right?
It's only when it starts plateauing that you start asking those questions.
It might be good foresight to start asking them right now,
except you don't know where the plateau is going to be.
So, you know, it's pretty much an open question.
And I think right now that tokens are getting you far enough
and quote-unquote throwing more compute and data at the problem
seems to be getting you far enough.
We still don't know the recipe.
Even within transformers,
well, today the way people do pre-training
is you have all your different mixture,
you throw it together,
you subsample at least
the open source models,
you subsample some part of it
for every mini batch
and then you train it.
Well, can curriculum learning help?
Some of the other techniques
that we know from reinforcement learning
seem to have helped.
And a lot of these models
suffer from loss spikes.
So can we fix a lot of those?
So I feel like these things are going to give you
much quicker and much better bang for the buck.
In terms of architecture,
there will be more and more research,
but I think most people have today converged
to at least a specific style.
It's different for multimodal.
I still don't think for multimodal,
transformers is the right way to go.
There might be some activity and some action there.
But for language, I think it's pretty converged
to this particular architecture.
And since you mentioned that we're going to have
a general compute, x86, and we're going to have the DSPs
and particular small LLMs that they orchestrate with.
Today, I feel like you really don't know what an LLM can do
until you read its whole history,
how it was trained, what data sets it used, what it actually does.
The capabilities, and what it's especially tuned for, have never really been described
anywhere.
Do you think there's something that needs to exist to actually be able to describe
very clearly, I am capable of doing X, and a much better way to express
it in some metrics or a specific language?
So the general LLMs even have a chance to know exactly how to approach it, almost like an API:
I know this LLM is able to do some things here, and it becomes another interface.
Is that something that's worth diving into as a body of work?
So 100% yes, that's something we should look into.
And people are deriving metrics today.
At least like most people who are using it
may not fully appreciate it,
but they may go for,
oh, I asked the question X,
does it give me answer response X prime
or X double prime?
So that might be what people are looking at.
And it's been shown that,
especially with chatbots,
like a lot of other things matter a lot.
For example,
if you were to show two responses, one before the other,
versus the length of responses, some people like long responses,
some people like short responses.
So that seems to be having a more consequential impact than other things.
But as a practitioner, if you're trying to deploy these LLMs into your end case, then people seem to have come up with other interesting metrics.
For example, at least for us, when we tried the different LLMs,
we started off with the Llama base
and then we looked at Falcon and MPT
because people were asking for open-source models,
and you can actually see how long it takes to converge.
This is not just on wall clock time,
which might be impacted by systems optimizations,
but also in terms of how many epochs it takes, etc.
And these are like interesting signals,
at the very least, that tell you,
okay, here's where the performance of these models is, here's how we can think of them. And I do think this is the right way to go: to
think of your end-to-end performance and see where it fits. Like when someone says x86,
you don't talk about x86 performance in the abstract. Probably yes, in terms of raw FLOPS, et cetera, but you look
at, oh, what is the LINPACK score? What's the dense GEMM, the dense general matrix multiplication score,
for this particular architecture or this particular device? You know, what's the arithmetic intensity? So even for
something like computing, a lot of your metrics are determined by the workloads that you
tend to use. So I feel it's going to be the same, right? If you think of your LLMs as compute, then it
doesn't make sense to have a metric that's just about the LLM itself. You still have the number of
parameters, inference latency, tokens per second, so on and so forth. But most of your evaluation is going to
happen around, here's my workload, here's my downstream task, and here's how well it performs,
or here's how well it fails. Probably that's the right way to do it because, yeah.
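A hedged sketch of what "evaluate on your workload" could look like in practice: run a set of downstream tasks, check each result with a task-specific check, and report success rate alongside latency, rather than relying on model-level metrics alone. The call_model helper and the checks below are hypothetical stand-ins.

```python
# Hedged sketch: evaluate an LLM on a downstream workload (task success rate and
# latency) rather than only model-level metrics. call_model() and the checks are
# hypothetical stand-ins.
import time

def call_model(prompt: str) -> str:
    # Placeholder for the actual model or endpoint under evaluation.
    return "aws s3 sync s3://a s3://b"

workload = [
    ("copy files between two S3 buckets", lambda out: out.startswith("aws s3 sync")),
    ("list my Kubernetes pods",           lambda out: "kubectl get pods" in out),
]

successes, latencies = 0, []
for prompt, check in workload:
    start = time.perf_counter()
    output = call_model(prompt)
    latencies.append(time.perf_counter() - start)
    successes += int(check(output))

print(f"success rate: {successes / len(workload):.0%}")
print(f"mean latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")
```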
Cool. Well, we're going to go into even more difficult land now. If you were to be very speculative,
what's going to happen in the next five to 10 years
in this space that you're looking at?
Do you have any longer term hot take like that?
Yeah, we're going to end up in one of two scenarios, right?
One is like the self-driving scenario.
In 2017, when I was entering my PhD,
the hottest thing you could do was self-driving.
Five years later, there's been like progress,
but not to the extent that we expected, right? In my mind, I was like, okay, in two years self-driving is solved, and I was
kicking myself for not doing vision and perception, because that seemed to be where there's a lot of
activity. But even today, you know, the performance is there, but it's not
exponentially increasing year over year, or not to the extent where you expected it given
the amount of both knowledge
and financial power that was thrown into it.
So LLM might head there, right?
It's like, you know, it's great,
but the last 5% becomes so hard,
you might either never fix it
or it's going to take you way longer to fix it.
Hope that doesn't happen.
On the other side, I think that's a scenario that I enjoy.
It's one where it's going to be so prevalent
that you won't even know the distinction or the difference
that much. You're like, oh, I wanted to get
X done. It used an LLM underneath, but
I didn't even know it used an LLM underneath.
So the technology is most useful and
beneficial when it's transparent, where you don't
even know you're using it. At least that's what
people are heading towards. That's what you tend to see with all
of these agents, and that's what you tend to see with how a lot of people
use it. It's like, you know, when you use a companion
app, you don't know you're talking to an AI.
For terms of service, they're explicitly made to say so, but otherwise you would not know.
Similarly, if NVIDIA says, oh, here's a Docker, here's how you install X or do Y,
and you don't even know what's going on underneath. That's like the ideal scenario, right?
So yeah, the best case scenario is where it's so transparent and prevalent that it almost
goes as an unnoticed side comment.
I think there's a lot of people that sit back and say,
that's a very spicy hot take.
In the VC world, a lot of people are presupposing
that LLMs are a change in how we build products,
or a change in how we interface with one another.
They change everything.
And so your take is like, well, they're there,
but they're under the hood.
I mean, I think my personal hot take is I tend to more agree with you.
It's going to make the things that we already have feel more natural, feel better, and probably work better as a layer on top.
I'm curious, what is something that you think people are saying is the future, but you don't believe in around this space?
We saw a lot of things about what LLMs will do for us.
I'm curious, is there a
specific thing that you've seen maybe
pushed in by the VCs or pushed by
certain areas, just an idea
that you think, hey, I don't believe
in that and this is why?
Yeah, I think more than
the VCs, at least among a few academic
circles, there's been this
campaign where it's like, oh, with
LLMs, with charity AI in general,
also referring to a lot of
the text-to-image stuff, there's going to be a whole
bunch of deepfakes.
This is going to be unprecedented. You can't tell the right
from the wrong, and that's a cause of concern.
In my mind, look, this happened
long ago.
There was a time when every email you got
was from a defense-funded lab, and that changed long ago. Today, when someone sends me an email saying, oh, we are
sending you this email from the Prince of Nigeria, I immediately don't care, right? And right now,
the way you have credibility is by looking at, oh, who sent you this email? Does this make sense to
me? Or does it not make sense to me? Like there is an element of trust, even a text message to
your number, which is personal. Like, humans are very good at, A, placing trust, and B, when it's wrong, you tend to hedge and say,
look, I'm not going to click on this link, etc. There might be some people who still fall for the
bait, but that's not a big thing. It's like, you know, with every email, I don't get anxious or stressed
thinking, did it really come from X or Y? And similarly, there might be an increase
in spam content, so on and so forth, but you don't care.
There's going to be more fake content online, so what?
We have faced this problem before, 20 years ago.
We fixed it, and that's a good start.
I have been kind enough not to call out any of the VC takes,
so you should give me a pat for that.
Okay, we can just call all VCs however we want to.
No issues here at all.
But hey,
this is awesome.
I think we have
a ton of fun stuff
we covered.
Yeah, Tim and Ian,
thank you so much
for having me.
This was fun.
I like the unstructured
discussion part of it.
Super nice.
Thanks so much.
We had a great time.
Where can people find you?
That is the ultimate question.
The common modality is Twitter. It's my first name, last name, underscore,
LinkedIn, and of course, Discord. Gorilla LLM should hopefully get you to the right place.
Amazing. Thank you so much. We really enjoyed having you and I hope you have a great rest of your day. Thank you.