The Infra Pod - What happens when LLMs have a API App store? Let's chat Gorilla and beyond!
Episode Date: October 9, 2023. Ian and Tim are back to interview another YAIGer, Shishir G. Patil, a PhD student at Berkeley who worked on the popular LLM research project Gorilla, which allows LLMs to call APIs automatically. We dive into how Gorilla works and the future implications it brings to the infra / developer space.
Transcript
All right, welcome to our yet another Infra Deep Dive podcast.
As usual, Tim from Essence VC and Ian, let's take it away.
Awesome. I'm Ian, currently helping Snyk turn into a platform.
And I am super excited today to be joined by one of the authors of the Gorilla GPT paper,
Shishir Patil.
Can you please tell us a little about yourself?
Yeah. Hello, Tim. Hello, Ian.
I'm Shishir, and I'm a fifth-year PhD at Berkeley.
So I'm part of the Sky Lab,
which was previously the Rice Lab,
and also the Berkeley AI Research efforts.
And right now, I've done a bunch of work on systems
for ML. And for now, I think the focus has been on how do you teach LLMs to invoke APIs. And this
is what I'll be talking about today. Before this, I spent a couple of years as a research fellow at
Microsoft Research. And even before that, I finished my undergrad back in India.
Amazing. So you've been in this space, in the research for a long time. And now you're hyper-focused on this very interesting overlap,
which is how we take natural language and turn it into an API call.
So a human can program effectively with a sentence, which is pretty incredible.
Can you explain to us what is your Gorilla Paper specifically?
And what does it enable us to do?
So yeah, the high-level idea is, you know,
sometime around November, December last year,
when we started playing with LLMs, like almost everybody else, we realized that
chatting is a great demonstration of the technology of LLMs,
but it can get in the way if you want to get something done.
Like you don't want things to be chatting unnecessarily.
Like, you know, if you're having dinner or lunch, you like chatting.
But if you want to get something done, then you want to get it done ASAP. You don't want to sit and chat,
right? So this was the idea. And once you realize this, it's like, okay, so LLMs are a powerful tool.
And then the utility of a tool increases when more people use the tool. And when tools can
talk to other tools. When you connect tools is when the utility of the tool increases.
So this was the idea, or at least the genesis of the idea.
And we were like, okay, so tools need to talk to each other.
And computer systems, the way tools talk to each other is through APIs.
So that's like the well-defined request-response kind of interface
that you have where different tools talk to each other.
So this was the idea.
Then we were like, okay, so can we now train an LLM
to actually go ahead and invoke API calls? So now the LLM can go ahead and then use different tools and talk to the rest of
the world through this API interface. So Gorilla is an LLM where if a user asks a question in
natural language or defines a task in natural language, it'll pick the right API to call,
which will get the job done for you. So this means not just regurgitating information that it knows, and not even generating creative content,
but actually,
how do you A, read the state of the world?
By world, I mean different services, products,
and then go ahead and do an action
that's going to bring about a change
in the status of the world.
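To make that concrete, here is a minimal, hedged sketch of the kind of interaction being described: a natural-language task goes to a Gorilla-style model served behind an OpenAI-compatible chat completion endpoint (the project exposes one of these, as discussed later in the episode), and the reply is expected to be the API call to run. The base URL and model name below are placeholders, not the project's actual values.

```python
# Hedged sketch: ask a Gorilla-style model, served behind an OpenAI-compatible
# chat completion endpoint, to turn a natural-language task into an API call.
# The base_url and model name are placeholders, not the project's real values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder for a hosted Gorilla-style endpoint
    api_key="EMPTY",                      # many self-hosted endpoints ignore the key
)

task = "Translate this English sentence to German using a pretrained model."

response = client.chat.completions.create(
    model="gorilla-placeholder",          # placeholder model name
    messages=[{"role": "user", "content": task}],
)

# The reply is expected to contain the API call to make (e.g. a Hugging Face
# pipeline invocation) rather than a chatty answer.
print(response.choices[0].message.content)
```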
Interesting.
As I said at the beginning,
it really does allow us to open up programming
as well as to string tools together,
which if you kind of sit back and think about it, a lot of what programming is
is stringing these different little tools,
different little coroutines together.
How does your system
learn about the different tools
in the world? For example, I looked at your demo
and you have this demo on the CLI
that says, hey, I want to get stuff from this
bucket and put it in this other place and it generates
the perfect aws s3 sync
command to basically take stuff from one bucket to another.
How are you teaching the system?
Where does that corpus come from?
And how is it auto-correcting to give you the right thing?
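A hedged sketch of roughly what that CLI flow could look like under the hood: take a natural-language request, get a suggested shell command back from the model, show it to the user, and only run it on confirmation. The suggest_command helper below is a hypothetical stand-in for the model call; the aws s3 sync command mirrors the demo described above.

```python
# Hedged sketch of a confirm-then-execute CLI loop like the demo described above.
# suggest_command() is a hypothetical helper standing in for the model call.
import subprocess

def suggest_command(request: str) -> str:
    # In the real system this would query the LLM; hard-coded here for illustration.
    return "aws s3 sync s3://source-bucket s3://destination-bucket"

request = "copy everything from my source bucket into my destination bucket"
command = suggest_command(request)

print(f"Suggested command:\n  {command}")
if input("Run it? [y/N] ").strip().lower() == "y":
    # A split argv (no shell=True) avoids accidental shell injection.
    subprocess.run(command.split(), check=True)
```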
The short answer is blood, sweat, and tears.
I mean, there's no magic bullet.
You just go ahead and get all the tools' information.
But the important question here is not how do you get it,
but what do you want to focus on?
What APIs do you want to support?
Because they're just a dime a dozen.
And the thing is, getting these APIs is not challenging.
And this is actually pretty unique to what we're doing
compared to what the Midjourney folks or the text-to-image
or the text-to-text people might be doing
is mostly because today there's a lot of lawsuits on,
oh, you use this copyrighted book content,
and hence it's not kosher.
But in the case of APIs,
the incentives are very well aligned, right?
The more people use an API,
the more money you make as AWS.
So AWS itself has an incentive to now
wrap its APIs in new tutorials,
in different modalities: there's, like,
you know, REST APIs,
there's obviously the CLI, Python, so on and so forth.
And not just that, but also it's like,
they give out OpenAPI specs,
like a documentation, tutorials, et cetera,
to make people use it.
So getting this information is actually pretty easy.
The implementation might be tricky,
but in terms of incentives and getting access to it,
it's pretty straightforward.
And then people appreciate you doing it.
So that's straightforward.
The only concern, if I may,
is that there are different modalities; some of the websites may be JavaScript, etc. So then
how do you get this data? If they give you neat JSON Swagger files, that's great. But not
everyone has that neat documentation. So collecting this takes some time. But
once you have this, then you can go ahead and implement it. The answer to your second part of
your question is determining which ones to pick is in some sense a function of what you want to get done. The first
set of APIs we picked, we were trying to write a paper. So we picked HuggingFace, TensorFlow,
TorchHub sort of APIs for the ML community. It's also interesting because these APIs are
pretty challenging. Between a Stable Diffusion V2 API and a Stripe API, it's very diverse.
It's relatively easy for you to
distinguish and figure out what's going on. But say between a Hugging Face Stable Diffusion API
versus a TorchHub Stable Diffusion API, it's pretty challenging. They're not free-form; they have the same
templates. The Hugging Face one may have additional pipelines and some pre-training boilerplate
template code as well. So it's quite challenging. So we picked these as the sort of APIs we support.
But then later on,
once we realized
nobody uses these APIs,
we moved on to like
Kubernetes, AWS, GCP,
all the hyperscalers,
basically Azure
and a lot of like
Linux man pages
and Linux commands.
Sorry.
Yeah.
Amazing.
And I'm curious,
what do you think we need to do
over the next year
or so or six months even to see the advancement to go from the paper you have today and the GitHub repository and the CLI to actually putting this in the hands of people building products?
Where are we at today and how far does this need to go before we can start embedding this sort of task-oriented NLP-to-API stuff in more products?
Right. That's a good question.
Probably the way I would think of this is
into two different branches, right? One
is, can we make this into a product?
And the way I see that today is, for now,
it exists. And I think it's like
we're at least fairly confident by
looking at the usage that there is value today
to use it. So I think now the question
is, how focused can you be
in adding more new APIs? Like right now,
I think a lot of people are asking us for like Salesforce
and ServiceNow and Datadog sort of APIs.
So can we include this for like a broader community
who can use it?
So that's one part of it,
which is, you know, X needs to be done.
Can you get it done?
But there's also a lot of interesting research questions
if you want to grow this.
This is along the lines of, okay,
what if you have three or four different APIs
and you're trying to compose all of them together to get something done?
This could be like, oh, you're
doing a podcast, you ping someone and say, hey, would you
be interested? And if they respond in the positive,
can you send them a Calendly link? Make sure it's
set up, and then send them an invite to make sure
they join on time. And probably also remind them
not to shut their laptop down, like how I just did.
So, yeah, it's like, you know,
these things where can you automate this
sort of pipeline? And this gets tricky.
Like, it's easy if you have the same framework, but if you have different frameworks, right?
Like, oh, here's a JavaScript API to do X, but then you need to change back to like a RESTful API to do Y.
And then, by the way, can you execute this Python command to get C done?
So it's like, if you have this sort of like multi-modality APIs, then how do you do that?
So I think that's pretty challenging and an open problem. And two is,
and this is probably quite relevant as
more and more people start using it, is how do you make it
robust? So what do I mean by that, right?
Suppose you're trying to download a file from the internet.
It fails. You try again. Great.
But maybe sometimes that's not always the right
scenario. So when something fails,
instead of trying again, you might want to
try a different mirror, right? So it's like,
oh, you know, this particular bucket on GCP failed.
Can I use this Oracle Cloud bucket, which seems to have less traffic and so I can get
good network?
Or maybe for some APIs that may not be true.
Like if it's Stripe, if something fails, you have to investigate.
You can't do it again and double charge the user.
That may not be ideal.
So determining what the failure mechanism is in the real world is actually pretty interesting.
And I think it's still an open research question.
These are the three fronts which I think are critical to take this into your daily workflow.
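One way to picture that robustness point: the right reaction to a failure depends on the API. Below is a hedged sketch, with made-up policy names and API identifiers, of how a per-API failure policy might be encoded: retry a plain download, fail over to a different mirror for object storage, and never blindly retry a payment call.

```python
# Hedged sketch: per-API failure policies, since "retry" is not always the right
# reaction. The policy names and API identifiers are illustrative, not from the paper.
from enum import Enum

class OnFailure(Enum):
    RETRY = "retry"              # safe to call again (idempotent reads, downloads)
    FAILOVER = "failover"        # try an alternate mirror or region instead
    INVESTIGATE = "investigate"  # e.g. payments: never blindly retry and double-charge

FAILURE_POLICY = {
    "http.download": OnFailure.RETRY,
    "gcs.get_object": OnFailure.FAILOVER,      # fall back to another bucket/mirror
    "stripe.charges.create": OnFailure.INVESTIGATE,
}

def handle_failure(api_name: str) -> OnFailure:
    # Default to the conservative option when the API is unknown.
    return FAILURE_POLICY.get(api_name, OnFailure.INVESTIGATE)

print(handle_failure("stripe.charges.create"))  # OnFailure.INVESTIGATE
```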
Maybe one thing I'd like to clarify, because when you hear the word API,
the first thing I think of is all these web online APIs you can call,
any Google web service, you know, like the ones you have in your examples, but this could be any SaaS APIs or open APIs. But when you're
looking at the Gorilla paper, I think you pick the ML models you can actually choose with the
right parameters, what we call Hugging Face. And that's a particular API. And to make
that work, I guess, as a paper, so you can have the accuracy and everything, you had to pick almost like a lane to focus on.
I think in the paper,
you kind of already explained why you focus on it,
but maybe just talk about why did you pick that?
And from a research or maybe even for a personal interest,
what's next for you to go to explore?
Are you going to keep going down the ML models
and doing more of that,
or are you going to go to other domains as well?
Yeah, so the answer is very practical.
There's not much science to it.
We were trying to write a paper for a particular machine learning community.
This is not the community that understands Stripe and admin APIs.
Let's pick a set of APIs that people would understand and appreciate.
So that's the single focus that we had is to write a good academic paper for the community.
But in terms of focus,
I feel like we have tried to be less married
to the idea of, like you mentioned,
like the Python Hugging Face or TensorFlow APIs
to actually expand it to also include RESTful APIs
or like, you know, even SQL
or any of the other APIs that exist.
And today we have some collaborations
where we also launched
GraphQL APIs with
Gorilla. So it's like, yeah,
the goal is, can you expand the wide
set of APIs? And we want to do this
slowly to make sure that our techniques and recipes
still hold, right? So one thing
that we do in the paper is, how do you measure hallucination
using abstract syntax tree
subtree matching? Now, you know, it is
clear to us how to do this for machine learning APIs,
but now also the idea
is how do you expand this
beyond machine learning
to like RESTful API calls?
You know,
you just match the header
or just the website
or even the body
and, you know,
all the different parameters.
So that requires some thought,
but yeah,
our plan is to like expand this
to most of the APIs.
But what we don't want to do
is coding.
And this is actually a very subtle difference.
Between coding and APIs,
it's easier and harder in different scenarios.
Like in coding, there's branching,
there's decision variables,
there's also looping.
APIs, there's none of that.
More often than not, there's none of that.
But at the same time, APIs are very brittle.
If you make small mistakes in coding,
as long as it's not syntactic,
your program still runs.
But you have multiple chances to correct it.
API call, you do a wrong call,
you get 404, that call is done.
Then you may go to the next call,
but at least that call is buggy.
Our focus at least has been purely to do APIs
and not do coding.
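The AST subtree-matching idea mentioned above can be pictured with Python's own ast module: parse the generated code, pull out the function it calls, and count the output as hallucinated if that call target does not appear in a reference set of known APIs. This is a simplified, hedged illustration, not the paper's actual evaluation code, and the KNOWN_APIS set here is invented.

```python
# Hedged, simplified illustration of AST-based hallucination checking:
# parse the generated code, extract the dotted name it calls, and see whether
# that call appears in a reference set of known APIs. Not the paper's actual code.
import ast

KNOWN_APIS = {
    "transformers.pipeline",
    "torch.hub.load",
}

def called_name(node: ast.Call) -> str:
    # Flatten a call target like transformers.pipeline into a dotted name.
    parts = []
    target = node.func
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if isinstance(target, ast.Name):
        parts.append(target.id)
    return ".".join(reversed(parts))

def is_hallucinated(generated_code: str) -> bool:
    tree = ast.parse(generated_code)
    calls = [n for n in ast.walk(tree) if isinstance(n, ast.Call)]
    return not any(called_name(c) in KNOWN_APIS for c in calls)

print(is_hallucinated("model = transformers.pipeline('translation_en_to_de')"))  # False
print(is_hallucinated("model = made_up_lib.translate('en', 'de')"))              # True
```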
One of the things this makes me think about is like,
and I heard this quote from several others,
and I've seen this in my own experience,
is when we go to adopt any form
of machine learning, the model tends to be
just this core nugget
of a much broader system needed to
make it all possible. One of the
things I can think of is, on the offline
side, is building up a larger training corpus,
enabling more modalities.
There's a lot of research questions, but
what systems need to exist
for us to truly scale to something like a real rollout?
Both in terms of the API surface area, but also in terms of solving some of the problems you just mentioned.
Which is like, oh, well, the API 404'd, we shouldn't go to the next step.
We need to go back, and maybe there's a learning loop that needs to feed back into the model so we can have the right chain.
So that ultimately, from a customer's or user's perspective, and I use
customer because I sit in the business of selling
software, but they get a
successful outcome. So they actually get the productivity
improvement that we're all trying to sell them, which
is this idea that you mentioned, which is like, hey, we need
to schedule an event and we
need to invite Tim and Ian over to Berkeley
and it just does it.
Obviously, under the hood, there's the lossy
network that's involved.
So mistakes happen,
API calls aren't correct, blah, blah, blah.
So yeah, I'd love to get your perspective.
When you think about,
I know you're focused heavily
on this core nugget of the system,
what else needs to exist?
And where are we at
in terms of that systems engineering today?
Right.
So I think on the LLM-to-API side, at least today, there are a few more things to be done, but it's a matter of time before they get done.
But the one piece of tooling that I think is still a pretty open question is who's going to execute this?
It's like, great, you want to get something done.
Here's a Stripe API to call, but who's going to call it for you?
And as much as you poke around, beyond the exec function from Python, or beyond writing a small bash script that's
running as a microservice and executes it for you,
there's really no platform that executes
the APIs for you. This is tricky
because you might have environment variables.
For example, you're calling the OpenAI chat completion
API. Well, you need to have the OpenAI key
in your environment variable. You're calling AWS,
you need to have the AWS config file set,
so on and so forth.
Who manages all of these
quote unquote secrets?
At least that's what GitHub calls them
in its CI/CD pipeline,
and so on and so forth.
So yeah, how do you take care
of environment variables?
And two, where are you going to execute it?
And these are like unknown questions
even today.
And it's not clear,
especially if you imagine the scenario
where there's one provider
who's executing it for you.
What would that look like? That's one.
And the second one is, what about state? Who's going to maintain state?
So if you say something like, can you get me data from this S3 bucket?
And then if you say, can you please delete that bucket?
What does that refer to? Especially if it's like a long running conversation.
Or if you were to say, hey, can you give me access to this bucket
where you specify exactly what this is,
then who's going to go and check
that do you have access to that bucket or not, right?
So are you going to read the state of the world
by querying AWS?
At which point do you maintain that state yourself locally?
And if so, how do you maintain state?
So fundamentally, along execution:
who's going to execute,
and what's the best modality to execute? And second,
how do you maintain state? And who's going to
maintain state? You can do
naive things and get away with it, but it's
not very scalable. For AWS, you might
maintain state by reading state. For the RESTful
API calls, well, if you know that an
endpoint is failing, you don't call that again.
So then who's going to maintain the
fact that that's failing?
Stuff like that.
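A hedged sketch of the two gaps described above: an execution step that checks the environment variables (the "secrets") a call needs before running it, and a tiny state store so that a follow-up like "delete that bucket" has something to resolve against. The names, structure, and bucket are invented for illustration; the AWS CLI commands are standard ones.

```python
# Hedged sketch of an execution harness: check required secrets before running a
# generated command, and keep minimal conversation state so references like
# "that bucket" can be resolved. Names and structure are invented for illustration.
import os
import subprocess

REQUIRED_SECRETS = {
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "openai": ["OPENAI_API_KEY"],
}

state = {"last_bucket": None}  # minimal state for follow-up references

def execute(provider: str, argv: list[str]) -> None:
    missing = [v for v in REQUIRED_SECRETS.get(provider, []) if v not in os.environ]
    if missing:
        raise RuntimeError(f"Missing secrets for {provider}: {missing}")
    subprocess.run(argv, check=True)

# First turn: "get me data from this S3 bucket"
state["last_bucket"] = "s3://example-bucket"
execute("aws", ["aws", "s3", "ls", state["last_bucket"]])

# Later turn: "delete that bucket" -- resolve "that" from the stored state.
execute("aws", ["aws", "s3", "rb", state["last_bucket"], "--force"])
```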
Right, that makes total sense and fits in with
a broader narrative that Tim and I have been discussing
on the pod, which is
there's all these other systems infrastructure
work going on, which are truly, in many ways,
if we look at Gorilla
as a place we want to get to, the thing
we want to enable, there's all this work that we have to
do still as an industry
around stateful workflow management, around secure code execution,
around blank, blank, blank, blank, blank.
And so AI is really that accelerator.
It really answers the question, well, why?
The why is we want to get over here to this North Star,
this experience that enables a lot of people. In order to get there,
we have all these problems that have to be unbundled.
Amazing. I'm curious, for your perspective, how much does model evaluation time fit into the equation?
And how much does some of the work around edge compute play into the future of experiences like Gorilla?
Is it really important as a key enabler?
Or is it something you say, yeah, it's nice, but we've got bigger fish to fry first?
That's a good question.
And the way I think about this is,
I'll take the edge one first,
is that that's going to be critical, right?
This is not even like,
oh, edge computing is privacy-preserving and hence it's better, or you're in full control,
which is all true.
But even if you were to say,
look, I'm ready to sell my data to the highest bidder.
I don't care about privacy.
How do I get it done?
Even in these scenarios, right?
There'll be some tasks which require freshness of data
and the efficiency and the high reasoning capabilities,
which will still be the realm of the cloud providers.
But there are a few things for which using the cloud
may or may not make sense.
One example is, you know,
if you have like the latest Pixel or even an iPhone,
if you take an image of you at a beach
and there's some people behind you,
there's a magic eraser that can erase people
from your background, right? Well, so that's a value-added service. Nobody makes any
money on it. It's not latency critical. Suppose you're using Instagram and you want to use a
filter that's using some big multimodal model. Whether it takes 2 milliseconds or 20 milliseconds or 200
milliseconds, if you're using an Instagram filter or a TikTok filter, it doesn't really matter much.
So in these scenarios where the value-add is not very critical, and most people are using
it as a free service, would the large providers continue to bankroll it? Unclear, right? So I feel
like in those scenarios, you might say, hey, look, your devices are getting more and more powerful.
Let me just push this model locally for you. And then you can run it. There's also like,
my family is in India, if I want to take a picture of all of us skiing,
that doesn't exist.
So this would require some amount of fine tuning.
Can I probably run some of the small fine tuning steps
locally on my device, keep the model on my device,
and then generate this picture?
In these realms, I think probably Edge is the way to go.
But in domains like, oh, can I book a flight ticket
to Hawaii, and also, by the way, recommend hotels to me
near this convention center, then this requires freshness
of data hitting a bunch of APIs.
This may be more realm of cloud, right?
So I feel like, yeah, that's probably
what I think is a cloud-edge
divide. Since you have
the paper out, you created a Discord
channel, and I'm in that Discord channel, and you've
got hundreds, I think thousands,
of people there just constantly
having a bunch of discussions and asking questions.
And when you have a community, it's always fun.
What are some of the biggest highlights, I guess, for you working with the larger community?
What are some interesting ideas or things that you learn from a community that you think are actually great?
Either use cases or future research ideas?
Yeah, at least all of my projects so far have been open source projects that are out there that people use.
I've been doing this for a while now, and I think this is really helpful.
So for example, when we had Gorilla, one thing we realized was
that not very many people could actually download the models and use them.
And we got this feedback within the first few hours.
So immediately, we put up the
Colab notebook where you're like, oh look, you can just hit our chat completion APIs.
You don't have to give any of your keys, et cetera. So a lot of people use it.
So then we realized that a lot of people were using it, but now they were interested in
integrating this into their workflows. So then we were like, can we expose this OpenAI-equivalent
chat completion API that people can use within their workflows, with
their LangChain or some of the agents that they may be building? And then we realized, oh look, a lot of people were like, I understand this, but I don't want to use APIs a lot. It's
unclear to me how Gorilla will be useful to me. And we're like, that's a good question.
So then we're like, okay, what if we have the CLI tool, which is a complete end-to-end tool.
So you ask a question on the CLI, it shows you a bunch of suggestions of commands,
you execute it,
you get a response.
And then you're like,
ah, I see how this works.
You know what,
now I'm going to do X.
So I feel like community
is great to give you direction
on where people are getting confused
and how best to help that.
But it also gives you
a high-level direction
what to focus on.
Probably if it was just us,
we added TensorFlow,
TorchHub, HuggingFace,
we might have thought of
a couple of extra APIs to add.
But then it was the community
who told us,
hey, can you give us Kubernetes?
That's the bane of every developer.
We all use it.
It's super hard.
Can you help us with that?
So then we added this
and there's been like
a bunch of traction.
And also it's from the Discord,
we learned that,
oh, like a lot of people
want to learn
the stock price of Meta
on April 21st
or, you know,
something like that. So then, okay,
I mean, now support the finance
set of APIs in this different form factor.
So yeah, it's been actually
quite helpful. In terms of research,
most of the community won't tell you exactly, oh, here's
a research question, go solve it.
But if you look at the trends and if you realize
that, okay, so a lot of people do not
have their own LLMs, right? Today, it's
like everybody wants to train their own LoRA adapters to have their own
LLMs, but they want to connect that to some execution engine that can expose APIs.
And you're like, oh, okay, so this is where the research question is.
All right, so we're going to go into the most fun part of this pod, which we call the
spicy future.
Spicy future.
You know what it is, hot takes, spicy takes, whatever it is.
So tell us what you believe will happen,
especially in this space where the LLM is integrating with APIs or tools, right?
Like what's the next few years of the state of the art, right?
And maybe describe maybe a bit of the end state.
And it will be also helpful to tell us
what are the key things you think will unlock this.
Okay, Tim.
So my take on this is that today the way it works
is you as a user want to get something done.
You talk to the LLM.
LLM tells you to do X and then you go and do something.
As an example, if you want to install CUDA,
you say, hey, I have this Ubuntu 20.04,
this particular GPU, can you tell me the CUDA,
the CUDAN and the TensorFlow torch, whatever version, and then you go and install it, right? But you are the bottleneck in this process.
So that shouldn't be true. What I think is going to happen is you as a user are going to ask an
LLM to do X, the LLM is going to execute and then show you the results and you either accept or
reject it. Humans are good discriminators, but not good generators. So you let the generator do its
job, do not get in the way, do not be the slow link. And once it does,
you either do the next step or you figure
out what's going on. Instead of you at the center,
you're now going to have the LLM at the center performing
things and you are just talking to the LLM
and not to the rest of the tools.
And so how do we get away
from that? How do we get to the point
where the LLM is also a good discriminator?
That's the fundamental question I
keep thinking to myself. And maybe the answer is that the LLM is the wrong tool, but I'm
kind of curious to get your thought on that. How do you make a computer an amazing discriminator?
Yeah, this is slightly tricky, right? Because you may not really want it.
A lot of interesting things happen because we have different preferences and different opinions.
And the question might be, how do you train your LLM to do that? Well, then it's just second-guessing you.
If you were to do that, then the obvious answer is,
everybody should have their own LLM.
And that's going to happen.
You already see that happening.
So in different modalities, people are going to do that either end-to-end
or through small changes.
People do that already through prompting.
But you're basically going to have multiple LLMs per person.
Back in the day, people were like,
there's only going to be five computers in the whole world.
Today, it's just happening where people are thinking, oh, there's going to be one LLM
for everybody or a few big providers.
I think that's going to happen. You're going to have multiple LLMs
for you working in tandem
or probably against each other.
But that's a phenomenon where you go ahead and you
train in your beliefs, and
to be a discriminator, you need to understand
what you care about and then drive the
decisions from that. So the question is, how do you do that?
And the biggest challenge will not be, is it possible? But the biggest challenge will be, can you express your
core tenets precisely? Which I think is tricky.
And when you express a core tenet, it might be like the value system for making a decision,
in a way that the LLM can say, oh, okay, I understand the value system
you're imparting to me. And I'm going to use that as a part of my decision-making framework.
And I also love the idea of the future where you have all these multiple LLMs.
I'm curious to get your take on who owns those LLMs.
Would I own them? Would there be an LLM in Calendly?
Would there be an LLM in my email?
I'm also curious, when you think about it, what are the different LLMs?
Are these different companies?
Am I owning them?
How does this work?
I'm curious to get your thought on,
if we're all going to have different LLMs,
what is it going to look like and why?
Okay, so that's a good question.
In my mind, if you think of it from a data perspective,
like a Snowflake DB,
then I see where you're coming from,
and that's a question to ask.
But in my mind, your LLM is going to be like an email.
Every email is secure.
It's private.
People have different levels of trust with their email clients and providers.
You might use Outlook to access a Gmail that's being provided to your G Suite, or you might
be very pedantic and say, I'm going to use ProtonMail, or you may have your own mail server with a key that you need to send your email, which regenerates
every six hours or so. But most people agree that
there's some amount of spam that's in the email, there's a fair amount of security that you give up
anyways, but you're still okay with it because it's mostly a communication medium.
You're going to use your email to log into Chase or J.P. Morgan, but you're not going to
let your email talk to them. Your email is not the one that's holding your money. That's going to be the same
with LLMs. LLMs are now a modality of communication. You're still going to have different tools.
Those tools may use LLMs themselves. That's fine. Your banks may use some, which is fine.
But you basically are going to talk to your tools through an LLM. So the privacy is important,
but at the same time,
it's more or less like access control privacy and less about data at rest sort of privacy.
Again, this is speculative. Well, we love speculative. One thing pops in my mind is,
I think we're talking about a general outcome state, but what's not clear to me is probably,
we do have a bunch of LLMs, which are the Hugging Faces, the TensorFlow Hubs.
We have the models, but those are not the same models maybe that you're describing.
Like what are abstractions you think are great or important here for us to even consider doing more fundamental work or research on?
Do we need a chat interface, almost an email interface, to understand a bunch of LLMs?
I don't know, do you have any particular
thoughts what that might even look like?
I said this before, but I
think chatting is not a great modality
for LLMs. It is for some
things, right? Like, can you plan me
a vacation in X?
That's pretty good, right?
But you already see that GitHub Copilot
is much more preferred as a
means of integrating into your code base than individual LLMs, right? And similarly, a lot of
people who use some of these LLMs as backends for, like, companion apps, you know, prefer to use the
apps themselves, even though you can do everything that you can just using the chat completion API. So
you already see that happening, even for the newest crop of apps that are coming out.
So yeah, I feel like your backend is
almost not going to matter. So think of your LLM as
some sort of computing power. You don't
care if it's an ARM core. I mean,
for performance, you do, but traditionally, you make
it work across your ARM cores,
your x86 cores, etc. And then you have different
modalities that you're using to interact with them.
So I think of LLMs as compute.
You're going to have some LLMs
that are going to be general-purpose compute,
like your x86.
You're going to have accelerators
like your TPUs, GPUs.
There are many more, right?
Like your phone today has an audio encoder,
a video encoder,
and a bunch of different accelerators.
So you're going to have multiple
of these small LLMs.
You don't even talk to them.
Every time you take a photo,
your image processor is at play.
At no point do you benchmark it
or compare it or even complain
about it. So you're going to have a bunch of these small
LLMs that are being orchestrated by
a few big
LLMs, all working in tandem.
That's how I think of it. I like to take the analogy
to computing. I know we're in the spicy
future, but I think this fits.
It's spicy. I'm curious to get your take,
especially after you spent so much time building
Gorilla and building things in the space specifically.
What's the asymptote of the transformer?
How far can we actually scale this architecture?
We can talk about LLMs in broad text,
but when you talk about LLMs,
is this LLMs with transformers,
or is it under the assumption that we have new architectures
that come out to help us solve multimodality,
help us solve the
multitask flow?
Do we need new architectures
for that to happen, or is the transformer going to take us there?
Yeah, and if I
knew the answer, I would be winning a best paper.
You start thinking of alternatives if you feel like
you're hitting a wall, if you're plateauing.
And today, that's not
true. We do hear some murmurs that, oh, look, you're not getting enough high-quality tokens, or, you know, transformers may
not scale. But for all the people who are running the experiments and playing with the models, that has
not been true. You know, scaling laws still hold to the extent that people have tried it. As long as,
right now, the trend is: the more data and compute that you're throwing at it, the more performance you're getting, right?
It's only when it starts plateauing that you start asking those questions.
It might be good foresight to start asking them right now,
except you don't know where the plateau is going to be.
So, you know, it's pretty much an open question.
And I think right now that tokens are getting you far enough
and quote-unquote throwing more compute and data at the problem
seems to be getting you far enough.
We still don't know the recipe.
Even within transformers,
well, today the way people do pre-training
is you have all your different mixture,
you throw it together,
you subsample at least
the open source models,
you subsample some part of it
for every mini batch
and then you train it.
Well, can curriculum learning help?
Some of the other techniques
that we know from reinforcement learning
seem to have helped.
And a lot of these models
suffer from loss spikes.
So can we fix a lot of those?
So I feel like these things are going to give you
much quicker and much better bang for the buck.
In terms of architecture,
there will be more and more research,
but I think most people have today converged
to at least a specific style.
It's different for multimodal.
I still don't think for multimodal,
transformers is the right way to go.
There might be some activity and some action there.
But for language, I think it's pretty converged
to this particular architecture.
And since you mentioned that we're going to have
a general compute, x86, and we're going to have the DSPs
and particular small LLMs that they orchestrate with.
Today, I feel like you really don't know what an LLM can do
until you read its whole history,
how it was trained, what data sets it used, what it actually does.
The capabilities, and what it's especially tuned for, have never really been described
anywhere.
Do you think there's something that needs to exist to actually be able to describe
very clearly, I am capable of doing X, and a much better way to express
it in some metrics or a specific language?
So the general LLMs even have a chance to know exactly how to approach it, almost like an API:
I know this LLM is able to do some things here, and it becomes another interface.
Is that something that's worth diving into as a body of work?
So 100% yes, that's something we should look into.
And people are deriving metrics today.
At least like most people who are using it
may not fully appreciate it,
but they may go for,
oh, I asked the question X,
does it give me answer response X prime
or X double prime?
So that might be what people are looking at.
And it's been shown that,
especially with chatbots,
like a lot of other things matter a lot.
For example,
if you were to show two responses, one before the other,
versus the length of responses, some people like long responses,
some people like short responses.
So that seems to be having a more consequential impact than other things.
But as a practitioner, if you're trying to deploy these LLMs into your end case, then people seem to have come up with other interesting metrics.
For example, at least for us, when we tried the different LLMs,
we started off with the Llama base
and then we looked at Falcon and MPT
because people were asking for open-source models,
and you can actually see how long it takes to converge.
This is not just on wall clock time,
which might be impacted by systems optimizations,
but also in terms of how many epochs it takes, etc.
And these are like interesting signals,
at the very least, that tell you,
okay, here's where the performance of these models is, here's how we can think of them. And I do think this is the right way to go: to
think of your end-to-end performance and see where it fits. Like when someone says x86,
you don't talk about x86 performance in the abstract. Probably yes, in terms of raw FLOPS, et cetera, but you look
at, oh, what is the LINPACK score? What's the dense GEMM, the dense general matrix multiplication score,
for this particular architecture or this particular device? You know, what's the arithmetic intensity? So even for
something like computing, a lot of your metrics are determined by the workloads that you
tend to use. So I feel it's going to be the same, right? If you think of your LLMs as compute, then it
doesn't make sense to have a metric that's just about the LLM itself. You still have the number of
parameters, inference latency, tokens per second, so on and so forth. But most of your evaluation is going to
happen around, here's my workload, here's my downstream task, and here's how well it performs,
or here's how well it fails. Probably that's the right way to do it because, yeah.
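A hedged sketch of what "evaluate on your workload" could look like in practice: run a set of downstream tasks, check each result with a task-specific check, and report success rate alongside latency, rather than relying on model-level metrics alone. The call_model helper and the checks below are hypothetical stand-ins.

```python
# Hedged sketch: evaluate an LLM on a downstream workload (task success rate and
# latency) rather than only model-level metrics. call_model() and the checks are
# hypothetical stand-ins.
import time

def call_model(prompt: str) -> str:
    # Placeholder for the actual model or endpoint under evaluation.
    return "aws s3 sync s3://a s3://b"

workload = [
    ("copy files between two S3 buckets", lambda out: out.startswith("aws s3 sync")),
    ("list my Kubernetes pods",           lambda out: "kubectl get pods" in out),
]

successes, latencies = 0, []
for prompt, check in workload:
    start = time.perf_counter()
    output = call_model(prompt)
    latencies.append(time.perf_counter() - start)
    successes += int(check(output))

print(f"success rate: {successes / len(workload):.0%}")
print(f"mean latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")
```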
Cool. Well, we're going to go into even more difficult land now. If you were to be very speculative,
what's going to happen in the next five to 10 years
in this space that you're looking at?
Do you have any longer term hot take like that?
Yeah, we're going to end up in one of two scenarios, right?
One is like the self-driving scenario.
In 2017, when I was entering my PhD,
the hottest thing you could do was self-driving.
Five years later, there's been like progress,
but not to the extent that we expected, right? In my mind, I was like, okay, in two years self-driving is solved, and I was
kicking myself for not doing vision and perception, because that seemed to be where there's a lot of
activity. But even today, you know, the performance is there, but it's not
exponentially increasing year over year, or not to the extent where you expected it given
the amount of both knowledge
and financial power that was thrown into it.
So LLM might head there, right?
It's like, you know, it's great,
but the last 5% becomes so hard,
you might either never fix it
or it's going to take you way longer to fix it.
Hope that doesn't happen.
On the other side, I think that's a scenario that I enjoy.
It's one where it's going to be so prevalent
that you won't even know the distinction or the difference
that much. You're like, oh, I wanted to get
X done. It used an LLM underneath, but
I didn't even know it used an LLM underneath.
So the technology is most useful and
beneficial when it's transparent, where you don't
even know you're using it. At least that's what
people are heading towards. That's what you tend to see with all
of these agents, and that's what you tend to see with how a lot of people
use it. It's like, you know, when you use a companion
app, you don't know you're talking to an AI.
For terms of service, they're explicitly made to say so, but otherwise you would not know.
Similarly, if NVIDIA says, oh, here's a Docker, here's how you install X or do Y,
and you don't even know what's going on underneath. That's like the ideal scenario, right?
So yeah, the best case scenario is where it's so transparent and prevalent that it almost
goes as an unnoticed side comment.
I think there's a lot of people that sit back and say,
that's a very spicy hot take.
In the VC world, a lot of people are presupposing
that LLMs are a change in how we build products,
or a change in how we interface with one another.
They change everything.
And so your take is like, well, they're there,
but they're under the hood.
I mean, I think my personal hot take is I tend to more agree with you.
It's going to make the things that we already have feel more natural, feel better, and probably work better as a layer on top.
I'm curious, what is something that you think people are saying is the future, but you don't believe in around this space?
We saw a lot of things about what LLMs will do for us.
I'm curious, is there a
specific thing that you've seen maybe
pushed in by the VCs or pushed by
certain areas, just an idea
that you think, hey, I don't believe
in that and this is why?
Yeah, I think more than
the VCs, at least among a few academic
circles, there's been this
campaign where it's like, oh, with
LLMs, with charity AI in general,
also referring to a lot of
the text-to-image stuff, there's going to be a whole
bunch of deepfakes.
This is going to be unprecedented. You can't tell the right
from the wrong, and that's a cause of concern.
In my mind, look, this happened
long ago.
There was a time when every email you got
was from a defense-funded lab, and that changed long ago. Today, when someone sends me an email saying, oh, we are
sending you this email from the Prince of Nigeria, I immediately don't care, right? And right now,
the way you have credibility is by looking at, oh, who sent you this email? Does this make sense to
me? Or does it not make sense to me? Like there is an element of trust, even a text message to
your number, which is personal. Like, humans are very good at, A, placing trust, and B, when it's wrong, you tend to hedge and say,
look, I'm not going to click on this link, etc. There might be some people who still fall for the
bait, but that's not a big thing. It's like, you know, with every email, I don't get anxious or stressed
thinking, did it really come from X or Y? And similarly, there might be an increase
in spam content, so on and so forth, but you don't care.
There's going to be more fake content online, so what?
We have faced this problem before, 20 years ago.
We fixed it, and that's a good start.
I have been kind enough not to call out any of the VC takes,
so you should give me a pat for that.
Okay, we can just call all VCs however we want to.
No issues here at all.
But hey,
this is awesome.
I think we have
a ton of fun stuff
we covered.
Yeah, Tim and Ian,
thank you so much
for having me.
This was fun.
I like the unstructured
discussion part of it.
Super nice.
Thanks so much.
We had a great time.
Where can people find you?
That is the ultimate question.
The common modality is Twitter. It's my first name, last name, underscore,
LinkedIn, and of course, Discord. Gorilla LLM should hopefully get you to the right place.
Amazing. Thank you so much. We really enjoyed having you and I hope you have a great rest of your day. Thank you.