Software Huddle - Deep Dive into Inference Optimization for LLMs with Philip Kiely

Episode Date: November 5, 2024

Today we have Philip Kiely from Baseten on the show. Baseten is a Series B startup focused on providing infrastructure for AI workloads. We go deep on Inference Optimization. We cover choosing a model, discuss the hype around Compound AI, choosing an Inference Engine, and Optimization Techniques like Quantization and Speculative Decoding, all the way down to your GPU choice.

Transcript
Starting point is 00:00:00 I think one of the challenges, especially people kind of jumping into this space right now, is there's just so many models out there. You know, there's closed, there's open source, big, little. You know, when starting some kind of new project, like, how do you even pick? Like, where do you start? I generally like to recommend that people start with the largest, most capable model when they're in the experimentation phase. Unless there's some constraint that means you can't do that, like if you're building exclusively for edge inference on an iPhone or something. If you're starting with the most capable model,
Starting point is 00:00:33 that eliminates one constraint during this experimentation and prototyping process, which is, oh, is my model not good enough? When does it make sense for somebody to fine-tune a model? So I saw there's this great tweet that goes around a lot, no GPUs before PMF, which is the idea that you shouldn't be making these very capital-intensive investments in training and fine-tuning
ahead of figuring out what you actually want to build. So in terms of inference, what is the expensive part of inference? Why is that particularly a slow operation? Hey, everyone. Thanks for joining. Today, we have Philip Kiely from Baseten on the show. We go deep on inference optimization. We cover choosing a model,
discuss the hype around compound AI, choosing an inference engine, optimization techniques like quantization and speculative decoding, all the way down to your GPU choice. There's a lot of meat on the bone in this one, so I think you'll enjoy it. As always, if you have questions or suggestions, feel free to reach out to me or Alex. And with that said, let's get you over to my conversation with Philip. Philip, welcome to Software Huddle. Thanks, Sean. I'm super glad to be here today. Yeah, thanks. Thanks for being here and glad you reached out. So to start, can you give a little bit of background about yourself and the work that you do at Baseten? How did you
end up working at an AI inference company? And what was your background in AI prior to that? I joined Baseten two years, nine months ago. So I've been there in startup years basically forever. Yeah, yeah. That's like, that's at least a decade in real life. Yeah, exactly. So I joined as a technical writer, and have since moved into more developer relations type stuff. But you know, it's a 50-person Series B startup, so titles are titles, roles are roles. It's really just about doing what you can do to drive what needs to happen every single day. Anyway, Baseten itself is a model inference startup. So we're an AI infrastructure company that focuses on inference. We run open-source, fine-tuned, and custom models
with really high performance on GPUs that can be distributed across multiple clouds: your cloud, our cloud, some combination of the two. And when I joined, it was more of a data science-facing tool. Back then, this was before GPT-3.5 was out. This was before ChatGPT, before Whisper, before DALL-E and Stable Diffusion.
So it was a somewhat different market and audience. And I joined with very little background in AI or ML. You know, I got like a B- in statistics in college. But I was able to learn on the job. And the cool part about my job is I basically just get to talk to all of our engineers, find out the cool stuff that they're working on, and then write it down and have it make sense. And in order for that to happen, it needs to make sense to me first. So basically, my job is to learn really interesting things and then share them with the world.
Yeah, I think, you know, I always found that a way, at least for myself, to really internalize something and learn it is to force yourself to try to teach it to somebody, because you have to understand it well enough that you can teach another person. And writing is, you know, one form of teaching. And then of course, actually teaching, or public speaking, or whatever it is, becomes a forcing function for you to learn different things. And I even feel like I use the podcast medium for that, as it becomes a forcing function for me to learn about different topics where I may not necessarily have deep expertise. Absolutely. You know, I can't go up on stage at a conference or meetup or something and just wing it.
Starting point is 00:04:28 So there's a lot of learning that goes into every single one of those things. That's where that forcing function comes in, like you said. Yeah, and it has to feel authentic too. And I think that's where, you know, based on my own experience, sometimes in very technical products,
Starting point is 00:04:42 selling to technical people, salespeople can kind of struggle where even if they have a script and they have, you know, they've been trained and they have a first call deck, it's like they're speaking a foreign language in some ways
Starting point is 00:04:54 because it's just never going to be sort of at the comfort level as somebody who's like actually worked in that space as an engineer and can kind of deliver the same material. Absolutely.
Starting point is 00:05:03 That said, I have worked very closely with our salespeople to help train them and do sales enablement in some of this technical material. And I've been really impressed by how they've grasped the complexities of stuff like the different GPUs and how they have various capabilities, the different model inference frameworks, the model performance optimizations, the different modalities. There's a ton to learn. So I've
learned it and it's been cool helping them learn it as well. Yeah, I didn't mean to dox all salespeople. Yeah, absolutely. I think there's some great salespeople out there that are going to be able to grok this stuff. But going back to Baseten, why focus on inference alone? Why be essentially an AI inference infrastructure company? That's a great question. And it's something that we kind of found our way into. Because originally, when I joined, the platform was much broader.
Starting point is 00:06:04 And we've experimented with other things. We did a whole fine-tuning platform maybe a year and a half, almost two years ago at this point as an experiment. And what we found was two things. Number one, we're a pretty small company. You know, at this point we're 50 people. We need to focus on being the absolute best at one thing and own the right to enter other markets in this AI space.
And the second thing is that inference is really where a lot of the value is. You know, you can have the best model, or you can have some insight into how to build a new product around existing AI capabilities. But if you can't run your models in production, if they're not secure, if they're not reliable, if they're not performant enough that you're going to hit your latency targets, if they can't handle throughput, spikes in traffic, top of Product Hunt, top of Hacker News, launch days, then the rest of it kind of doesn't matter. So we really see inference as the centerpiece of this AI infrastructure space around which everything revolves. And we want to go become really, really good at that one central thing, and then over time add on more capabilities.
I mean, maybe a rough analogy to that is traditional data warehousing, where Snowflake came along and separated compute and storage. So then you have the query engine, which is the key thing you need in analytics in order to be performant, and then you need large storage. And what you guys are doing, and other companies as well, is sort of decomposing this AI stack. So inference is sort of the compute engine that's important for generative AI, and you're separating that from other parts of it. And then you can focus on, let's make this thing really efficient. Absolutely. And you do need to do more than just get a GPU and put the model on it.
With inference, we've been able to learn a lot about, for example, observability, and how those metrics and decisions can differ when you're working with an AI model within your product rather than just some ordinary API. So the observability stack looks a little different. The logging looks a little different. And the scaling looks a lot different, actually. The sort of traffic-based auto-scaling that you have to build around that is, you know, something that was surprising to us when we came across the correct answer, because it was not something you might intuit from previous experience building
Starting point is 00:08:56 in other domains. Mm-hmm. Yeah. I kind of feel like when it comes to a lot of this stuff, you need like a certain amount of like exposure to it, to building up sort of the intuition that comes along with it, to understand like what is the right, you know, lever to sort of pull on or thread to pull on or thing to do in order to get the performance that you need,
Starting point is 00:09:22 whether that is like from an accuracy standpoint or even from like a compute standpoint is you need to build up essentially enough knowledge that some of that stuff becomes innate. And I think one of the challenges right now is that a lot of people, basically this technology hasn't been around long enough for that many people to kind of have their 10,000 hours of building these types of systems.
Starting point is 00:09:44 Absolutely. That's something we've seen even in recent months. For example, you're familiar with the Whisper model for audio transcription. Yeah. So it's basically, at the architectural level, a large language model. It has autoregressive inference. It creates tokens. So with that familiarity with the model,
recently we realized, hey, this is basically an LLM. What if we just treated it as an LLM for model performance, and used all the same optimizations that we do, like TensorRT-LLM, for serving Whisper? And it turns out it works great. So like you said, it takes a long time to get sort of familiar and comfortable with this space and be able to start making these connections. But once you do, you get to build and discover really cool things.
Starting point is 00:10:36 Yeah. It's kind of that like building up that general knowledge from like a first principle standpoint. So that then once you've built that expertise, you can start to draw lines to other things that are connected to it or something that maybe from a surface level looks like two different things, but is actually, when you have the expertise, you're able to see, actually, this problem is kind of the same as this other problem. We can solve it this way. I love when that happens, when I run into a new problem and then I say, actually, wait,
Starting point is 00:11:06 you're an old problem wearing a Scooby-Doo mask. You just take off the mask and hit it with the tried and true solution. Yeah, exactly. So I want to discuss a bunch of AI-related topics with you. And you're starting with model selection. I think one of the challenges, especially people kind of jumping into this space right now, is there's just so many models out there. There's closed, there's open source, big, little. When starting some kind of new project, how do you even pick? Where do you start? I generally like to recommend that people start with the largest, most capable model when they're in the experimentation phase.
Starting point is 00:11:46 Unless there's some constraint that means you can't do that, like if you're building exclusively for edge inference on an iPhone or something. If you're starting with the most capable model, that eliminates one constraint during this experimentation and prototyping process, which is, oh, is my model not good enough? Because you're using the best model. So either the model is capable, or if it's not capable, you need to either go do some custom fine-tuning or training work to make it capable, which is a whole other discussion.
Starting point is 00:12:21 Or the exact thing you're trying to build isn't quite ready yet, so either wait for the next iteration of the model or figure out a way around those limitations. But anyway, if you start with the biggest, most powerful model, you're able to eliminate that entire category of questioning from your experimentation. You're able to figure out the core flows of your product.
Starting point is 00:12:46 You're able to figure out the capabilities that you're looking for, the inputs and outputs that you might expect. And then once you're done with that, then you have all of those other things locked in. Then it starts to make sense to check out different models. Because the thing about model selection is you really need strong evaluation criteria. You can't really just work on vibes and say, hey, this model just feels better. At least you can't at scale.
Starting point is 00:13:16 So when you're able to lock those other things in and build a strong set of criteria. And when I say evals, I don't mean just like building an evals benchmark in the same way as MMLU or any of the other bunch of benchmarks that every new model gets run against. I mean, figuring out for your specific product, how you can tell if it's working and how you can tell if it isn't, and then running different models through that. Then you can start trying smaller models to see like, okay, can I save substantial amounts of money on inference by using a smaller model here? You can also try composing multiple models together. Say you've been using a very large sort of everything model, I like to call them, where
have, say, a multimodal model that can do vision and language and maybe audio and speech and everything. Well, maybe you can decompose that, have small, targeted, much cheaper to operate models that handle each of those modalities, and then compose those models together. You have a lot of options for optimization after that initial prototyping phase. But when you're just playing around, just play around with the best model. Yeah, I mean, I think it's kind of about trying to reduce variance, essentially. There's a lot of things that can go wrong, so let's take the quality of the model off the table. It's kind of like, if you're running an experiment in a company that requires a person to execute something, it's much better to put probably your best person on that thing. So that way, if it fails, you know it wasn't the individual that led to the failure.
Whereas if you put somebody who's maybe not so good at that, then it could be that person was just bad at executing whatever that project happened to be. In terms of figuring out whether things are working or not, what are the strategies there? How do I know whether something is working well or not? And how do I create essentially my own benchmarking mechanism for that, so that when I do go and try a smaller model or a different model, I can test whether I'm reaching the performance I need or was able to reach before? So there's a couple different approaches here. I think the strongest approach is to use an evaluation framework or evaluation library,
build a set of test inputs, build a set of expected outputs. There's a sort of LLM-as-judge pattern that can be very effective here for making sure that you're able to run through large numbers of checks and then have a large language model, perhaps the original model that you were using, judge the quality of those outputs at scale. But it's a tough problem, and there's a reason that there's so many teams and companies working on evals, because it's absolutely critical, but it also can be very dependent on your product. One thing we work a lot with is transcription models. And so we're always looking at word error rate. But you have to drill a little deeper into that. So is it word error rate in general? Is it consistently messing up names? Is it consistently messing up the titles of medicines? There's certain errors that are more impactful than others. So it's a combination of a little bit on the technical side, trying to set up the framework for evaluation, but mostly it's a domain understanding problem where you have to be really clear about what you're trying to accomplish with your system.
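To make that LLM-as-judge idea concrete, here is a minimal sketch of what such an eval loop might look like. It is not Baseten's tooling; the `call_llm` helper, the model names, the example case, and the rubric are all stand-ins for whatever client, models, and domain checks you actually use.

```python
# Minimal LLM-as-judge eval sketch (illustrative only, not Baseten tooling).
# `call_llm(model, prompt) -> str` is whatever client you already use; the
# model names, the eval case, and the rubric are hypothetical placeholders.
from typing import Callable, Dict, List

EVAL_CASES: List[Dict[str, str]] = [
    {"input": "Transcribe: 'take 20 milligrams of atorvastatin daily'",
     "expected": "Take 20 milligrams of atorvastatin daily."},
    # ...build these from your own product's real traffic
]

JUDGE_PROMPT = (
    "Reference output:\n{expected}\n\nCandidate output:\n{candidate}\n\n"
    "Reply with only 1 if the candidate preserves every name, number, and "
    "medication title from the reference, otherwise reply with only 0."
)

def run_evals(call_llm: Callable[[str, str], str],
              candidate_model: str, judge_model: str) -> float:
    scores = []
    for case in EVAL_CASES:
        candidate = call_llm(candidate_model, case["input"])
        verdict = call_llm(judge_model, JUDGE_PROMPT.format(
            expected=case["expected"], candidate=candidate))
        scores.append(1.0 if verdict.strip().startswith("1") else 0.0)
    return sum(scores) / len(scores)  # fraction of checks the candidate passed
```

You would run the same `run_evals` pass against the big model you prototyped with and against any smaller or cheaper candidate, and compare the scores before switching.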
In terms of cost savings, you know, if I decide, okay, I want to save some money, I'm going to go and use an open-source model, then I need to run it. I've got to get some GPUs. I've got to run my own cluster and stuff like that. Then I've got to manage that thing, handle updates and stuff. Am I actually going to save any money that way? By taking on that work, is it really an apples-to-apples comparison there? Or is that a false way to think about potentially saving costs? That's a great question. It's really total cost of ownership, right? So I think about it less as going from closed models to open models and more about going from shared inference endpoints to dedicated deployments. Because you can get a shared per-token endpoint of Llama 3.1 or, you know, Mistral; you can get shared inference endpoints for image generation, for Whisper,
for basically all of the big-name open-source models. And then, just like when you're using ChatGPT or the OpenAI GPT-4 API, or when you're using Anthropic's Claude APIs, you're still paying per token. You're just hitting an endpoint. It's all in the same format.
Starting point is 00:18:13 And that is really great in the early stages. If you're consuming a few thousand, even a couple hundred thousand tokens a day, and you're paying per million tokens, you're probably doing great on those shared inference endpoints. And the additional cost of having your own GPUs is not going to make sense. But as you scale up, then the economies of scale start to work in your favor. So I generally tell people to switch to a dedicated deployment once they have enough traffic to consistently saturate a single GPU of whatever type makes sense for their model. Because then you're starting to buy your tokens in bulk. But you also get a bunch of other benefits that can make sense, including making
sense at somewhat smaller scales. The number one thing, obviously, is privacy. If you are the one running the GPU, then your inputs and outputs can't be, you know, read by OpenAI or anyone else. There's also the customization aspect, where not only are you able to control the model, in that you can do fine-tuned LoRAs, you can do an entirely custom model, all that sort of stuff. But you can also control things like setting the batch size to trade off between latency and throughput, to get your exact cost characteristics while still hitting your latency SLAs. So there's a bunch of, you know, aspects of control, privacy, reliability, not having noisy neighbors.
Someone hits the endpoint with a million requests and now all your stuff is slow. All that stuff just doesn't happen when you have dedicated deployments. So they can have a better total cost of ownership story once you reach scale, or if you have certain regulatory or compliance things or just privacy concerns. And in terms of saturating a GPU, what's that look like? Yeah. So I'm not saying you need 100% utilization of a GPU at all times, but generally what I'm looking for is a sort of high-volume traffic pattern. It doesn't have to be super consistent and predictable; scaling GPUs up and down, while difficult, is pretty doable, and you can get pretty fast cold starts, denominated in seconds in many cases, for these GPUs. It's really more about just having enough traffic so that you're running batches of inference. Generally, you're running multiple requests at the same time, because if I have a whole H100, or in this case two or four H100s, just for me to play with the new Nemotron model, then that's a big waste of compute resources. But if I'm sending through some pretty consistent, stable traffic, then I'm actually getting the benefits of buying my tokens in bulk.
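To put rough numbers on that break-even point, here is the back-of-envelope arithmetic. Every price and throughput figure below is made up for illustration; it is not Baseten's pricing or anyone else's.

```python
# Back-of-envelope comparison of per-token pricing vs. a dedicated GPU.
# All numbers below are hypothetical placeholders; plug in your own quotes.

shared_price_per_million = 1.00     # $ per 1M tokens on a shared endpoint
dedicated_gpu_per_hour = 3.00       # $ per hour for a dedicated GPU
sustained_tokens_per_second = 1500  # throughput you can hold with batching

tokens_per_hour = sustained_tokens_per_second * 3600
dedicated_price_per_million = dedicated_gpu_per_hour / (tokens_per_hour / 1e6)
print(f"Dedicated, fully saturated: ${dedicated_price_per_million:.2f} per 1M tokens")
print(f"Shared endpoint:            ${shared_price_per_million:.2f} per 1M tokens")

# If the GPU sits idle part of the hour you still pay for it, so the
# dedicated option only wins above this utilization level:
break_even_utilization = dedicated_price_per_million / shared_price_per_million
print(f"Break-even utilization: ~{break_even_utilization:.0%}")
```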
Okay. All right. So we've reached the point where we've got our model, and we've figured out whether we're going to rely on a shared inference endpoint or essentially run this myself. When does it make sense for somebody to fine-tune a model? So I saw there's this great tweet that goes around a lot: no GPUs before PMF, which is the idea that, you know, you shouldn't be making these very capital-intensive investments in training and fine-tuning
ahead of figuring out what you actually want to build. And, of course, there are exceptions. If you're a foundation model lab, obviously, you need the GPUs to get the PMF. But if you're just building a product, I do think that fine-tuning is, in many cases, kind of the last thing you should reach for. It has a ton of useful applications, and there's a bunch of places where it makes sense to use it. But first, you know, obviously prompting, and then patterns like retrieval-augmented generation if you want to feed in real-time data. But overall, when I think about fine-tuning, you know, it's less about changing the information in the model. That's really what RAG has sorted out. It's more about changing the behavior of the model.
So you can fine-tune a model to add, say, function calling if it doesn't support it out of the box. You can fine-tune a model to do better at math or code or some specific domain. We work with Vito, who went with training from scratch over fine-tuning, but makes these domain-specific models for stuff like medicine and finance, and that, you know, can also be achieved in some cases through fine-tuning. But it's really, again, more about changing the behavior of the model versus changing the knowledge of the model, which is where fine-tuning starts to be super useful. So you mentioned earlier this idea of stringing multiple models together, this concept of compound AI. And that's something I've heard a lot of buzz around this year. There was a paper from Berkeley's AI research group, I think earlier this year, that talked about this concept.
So I wanted to talk a little bit about that. Can you explain what is compound AI? Why are people excited about it? Yeah. So compound AI is a little bit of a buzzword, but there's a lot of substance there, in that you want to basically introduce new capabilities to a model. We were just talking about fine-tuning as one way to do that. But as a very simple example, let's just talk about math. LLMs can't add two numbers together with any degree of reliability. That's just not what they're meant for. So you can use function calling, you can use tool use, to send a bunch of mathematical functions to the LLM with an explanation of what each one does, along with its parameters. It'll select a function, send it back. You can run that math in, you know, your Python code interpreter, send back the answer, and ask the LLM for an explanation. Well, in this super simple, trivial, contrived example, we've made two calls to the LLM, and we've done one piece of business logic in between. That's an example of multi-step inference.
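A minimal sketch of that two-call, one-business-logic-step flow, assuming a generic `call_llm` helper and a toy tool schema rather than any particular provider's function-calling API:

```python
# Sketch of the multi-step math example above (illustrative only).
# `call_llm(prompt) -> str` is a placeholder for your own client; the tool
# schema is a generic toy, not any specific provider's function-calling API.
import json
from typing import Callable

TOOLS = {
    "add": {"description": "Add two numbers", "fn": lambda a, b: a + b},
    "multiply": {"description": "Multiply two numbers", "fn": lambda a, b: a * b},
}

def answer_math_question(question: str, call_llm: Callable[[str], str]) -> str:
    # LLM call 1: describe the tools and ask the model to pick one.
    menu = {name: tool["description"] for name, tool in TOOLS.items()}
    selection = call_llm(
        f"Question: {question}\n"
        f"Tools: {json.dumps(menu)}\n"
        'Reply with JSON only: {"tool": "<name>", "args": [a, b]}'
    )
    choice = json.loads(selection)

    # Business logic in between: do the math in plain Python, not in the LLM.
    result = TOOLS[choice["tool"]]["fn"](*choice["args"])

    # LLM call 2: hand the exact result back for a natural-language answer.
    return call_llm(f"The answer to '{question}' is {result}. Explain briefly.")
```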
There's also multi-model inference. For example, if you're doing AI phone calling, Bland is a great example of this. You might have seen their billboards around, where it's like, why are you still hiring humans? They have this AI phone calling platform where you call it up and it can do customer support, it can do ordering a pizza, that kind of stuff. And if you think about it, doing a real-time phone call with AI is three different parts. You need to take what the person on the phone is saying and turn it into words written down, so text. You need to respond to that text using a large language model. And then you need to synthesize that text into speech. So you could accomplish this with a single omni-type model, but you could generally get much better performance and cost characteristics if you actually compose three models together: one for transcription, one for the LLM, and one for speech synthesis. So either of these can be examples of compound AI. Compound AI is a sort of framework of thought, and I still haven't come up with exactly the right noun that I want to use, for composing multiple models, multiple steps, or both into a single chain of inference with, you know, conditional execution logic, business logic, all that stuff baked in.
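Structurally, that phone-call example is just three model calls composed with a little glue. A stripped-down sketch, with the three models as injected placeholders for your own deployed endpoints; a real-time system would stream between stages rather than running them strictly one after another:

```python
# Bare-bones sketch of the three-model voice pipeline described above.
from typing import Callable

def handle_phone_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],   # speech-to-text (Whisper-style model)
    respond: Callable[[str], str],        # large language model
    synthesize: Callable[[str], bytes],   # text-to-speech model
) -> bytes:
    text = transcribe(audio_in)       # 1. caller's audio becomes text
    reply = respond(text)             # 2. text goes to the LLM for a response
    return synthesize(reply)          # 3. response becomes audio to play back
```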
Yeah, and sometimes, correct me if I'm wrong, it's not always necessarily chaining Gen AI models together; it could actually be a Gen AI model with traditional ML also plugged in. Like, I think AlphaCode does something like that, where they use an LLM to generate like a million potential programming solutions to these programming problems, and then they use a second model, I don't think it's an LLM-based model, but they use something else, essentially to filter those down to get to the solution that they believe to be the most correct. Absolutely. I think if you want to look at it at the very highest level, compound AI is
Starting point is 00:26:51 about adding specialization back into generative AI. You have these highly capable general models, like an LLM can do a pretty good job at almost anything. But if you're building these, you know, production-ready AI systems that are going to need to be fast, reliable, accurate, then you kind of need to add specialization back in to have an edge. And it's really exciting to work on because now we're building, you know, real applications out of AI, sort of AI-native applications, rather than just wrappers around a single capable model. Yeah. It's kind of interesting, this evolution, because historically, before we had these kind of Gen AI models, where ML really excelled was something that was a really specific task.
Starting point is 00:27:42 You could essentially tune something to do fraud detection, spam detection, these types of things, classic classification, and it could perform really well at that. And then the incredible thing that blew everybody away with large language models was suddenly at this super general thing that you could talk to, it would answer whatever, it would generate some sort of response. But in order to actually, I think, serve a lot of practical use cases, you need sort of both approaches because the general model is just going to be like, if it doesn't have a good answer, it might hallucinate something or like the quality is going to be not at the level that you need.
Starting point is 00:28:21 So then you need to do something to essentially make it more specialized. And it seems like at the moment where we're seeing the best performance is sort of these like hybrid systems or this compound AI approach that brings both of these worlds together of the like specialization and the generalization. Exactly.
You can almost think of the general models as a sort of front end or user interface for these more specialized tools, and that extends to non-model resources like data storage and code execution, API calls. And that general language model is just the interface that you're using to interact with all of these things in a new way. Yeah. So I think combining models makes a lot of sense, but what about the performance? Already, you know, talking to a model can be relatively slow for a lot of applications. And now we're talking about chaining multiple models together, or even, you know, there are large models that sort of compete against each other.
Some of them are better at answering certain types of questions than others. Probably the best-case solution would be you'd put in a prompt and you would fan out to all the models, they'd all answer, and then you'd have another model pick the best answer or something like that. But that would be really expensive operationally and also expensive from probably a cost perspective. So how do you get the performance out of these systems that can actually serve these use cases? Yeah. So, you know, on the quality side, what you're talking about is sort of model routing. And there's a bunch of great teams working on that. I can shout out my friends at Unify as an example of people who are working on that, where if you have a simple query, maybe it gets sent to a small model; if you have a more complex one, it gets sent to a large model.
But the other half of performance is the latency side, right, and the cost. The difficulty of building with compound AI is that you have to iron out the performance bugs, and the 10 milliseconds of network latency here, and the cold start time there, and the queue backing up at this stage in your graph. That's where having really strong model performance tooling and really strong distributed infrastructure tooling has to come together to make these systems possible. So in terms of inference, what is the expensive part of inference? Why is that particularly a slow operation? Yeah, it depends. For the sake of keeping this podcast under five hours, I'll talk a little bit just about large language models and other similar autoregressive models.
We talked about Whisper before. It has very similar performance characteristics. So there's kind of two phases of LLM inference. There's the pre-fill phase and there's the token generation phase. The pre-fill phase is when the model gets the prompt and needs to come up with the first token, and you're kind of parsing every token within the prompt. This part is GPU compute bound. So in general, the teraflops that your GPU is capable of is going to limit the speed of pre-fill.
And then there's the autoregressive phase, the token generation phase, where the whole previous input plus everything that's been generated gets passed through the model over and over and over again. And this phase is actually going to be bound by GPU memory bandwidth. So for example, an H100 GPU has 3.35 terabytes per second of GPU memory bandwidth. And that is the limiting factor in many cases on how quickly it can execute a model like Llama 405B, where all of those weights have to be loaded through memory over and over again. So again, it depends, right? If you have a case where you're sending a very long prompt and expecting a very short answer, you could potentially be bound on GPU compute and have a sort of pre-fill, time-to-first-token problem. Right. Or you could have a more traditional multi-turn chat, for example, where the LLM is going to be doing a lot of generation, and then you have a memory bandwidth problem.
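As a rough sanity check on why that bandwidth number is the one that matters during generation, here is the back-of-envelope version of the bound. These are illustrative numbers, not measured throughput, and real systems come in well under this ceiling.

```python
# Rough ceiling on single-sequence decode speed from memory bandwidth alone.
# Illustrative arithmetic, not a benchmark; batching changes the economics.

params = 405e9                    # Llama 405B parameters
bytes_per_param = 2               # FP16 weights
weight_bytes = params * bytes_per_param   # roughly 810 GB of weights

hbm_bandwidth_per_gpu = 3.35e12   # bytes/sec of memory bandwidth on an H100
num_gpus = 8                      # assume an 8xH100 node holding the model

# Each generated token has to stream the weights through memory once, so:
ceiling = (hbm_bandwidth_per_gpu * num_gpus) / weight_bytes
print(f"~{ceiling:.0f} tokens/sec upper bound per sequence")  # roughly 33
```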
I'd say more often than not, we're running into the memory bandwidth limit on token generation rather than the compute limit on pre-fill, but definitely both make an appearance in just about every performance optimization puzzle. And what are some of the techniques to try to optimize the inference cycle? Absolutely. So the range of techniques runs from ones that work really well basically all the time to ones that have some pretty meaningful trade-offs. And it depends.
Starting point is 00:33:49 So it depends on your traffic pattern. It also depends on what you're optimizing for. Like you can just say, I want to optimize my model, but that's not really what optimization is. You kind of have to pick a goal and work in that direction at the expense of other things. You could optimize for latency. You could optimize for throughput, for cost.
Assuming we're just trying to get the best possible latencies at a reasonable throughput and cost threshold, which comes from batching, the first thing that basically always works is going to be adding some kind of inference optimization engine. So are you familiar with vLLM and TensorRT-LLM? Yeah, I mean, at a sort of high level.
Yeah. So for anyone who's listening who hasn't really heard of these, basically you can take your model weights and you can take the Transformers library for writing inference code in Python, and you can just write some Python code and run it on your GPU, and it'll produce tokens. But there's also inference optimization frameworks like vLLM. It's a very popular open-source one.
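For reference, the vLLM path looks roughly like this. This is a minimal sketch of vLLM's standard offline-inference usage; the model name is only an example.

```python
# Minimal vLLM example (offline batch inference). The model name is just an
# example; use whatever open-source weights you're actually serving.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # loads weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
for output in outputs:
    print(output.outputs[0].text)  # the generated completion for each prompt
```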
There's also TensorRT-LLM, which is also open-sourced by NVIDIA. And what these do is provide a set of sort of out-of-the-box optimizations that make the model run faster with almost no downside. TensorRT-LLM in particular is what we use very heavily at Baseten. It works by building an engine for your model inference. This engine is a little limited in that it's built for a specific GPU and a specific model. So once you build the engine, you've got to rebuild it if you want to change anything. And the engine can be a little large, which can increase your image size and your cold start times a bit. But once you actually get it up and running, it's really, really fast, because it actually creates optimized CUDA instructions for every part of your model inference process. It's basically almost like compiling your model inference code into these optimized CUDA instructions.
And that's something that other inference frameworks don't do. So that's why TensorRT-LLM, in many cases, can have this really excellent performance. So that's kind of step one: you want to just make sure you're using a serving engine. You want to make sure you're using, of course, appropriate hardware. And then just one question on TensorRT. So the thing that I've heard, or the criticism I've heard about,
is that it's hard to use. So what makes it hard to use? It definitely has a steep learning curve. And again, there's this engine portability problem where you need to create the engine, but then you also need to package it up in a Docker image and go ahead and serve it. I think that the team at NVIDIA is doing a lot to make it easier to use. You know, we've seen things like input and output sequence lengths, which previously had to be kind of set in stone during the engine building process, become more flexible, flexibility on batch sizes, that kind of stuff.
It's also complicated to use because it provides a very generous API where you're able to twist a lot of knobs yourself and set it up. But yeah, I mean, we definitely have also experienced some of the pains of that developer experience. And that's why at Baseten, we built something called our TensorRT-LLM Engine Builder, where you create a YAML file and specify, you know, the sequence lengths, model weights, GPU count, GPU type, and various optimizations, quantization,
Starting point is 00:37:41 that kind of stuff you might want to put in. And then it builds and deploys it automatically for you for, you know, like supported models as well as fine tunes of those models. And that's actually, you know, speaking of our, you know, technically excellent sales people, we've actually had some of them be deploying models, you know, for demos and customers and stuff using this. So there, you know,
while the TensorRT-LLM developer experience can be a little rough under the hood sometimes, for a lot of cases it's possible to use this tooling to make it much easier to build these engines. Okay. All right, so basically, step one: use an inference engine to help do some of this automatically, or, I guess the best way to say it is, it gives you optimization without a lot of downside. What's next? So what's next is you start looking at trade-offs. You've got four sorts of levers to play with: you've got latency, you've got throughput, you've got cost, and you've got quality. So if latency, throughput, and quality
are absolutely critical, and the only thing you can be flexible with is cost, throw more GPUs at the problem. I love that solution because, you know, I'm one of the people you can get those GPUs from. But in many cases, you don't want to do that. You want to keep your costs down. You know, that's why you're coming to a dedicated deployment
is so that you can kind of have control over your costs and buy your tokens in bulk, like it's Costco or something. So generally you start thinking about how to trade off latency and throughput versus quality. And you have to do this very carefully, because when I say trade off versus quality, hopefully these optimizations that we're doing next have zero impact on the quality of your model output. So it's not so much that you're intentionally losing quality, more that at every step you have to verify your output, run your same evals as you've been running before, and make sure that your model output quality hasn't suffered through
Starting point is 00:39:59 this process. Because the first and most important optimization that's out there is quantization. And while we are, you know, constraining our discussion to large language models at the moment, this does actually apply for basically any kind of model. You can quantize, say, like a stable diffusion model and save on inference there as well. So quantization is the idea of changing the number format that your model weights and certain other things like activations, KV cache are expressed in. By default, model weights are almost always
expressed in a 16-bit floating point number format. So your model, let's say Llama 405B, has 405 billion parameters. Each one of these parameters is a 16-bit number, a 16-bit floating point number, and that's going to be two bytes of memory. So all these model weights are roughly 810 gigabytes of memory, because it's 405 billion times two bytes. Yeah. So as we discussed earlier, one of the big bottlenecks in inference is going to be your GPU memory bandwidth. So if you could move less weight through that bandwidth, you could move a lot faster. You could do your inference a lot faster. And actually, at the beginning, we also talked about pre-fill and how that's compute bound. Well, with your GPU at a lower precision, every time you cut your precision in half, you double the number of flops that you have access to. So it's a linear improvement. You're not generally going to double your performance from quantization; in fact, it'll usually be pretty far from that. But in terms of just the GPU resources that are available, you would have double the resources if all of your numbers were half as big.
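Since the arithmetic is simple, here is that footprint spelled out for a few precisions. This is illustrative math only; it counts weights alone, not the KV cache or activations.

```python
# Weight memory at different precisions (weights only; the KV cache and
# activations add more on top). Simple illustrative arithmetic.

params = 405e9  # Llama 405B

for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{label:>5}: ~{gigabytes:,.0f} GB of weights")

# FP16 is ~810 GB, FP8 ~405 GB, 4-bit ~200 GB. Halving the precision halves
# both the memory you need and the bytes pushed through memory bandwidth.
```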
So that exists. It's called quantization. It's the idea of converting your data format from, say, a 16-bit float to an 8-bit number format or even a 4-bit number format. Now, the potential downside is a loss of quality, because your number format now has a lower dynamic range. Dynamic range is the sort of breadth of values that are able to be expressed. And that's why we've found that the new FP8 format is very impactful here. So rather than going to INT8, which is an 8-bit integer number, which has a pretty constrained dynamic range, if you go to FP8, you get a great dynamic range because you're still doing floating point. And this can help with keeping your model's perplexity gain under control and keeping your output quality very high during quantization, and allow you to quantize more than just the weights. And with the new Blackwell GPUs that are coming out on the sort of next-generation architecture, we're expecting really great results from FP4, which is a four-bit floating point quantization format. Currently, if you want to go all the way down to four bits, which is popular especially for local inference, like running models on your MacBook or something, then you're constrained to those integer formats, which do have that very, very limited dynamic range. When you lose quality through quantization, how does that generally manifest itself?
Starting point is 00:43:28 So the best way to check for that is through something called perplexity gain. So perplexity is a calculation that you can do on a large language model where rather than generating outputs, you actually give it outputs. And then you see how, for lack of a better word, you see how surprised it is by those outputs. You see the number of times where it says, hey, this is not the token that I would have generated here in this known good output.
And with your quantized model... well, there's different ways of calculating perplexity, so it's difficult to do absolute comparisons model to model, for example. But within the same model, if you quantize it and you run your same perplexity check using the same algorithm and same data set again, you generally want your perplexity to not really move at all. There's a small margin of error, but you don't want to see a gain in perplexity, where the model is saying, hey, before, this is the token I would have generated and it's a good token, but now this is not what I would put out. That's how you see that, okay, maybe some of these model weights that you compressed... oh, it's not compression, I shouldn't say compressed... these model weights that you made a little less clear actually were very important and affecting the output. So you're watching both the sort of measured quality, through things like perplexity, and the observed quality, through just consistent customer use and seeing that the model output is satisfactory.
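In code form, the check he is describing is small. This sketch assumes you can pull per-token log-probabilities for a fixed reference text out of your serving stack; the `token_logprobs` helper and the model labels are placeholders.

```python
# Perplexity check sketch: score the same known-good text under the original
# and the quantized model and compare. `token_logprobs(model_id, text)` is a
# placeholder for however you get per-token log-probabilities, and the model
# ids are hypothetical labels.
import math
from typing import Callable, List

def perplexity(logprobs: List[float]) -> float:
    # perplexity = exp(-average log-probability); lower is better
    return math.exp(-sum(logprobs) / len(logprobs))

def perplexity_gain(token_logprobs: Callable[[str, str], List[float]],
                    reference_text: str) -> float:
    baseline = perplexity(token_logprobs("llama-405b-fp16", reference_text))
    quantized = perplexity(token_logprobs("llama-405b-fp8", reference_text))
    return quantized - baseline  # after quantizing, you want this close to zero
```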
What about speculative decoding? Absolutely. So that's kind of like the next step. Once you have your inference engine in place, once you have your quantization (and by the way, TensorRT-LLM can actually quantize your model for you during the engine building process, which is pretty cool), then you start to look at all of these more exotic approaches to generating more tokens. And this is where it gets really exciting, but it also gets even more dangerous to your output quality. So speculative decoding, there's a few different versions of this. There's one where you can do it with a draft model. There's also a variant that we've used a lot called Medusa, where you kind of fine-tune your model to have something called Medusa heads, which are able to create these tokens. But overall, the concept of speculative decoding is: if you run, say, Llama 3.2 3B and you ask it a question, it's going to give you a pretty reasonable response. It's a highly capable model, even though it's super small. But it's not going to be as good a response as, say, Llama 405B, which is over a hundred times bigger. But most of the tokens are going to be the same. At least a lot of the tokens are going to be the same. It might start with, hi, my name is. For a lot of questions, it's going to give answers that might have 50% the same tokens. So what if you could get all those tokens from the super small, cheap model and only turn to the big model when the small model gets it wrong? So the small model in the speculative decoding setup generates these draft tokens, and then it checks them with the big model. Kind of in the same way that it's a lot easier to tell if a Sudoku is done correctly (you can just scan it really quick and count up all the numbers) than to solve it yourself, it's a lot easier to verify that a token is correct. And by easier, I mean a lot less computationally intensive to verify that a token is correct than it is to generate that token from scratch. So as long as you have a really good verification process, then you're going to get the same output after implementing speculative decoding as you did before. With Medusa, instead of having a draft model, you sort of fine-tune your base model so that rather than generating one token at a time, it generates three or four tokens at a time. One thing that's really interesting with that is you have to fine-tune it with some sort of domain awareness, because otherwise your model might perform just fine at, say, math, and then fall short at software or humanities or something. So all of these approaches have potential downsides. For example, with speculative decoding, if your draft model gets every single token wrong, well, now you've actually made it slower, because you're running the draft model, validating every token, and regenerating every token from the larger model. But with careful implementation, there's definitely a lot of potential for upside with these approaches. You can see absolutely massive improvements to your tokens per second.
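Stripped down to control flow, the draft-and-verify loop looks something like this sketch. The three model calls are placeholders, it assumes the draft model always proposes at least one token, and real implementations (TensorRT-LLM, Medusa, and others) verify a whole block of draft tokens in one batched forward pass with probabilistic acceptance rather than looping like this.

```python
# Heavily simplified draft-and-verify control flow for speculative decoding.
from typing import Callable, List

def speculative_decode(
    draft_tokens: Callable[[List[str]], List[str]],    # small model proposes k tokens
    target_accepts: Callable[[List[str], str], bool],  # big model verifies one token
    target_next: Callable[[List[str]], str],           # big model generates one token
    prompt: List[str],
    max_new_tokens: int = 64,
) -> List[str]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        for proposed in draft_tokens(tokens):        # cheap draft tokens
            if target_accepts(tokens, proposed):     # cheap to verify...
                tokens.append(proposed)
            else:
                tokens.append(target_next(tokens))   # ...fall back to the big model
                break                                # and re-draft from the new context
    return tokens
```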
How much do you think some of these techniques are here to stay versus a moment in time, because we haven't developed essentially the chips that can handle the performance that we need? There are a lot of custom hardware manufacturers who are trying to take the fight to NVIDIA, which is interesting. And NVIDIA itself, of course, is making new and more powerful graphics cards all the time for this. I do think, though, that within any given hardware or any given model, there's still room for these techniques. So the model and the hardware establish a sort of baseline performance. And then from there, you can layer in all this stuff. So I think you will see, as new models come out, as new hardware comes out... you know, all of this is very new applied research, so newer and more efficient techniques are coming out all the time. But you'll see these approaches, or approaches like them, applied over and over again
Starting point is 00:49:40 as these new models and new hardware come out. Because the thing is, it's a lot of work to do this kind of performance optimization and to do it well. You're not going to always see it like day one for every new piece of hardware and every new model. But I do think that over time, these benefits are really going to compound, honestly.
If you look at TensorRT-LLM, for example: if you run a model on an A100, which is the sort of last-generation Ampere big GPU, versus an H100, which is the current generation, you get a bigger speedup than the baseline hardware specs would imply, because TensorRT-LLM is taking advantage of the architectural features of the H100 GPUs. And then you get that benefit. That's multiplied by the benefit you get from quantizing. That's multiplied by the benefit you get from speculative decoding. So as newer, more efficient models come out, as newer, more powerful GPUs come out, you're going to see even more benefit
from using these approaches rather than less. Right. So, I mean, I think most people, you know, running a model, building some sort of AI-powered application through RAG or something, they probably don't necessarily need to be looking at the GPU level. But if you are really trying to squeeze every bit of performance possible
out of the inference step, you probably do need to look at the hardware level. So you mentioned a couple of different chips. There's a lot of different GPUs, just like there's different CPUs. How do those impact the performance on inference? And are there certain GPUs that make more sense for certain models, or does that matter at all? Absolutely, it makes a huge impact. And like you said, you know, when you're just starting out, it probably doesn't make sense to look at this GPU level, but it's really a question of where do you want your competitive advantage to come from. And if you're committed to a certain model or a certain inference pattern
for a long enough period of time to see the benefits, that's where these optimizations can be really useful. In terms of the exact GPUs and their various advantages and disadvantages, the current generation is Hopper, and that architecture currently is available on the H100 GPU. There's also the H200 GPU, which is even bigger. There's also a GH200, which is a sort of hybrid between the two. And this is really great for running the biggest and latest models, especially because the Hopper and Lovelace,
Starting point is 00:52:30 which is the previous architecture, are the ones with support for FP8, which is that really effective quantization format. So within that, though, the smallest memory that's available is 80 gigabytes, which is just complete overkill for a huge class of models, including a lot of image models, which tend to be 12 or fewer billion parameters, as well as, of course, audio transcription models. You know, the largest Whisper model is only a couple billion, if that. So you also have the L4 GPU. That's the previous generation Lovelace,
Starting point is 00:53:10 which is, you know, it's a 24 gigabyte GPU. It has much lower memory bandwidth. So it's not great for large language models in many cases, but it's great for some of these image models, some of these audio models, some of these speech synthesis models. You have, you know, the previous generation Ampere cards, which have an A10, you have an A100. Both are great lower cost alternatives to the L4, the H100.
Sorry, no, A10s cost more than L4s. But A10s have higher memory bandwidth, so they're good for small language models. You have got older GPUs like T4s, which are great, you know, for just low-cost workloads of small models. And what we see often is people kind of graduating from one GPU to the next. You know, maybe you've trained a small model that does some specific modality, some new thing, and it runs just fine on the T4. Well, eventually, rather than getting your 30th T4 as traffic is picking up, you can start to get better performance and better cost characteristics by consolidating to A10s or L4s. So, you know, it's not just about the model. Obviously, the model matters a lot because your GPU needs to be big enough to hold your model and your KV cache and your batch inference.
Starting point is 00:54:30 But, you know, we also see people kind of upgrade over time. And one thing we've done to facilitate that is with H100 GPUs, you can actually split them up through a process called MIG. And no, MIG, unfortunately, is not a cool fighter plane. It stands for Multi-Instance GPU. With that, you can basically cut an H100 in half, and each half has 40 gigabytes of
VRAM. It's got like three-sevenths of the compute. It's got half of the VRAM bandwidth. And these MIG H100s offer really great performance characteristics for these smaller models. Generally, anything under like 12 or 15 billion parameters can fit on it. They offer you that FP8 access, and then they also
have really good cost characteristics, because you're only paying for half a GPU. Yeah. So I think we've covered a lot on performance, from essentially choosing a model, to choosing an inference engine, to really tuning the inference cycle through quantization and some of these other techniques, to picking a GPU.
Starting point is 00:55:51 Is there anything else on inference optimization you think that we should touch on? Yeah, I mean, there's really the whole network layer. There's the whole infrastructure layer that's very important. If you do all of this work to squeeze every second of time to first token out of your model by optimizing pre-fill, and then you take your GPU and you stick it in, you know, US East 1 in Virginia, and your user is in Australia, then all that model performance work is right out of the window because of the network latency.
Starting point is 00:56:28 So there's really like two completely separate domains that have to be considered together in highly optimal model serving infrastructure because it's both this model performance layer and it's also the infrastructure performance layer. So obviously there's locating your GPUs reasonably near to your user. That's important. With compound AI, it's minimizing the network overhead in the multiple hops that you're making from model to model. So, yeah, it's definitely that networking and infrastructure component where you have to consider, for example, cold starts. If you're getting a burst of traffic, you need time to scale up those GPUs and load in these very large images and these very large sets of model weights. So being good at that part of it
is absolutely essential for doing this in production, because otherwise your really flashy performance metrics are not going to be what your user actually experiences. Yeah, absolutely. That's a really good call-out. Like, at the end of the day, unless you're running this essentially in a closed system on an edge device, most likely you're making some sort of network call, and if that call is, you know, 900 milliseconds, it doesn't really matter what the performance is like on the actual inference piece. Exactly. We've even seen stuff as simple as switching from the requests library in Python to an HTTP client with connection pooling, saving, you know, that extra 10 or 20 milliseconds and making the difference between hitting the latency SLA and not hitting it. So, you know, there's a whole lot of work to be done at the infrastructure and the networking level to actually deliver this excellent performance to the end user.
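That particular fix is small enough to show; a minimal sketch using a pooled `requests` session, with a hypothetical endpoint URL:

```python
# Reusing a pooled HTTP connection instead of paying a new TCP/TLS handshake
# on every request. The endpoint URL is a hypothetical placeholder.
import requests

MODEL_ENDPOINT = "https://models.example.com/v1/predict"

session = requests.Session()  # keeps connections alive and reuses them

def predict(payload: dict) -> dict:
    response = session.post(MODEL_ENDPOINT, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()
```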
Also, like, how much does the programming language choice factor in as well? A lot of people's go-to when you're talking about AI and machine learning is Python, but Python's notoriously not the most performant language in the world. Is that something people need to also be considering? In some cases, that's where the inference engine can help a lot. For example, TensorRT-LLM is a C++ model server, basically. Well, okay, so there's TensorRT-LLM, there's also Triton, which is the Triton Inference Server. Those two generally go together. And we actually built our own variation of the Triton Inference Server that's even more lightweight, but that's also built in C++. So, yeah, there's a lot of C++ under the hood in these kinds of stacks, and that's going to help a lot with the performance. And, of course, the GPU inference part, that's going to be a problem for CUDA to handle. And that's where, again, TensorRT-LLM helps out. So, yeah, speaking as someone who is really only comfortable programming in Python and JavaScript,
Starting point is 00:59:33 I'm able to, you know, leverage a lot of these tools that people much smarter than I have have put together and stand on the shoulders of giants and be able to, you know, write my little Python code and not really worry about performance too much. Awesome. All right, well, let's switch gears a little bit and go quick fire.
So I would really love to be better at math, which I realized I had a great opportunity to do in college and totally squandered, getting a lot of Bs in math classes. But I was working on this transformer inference blog post a year ago with someone at Baseten. He's an engineer there, and he was sort of explaining to me the actual math behind figuring out, like, how do we prove that this inference is bound by GPU memory bandwidth
rather than something else. And just looking at these papers and all of the Q, K, and V and stuff... I like it when math has numbers in it, not letters. So it would be really cool if I could be a lot better at some of this math behind the inference. But I'm learning it slowly but surely. Yeah, never too late. What wastes the most time in the day?
Starting point is 01:00:48 Definitely uncertainty. You know, working at a startup, there's always a lot of stuff going on. It's like, oh, what should I work on? What should I do? Is this right? Is this good enough? Should I spend time on this or that? You know, uncertainty and just not pulling the trigger and making a decision is what I probably waste the most time. If you could invest in one company that's not the company you work for, who would it be? There's this great company called Pylon. They do customer support. We're customers of them.
Starting point is 01:01:16 I'm friends with the founders. They were actually in Chicago the other day. And I tried to tell them, like, hey, I'll trade you a couple hundred you know, a couple hundred of my shares for a couple hundred of your shares, and they didn't go for it. But they're a great company. They're working super hard. They have a lot of customers, including us, who really love them. And they're taking on some incumbents like Zendesk that I think are ripe for disruption. So, yeah, big fan of Pylon.
Starting point is 01:01:45 Cool. What tool or technology could you not live without? Honestly, that sounds super simple. It's Markdown support in Google Docs. I was so happy when that first came out because I use Google Docs for basically everything I do just because everyone has it and it's super easy to work with. And just not having to train the
markdown shortcuts out of my fingers to use it has been really nice. Which person influenced you the most in your career? I want to shout out, you know, a couple people. There's this guy Lee Robinson, who's also from Iowa. I was raised in Des Moines, Iowa. So I often say that I want Baseten to become Vercel and I want Philip to become Lee. That's a pretty simple way of me expressing my career goals. But I've also been influenced a lot by Patrick McKenzie, who goes by patio11 in a lot of internet spaces. I've really appreciated his writing on startups and entrepreneurship and
found a lot of wisdom there. So yeah, those are some of a very, very long list of influences.
Starting point is 01:02:54 Yeah. Good pick. So, and then finally five years from now, will there be more people writing code or less? Never bet against coders. Yeah. I mean,
Starting point is 01:03:03 I always, I always feel like, um, you know, that, an engineer's job is to solve problems. That's why you get paid the money that you get paid. So at the end of the day, that's not necessarily about sort of hands on keyboard. It's about essentially solving problems. So maybe the manifestation or the evolution of that is you're doing a lot more prompting to generate code that you're sort of working with,
Starting point is 01:03:28 but you're still, you know, essentially solving problems and putting these things together. Absolutely. I mean, there's some days with, with my job that spans all these different departments and functions, though,
there's definitely days where I don't write code, but I still feel like I'm a software engineer, even when I'm solving these other problems, just because it's more about the mindset and the approach you have to solve the problem rather than the exact problem you're solving. Yeah, I don't think the engineering mindset goes away. All right. Well, Philip, thanks so much for being here. I enjoyed this a lot. Cheers. Yeah, thank you so much for having me. This was a really fun time.
