Software Huddle - Deep Dive into Inference Optimization for LLMs with Philip Kiely
Episode Date: November 5, 2024
Today we have Philip Kiely from Baseten on the show. Baseten is a Series B startup focused on providing infrastructure for AI workloads. We go deep on inference optimization. We cover choosing a model, discuss the hype around compound AI, choosing an inference engine, and optimization techniques like quantization and speculative decoding, all the way down to your GPU choice.
Transcript
I think one of the challenges, especially people kind of jumping into this space right now, is there's just so many models out there.
You know, there's closed, there's open source, big, little.
You know, when starting some kind of new project, like, how do you even pick?
Like, where do you start?
I generally like to recommend that people start with the largest, most capable model when they're in the experimentation phase.
Unless there's some constraint that means you can't do that,
like if you're building exclusively for edge inference on an iPhone or something.
If you're starting with the most capable model,
that eliminates one constraint during this experimentation and prototyping process,
which is, oh, is my model not good enough?
When does it make sense for somebody to fine-tune a model?
So I saw there's this great tweet that goes around a lot,
no GPUs before PMF,
which is the idea that you shouldn't be making
these very capital-intensive investments
in training and fine-tuning
ahead of figuring out what you actually want to build.
So in terms of inference,
what is the expensive part of inference?
Why is that particularly a slow operation?
Hey, everyone. Thanks for joining.
Today, we have Philip Kiely from Baseten on the show.
We go deep on inference optimization.
We cover choosing a model,
discuss the hype around compound AI,
choosing an inference engine, optimization techniques like quantization and speculative decoding,
all the way down to your GPU choice.
There's a lot of meat on the bone in this one, so I think you'll enjoy it.
As always, if you have questions or suggestions, feel free to reach out to me or Alex.
And with that said, let's get you over to my conversation with Philip. Philip, welcome to Software Huddle. Thanks, Sean. I'm super glad
to be here today. Yeah, thanks. Thanks for being here and glad you reached out. So to start, can
you give a little bit of background about yourself and the work that you do at Baseten? How did you
end up working at an AI inference company? And what was your background in AI prior to that?
I joined Baseten two years, nine months ago. So I've been there in startup years basically forever.
Yeah, yeah. That's like, that's at least a decade in real life.
Yeah, exactly. So I joined as a technical writer, and have since moved into more developer relations type stuff. But
you know, it's a 50-person Series B startup, so titles are titles, roles are roles. It's really
just about doing what you can do to drive what needs to happen every single day.
Anyway, Baseten itself is a model inference startup. So we're an AI infrastructure company that focuses on inference.
We run open source, fine tune, and custom models
with really high performance on GPUs
that can be distributed across multiple clouds,
your cloud, our cloud, some combination of the two.
And when I joined,
it was more of a data science-facing tool.
Back then, this was before GPT-3.5 was out.
This was before ChatGPT, before Whisper, before DALL-E and Stable Diffusion.
So it was a somewhat different market and audience.
And I joined with very little background in AI or ML.
You know, I got like a B- in statistics in college.
But I was able to learn on the job.
And the cool part about my job is I basically just get to talk to all of our engineers, find out the cool stuff that they're working on and then write it down and have it make sense.
And in order for that to happen, it needs to make sense to me first.
So basically, my job is to learn really interesting things and then share them
with the world. Yeah, I think, you know, I've always found, at least for myself, that a way to really internalize something and learn it is to force yourself to try to teach it to somebody, because you have to understand it enough that you can teach another person. And writing is, you know, one form of teaching. And then of course, actually teaching, or public speaking, or whatever it is, becomes like a forcing function for you to learn different things. And I even feel like I use the podcast medium for that, as it becomes a forcing function for me to learn about different topics where I may not necessarily have deep expertise.
Absolutely. You know, I, uh, I can't go up on stage in a conference or meetup or something
and just wing it.
So there's a lot of learning
that goes into every single one of those things.
That's where that forcing function comes in,
like you said.
Yeah, and it has to feel authentic too.
And I think that's where,
you know, based on my own experience,
sometimes in very technical products,
selling to technical people,
salespeople can kind of struggle
where even if they have a script
and they have, you know,
they've been trained
and they have a first call deck,
it's like they're speaking
a foreign language in some ways
because it's just never going to be
sort of at the comfort level
as somebody who's like
actually worked in that space
as an engineer
and can kind of deliver
the same material.
Absolutely.
That said, I have worked very closely with our salespeople
to help train them and do sales enablement
in some of this technical material.
And I've been really impressed by how they've grasped
the complexities of stuff like the different GPUs
and how they have various capabilities,
the different model inference frameworks, the model performance optimizations,
the different modalities. There's a ton to learn. So I've
learned it and it's been cool helping them learn it as well.
Yeah, I didn't mean to dox all salespeople.
Yeah, absolutely. I think there's some great salespeople out there that are going to be able to grok this stuff.
But going back to Baseten, why focus on inference alone?
Why be essentially an AI inference infrastructure company?
That's a great question.
And it's something that we kind of found our way into.
Because originally, when I joined, the platform was much broader.
And we've experimented with other things.
We did a whole fine-tuning platform maybe a year and a half,
almost two years ago at this point as an experiment.
And what we found was two things.
Number one, we're a pretty small company.
You know, at this point we're 50 people.
We need to focus on being the absolute best at one thing
and own the right to enter other markets in this AI space.
And the second thing is that inference is really where a lot of the value is.
You know, you can have the best model
or you can have, you know, some insight into how to build a new product
around existing AI capabilities.
But if you can't run your models in production,
if they're not secure, if they're not reliable,
if they're not performant enough that you're going to hit your latency targets,
if they can't handle throughput, spikes in traffic,
top of product hunt, top of hacker news, launch days,
then the
rest of it kind of doesn't matter.
So we really see inference as the centerpiece of this AI infrastructure space around which
everything revolves.
And we want to go become really, really good at that one central thing, and then over time
add on more capabilities.
Yeah, I think that makes a lot of sense.
I mean, maybe like a rough analogy to that
is kind of even like traditional data warehousing
where essentially people,
like Snowflake came along and separated
compute and storage.
So then you have essentially like the query engine, which is the key thing that you need in analytics in order to be performant.
And then you need large storage.
And what you're sort of, you know, what you guys are doing and other companies as well are like sort of decomposing this like AI stack.
So inference is sort of the compute engine that's important for generative AI.
And you're sort of separating that from other parts of it.
And then you can focus on like, let's make this thing like really efficient.
Absolutely.
And you do need to do more than just get a GPU
and put the model on it.
With inference, we've been able to learn a lot about,
for example, observability
and how those metrics and decisions can differ
when you're working with an AI model within your product rather than just some ordinary API.
So the observability stack looks a little different.
The logging looks a little different.
And the scaling looks a lot different, actually, the sort of traffic-based auto-scaling that you have to build around that is, you know, something that was surprising to us when we came across the correct answer
because it was, you know, not something you might intuit from previous experience building
in other domains.
Mm-hmm.
Yeah.
I kind of feel like when it comes to a lot of this stuff,
you need like a certain amount of like exposure to it,
to building up sort of the intuition that comes along with it,
to understand like what is the right, you know,
lever to sort of pull on or thread to pull on or thing to do in order to get the performance that you need,
whether that is like from an accuracy standpoint or even from like a compute standpoint
is you need to build up essentially enough knowledge
that some of that stuff becomes innate.
And I think one of the challenges right now
is that a lot of people,
basically this technology hasn't been around long enough
for that many people to kind of have their 10,000 hours
of building these types of systems.
Absolutely.
That's something we've seen even in recent months.
For example, you're familiar with the Whisper model for audio transcription.
Yeah.
So it's basically, at the architectural level, a large language model.
It has autoregressive inference.
It creates tokens.
So with that familiarity with the model,
recently we realized, hey, this is basically an LLM.
What if we just treated it as an LLM for model performance?
And we used all the same optimizations
that we do, like TensorRT-LLM, for serving Whisper.
And it turns out it works great.
So like you said, it takes a long time
to get sort of familiar and comfortable with this space and be able to start making these connections.
But once you do, you get to build and discover really cool things.
Yeah.
It's kind of that like building up that general knowledge from like a first principle standpoint.
So that then once you've built that expertise, you can start to draw lines to other things that are connected to it
or something that maybe from a surface level looks like two different things,
but is actually, when you have the expertise, you're able to see,
actually, this problem is kind of the same as this other problem.
We can solve it this way.
I love when that happens, when I run into a new problem and then I say, actually, wait,
you're an old problem wearing a Scooby-Doo mask. You just take off the mask and hit it with the
tried and true solution. Yeah, exactly. So I want to discuss a bunch of AI-related topics with you.
And we're starting with model selection. I think one of the challenges, especially people kind of
jumping into this space right now, is there's just so many models out there.
There's closed, there's open source, big, little.
When starting some kind of new project, how do you even pick?
Where do you start?
I generally like to recommend that people start with the largest, most capable model when they're in the experimentation phase.
Unless there's some constraint that means you can't do that,
like if you're building exclusively for edge inference on an iPhone or something.
If you're starting with the most capable model,
that eliminates one constraint during this experimentation and prototyping process,
which is, oh, is my model not good enough?
Because you're using the best model.
So either the model is capable, or if it's not capable, you need to either go do some
custom fine-tuning or training work to make it capable, which is a whole other discussion.
Or the exact thing you're trying to build isn't quite ready yet,
so either wait for the next iteration of the model
or figure out a way around those limitations.
But anyway, if you start with the biggest,
most powerful model,
you're able to eliminate that entire category
of questioning from your experimentation.
You're able to figure out the core flows of your product.
You're able to figure out the capabilities that you're looking for,
the inputs and outputs that you might expect.
And then once you're done with that,
then you have all of those other things locked in.
Then it starts to make sense to check out different models.
Because the thing about model selection is you really need strong evaluation criteria.
You can't really just work on vibes and say, hey, this model just feels better.
At least you can't at scale.
So when you're able to lock those other things in and build a strong set of criteria. And when I say evals, I don't mean
just like building an evals benchmark in the same way as MMLU or any of the other bunch of benchmarks
that every new model gets run against. I mean, figuring out for your specific product, how you
can tell if it's working and how you can tell if it isn't, and then running different models through
that. Then you can start trying smaller models to see like, okay, can I save substantial amounts
of money on inference by using a smaller model here?
You can also try composing multiple models together.
Say you've been using a very large sort of everything model, I like to call them, where
you have, say, a multimodal model that can do vision and language and maybe audio and speech and everything.
Well, maybe you can decompose that, have small, targeted, much cheaper to operate models that handle each of those modalities and then compose those models altogether.
You have a lot of options for optimization after that initial prototyping phase.
But when you're just playing around, just play around with the best model. Yeah, I mean, I think it's kind of about reducing variance, essentially. There's a lot of things that can go wrong, so let's take the quality of the model off the table. It's kind of like, if you're running an experiment in a company that requires a person to execute something, it's much better to put probably your best person on that thing.
So that way, if it fails, you know it wasn't the individual that led to the failure.
Whereas if you put somebody who's maybe not so good at that, then it could be that person was just bad at executing whatever that project happened to be.
In terms of figuring out whether things are working or not, what are the strategies there?
How do I know whether something is working well or not?
And how do I create essentially my own benchmarking mechanism for that
so that when I do go and try maybe to try a smaller model or a different model,
I can test whether I'm reaching the performance I need or was able to reach before?
So there's a couple different approaches here.
I think the strongest approach is to use an evaluation framework or evaluation library,
build a set of test inputs, build a set of expected outputs.
There's a sort of LLM-as-judge pattern that can be very effective here for making sure
that you're able to run through large numbers of checks
and then have a large language model, perhaps the original model that you were using,
judge the quality of those outputs at scale.
But it's a tough problem, and there's a reason that there's so many teams and companies working on evals,
because it's absolutely critical, but it also can be very dependent on your product.
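A minimal sketch of the LLM-as-judge pattern described here, assuming an OpenAI-compatible chat-completions endpoint; the model names, prompts, and 1-5 rubric are illustrative placeholders, not any particular product's eval setup.

```python
# Sketch of an LLM-as-judge eval loop: run test inputs through the model under
# test, then ask a (usually larger) judge model to grade each output.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint; swap base_url for a dedicated deployment

test_cases = [
    {"input": "Summarize: the meeting moved to Tuesday.", "expected": "Meeting moved to Tuesday."},
    # ... your own product-specific inputs and reference outputs go here
]

def generate(prompt, model="candidate-model"):  # placeholder model name
    resp = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def judge(prompt, expected, actual, model="judge-model"):  # placeholder judge model
    rubric = (
        "Score the CANDIDATE answer 1-5 against the REFERENCE for the given TASK. Reply with only the number.\n"
        f"TASK: {prompt}\nREFERENCE: {expected}\nCANDIDATE: {actual}"
    )
    resp = client.chat.completions.create(model=model, messages=[{"role": "user", "content": rubric}])
    return int(resp.choices[0].message.content.strip())  # in practice, parse defensively

scores = [judge(c["input"], c["expected"], generate(c["input"])) for c in test_cases]
print(f"mean score: {sum(scores) / len(scores):.2f}")
```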
One thing we work a lot with is transcription models. And so we're always looking at word
error rate. But you have to drill a little deeper into that. So is it word error rate in general?
Is it consistently messing up names?
Is it consistently messing up the titles of medicines? There's certain errors that are more
impactful than others. So it's a combination of a little bit on the technical side trying to
set up the framework for evaluation, but mostly it's a domain understanding
problem where you have to be really clear about what you're trying to accomplish with your system
In terms of, like, cost savings, you know, if I decide, okay, I want to save
some money, I'm going to go and use an open-source model, then I need to run it. I've got to get some GPUs. I've got
to run my own cluster and stuff like that.
Then I've got to manage that thing, handle updates and stuff.
Am I actually going to
save any money that way?
By taking on that work,
is it really
an apples-to-apples comparison
there? Or is that
a false way to think about potentially saving costs?
That's a great question.
It's really total cost of ownership, right?
So I think about it less as going from closed models to open models
and more about going from shared inference endpoints to dedicated deployments.
Because you can get a shared per-token endpoint of Llama 3.1, or, you know, for different image generation models, or Whisper. You can get shared inference endpoints
for basically all of the big-name open-source models.
And then, just like when you're using the ChatGPT or OpenAI GPT-4 API,
or when you're using Anthropic's Claude APIs,
you're still paying per token.
You're just hitting an endpoint.
It's all in the same format.
And that is really great in the early stages.
If you're consuming a few thousand,
even a couple hundred thousand tokens a day, and you're paying per million tokens, you're probably doing great on those shared inference endpoints.
And the additional cost of having your own GPUs is not going to make sense.
But as you scale up, then the economies of scale start to work in your favor.
So I generally tell people to switch to a dedicated deployment once they have enough traffic to consistently saturate a single GPU of whatever type makes sense for their model.
Because then you're starting to buy your tokens in bulk.
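As a back-of-envelope sketch of that breakeven, with made-up placeholder prices; plug in your actual per-token rate and GPU pricing, and remember that throughput limits, latency SLAs, and privacy shift the decision too.

```python
# Rough breakeven between a shared per-token endpoint and a dedicated GPU.
# The prices below are illustrative placeholders only, not real quotes.
price_per_million_tokens = 0.50     # $ per 1M tokens on a shared endpoint (placeholder)
gpu_price_per_hour = 2.00           # $ per hour for a dedicated GPU (placeholder)

gpu_cost_per_day = gpu_price_per_hour * 24
breakeven_tokens = gpu_cost_per_day / price_per_million_tokens * 1_000_000
print(f"dedicated starts to pay off around {breakeven_tokens / 1e6:.0f}M tokens/day,")
print("assuming one GPU can actually serve that volume within your latency SLA")
```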
But you also get a bunch of other benefits that can make sense, including making
sense at somewhat smaller scales. The number one thing, obviously, is privacy. If you are the one
running the GPU, then your inputs and outputs can't be, you know, read by OpenAI or anyone else.
There's also the customization aspect, where not only are you able to control the model,
in that you can do fine-tuning with LoRAs,
you can do an entirely custom model, all that sort of stuff. But you can also control things like
setting the batch size to trade off between latency and throughput to get your exact cost
characteristics while still hitting your latency SLAs. So there's a bunch of, you know, aspects of
control, privacy, reliability, not having noisy neighbors.
Someone hits the endpoint with a million requests and now all your stuff is slow.
All that stuff just doesn't happen when you have dedicated deployments.
So they can have a better total cost of ownership story once you reach scale or if you have certain regulatory or compliance things or just privacy concerns.
And in terms of saturating a GPU, what's that look like?
Yeah. So I'm not saying you need 100% utilization of a GPU at all times, but generally what I'm
looking for is a sort of high volume traffic pattern that is, well, it doesn't have to be super consistent and predictable.
Scaling GPUs up and down, while difficult, is pretty doable.
So you can get pretty fast cold starts denominated in seconds in many cases for these GPUs. It's really more about just having enough traffic so that you're running,
you know, batches of inference. Generally, you're running multiple requests at the same time,
because if I have a whole H100, or in this case two or four H100s, just for me to play with the new Nemotron model, then that's a big waste of compute resources. But if I'm, you know, sending through some pretty consistent, stable traffic, then I'm actually getting the benefits of buying my tokens in bulk.
Okay. All right. So we've reached the point where we've got our model, and we've figured out whether we're going to rely on a shared inference endpoint or essentially run this ourselves.
When does it make sense for somebody to fine tune a model?
So I saw there's this great tweet that goes around a lot.
No GPUs before PMF, which is the idea that, you know,
you shouldn't be making these very capital-intensive investments in training and fine-tuning
ahead of figuring out what you actually want to build.
And, of course, there are exceptions.
If you're a foundation model lab, obviously, you need the GPUs to get the PMF.
But if you're just building a product, I do think that fine tuning is in many
cases, kind of the last thing you should reach for. It has a ton of useful applications. And
there's a bunch of places where it makes sense to use. But first, you know, obviously prompting,
and then patterns like retrieval augmented generation, if you want to feed in real time
data. But overall, when I think about fine tuning, you know, it's less about changing the information in the model.
You know, that's really what RAG has sorted out.
It's more about changing the behavior of the model.
So you can fine tune a model to add, say, like function calling.
If it doesn't support it out of the box, you can fine tune a model to do better at math or code or something or some specific domain.
We work with Vito, who trains models from scratch rather than fine-tuning, to make these domain-specific models for stuff like medicine and finance, which, you know, can also be achieved
in some cases through fine-tuning.
But, you know, it's really, again, more about changing the behavior of the model versus changing
the knowledge of the model where fine-tuning starts to be super useful. So you mentioned
earlier this idea of stringing multiple models together, this concept of compound AI. And that's
something I've heard a lot of buzz around this year. There was a paper from Berkeley's AI research group, I think earlier this year, that talked about this concept.
So I wanted to talk a little bit about that. Can you explain what is compound AI? Why are
people excited about it? Yeah. So compound AI, a little bit of a buzzword, but there's a lot
of substance there in that you want to basically introduce new capabilities to a model.
So we, you know, we're just talking about fine tuning is one way to do that.
But as a very simple example, let's just talk about math.
So, you know, LLMs can't add two numbers together with any degree of reliability.
That's just not what they're meant for.
So, you know, you can use function calling, you can use tool use to send, you know, a bunch of
mathematical functions to the LLM with an explanation of what each one does, along with
their parameters. It'll select a function, send it back. You can run that math in, you know, your
Python code interpreter, send back the answer, and ask the LLM to give an explanation. Well, in this super simple, trivial, contrived example, we've made two calls to the LLM, and we've done one piece of business logic in between. That's an example of multi-step inference.
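A stripped-down sketch of that two-call flow, with hand-rolled tool dispatch rather than any particular provider's function-calling API; the client, model name, and helper functions are illustrative assumptions.

```python
# Two LLM calls with one piece of business logic in between: the model picks a
# math tool, plain Python executes it exactly, and the model explains the result.
import json
from openai import OpenAI

client = OpenAI()        # any OpenAI-compatible endpoint
MODEL = "some-llm"       # placeholder model name

TOOLS = {"add": lambda a, b: a + b, "multiply": lambda a, b: a * b}

def ask(prompt):
    resp = client.chat.completions.create(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

# Call 1: have the model choose a tool and arguments as JSON.
choice = json.loads(ask(
    'User asks: "What is 1234 + 5678?" '
    'Reply with only JSON like {"tool": "add", "args": [1234, 5678]}.'
))  # in practice, validate and retry if the model does not return clean JSON

# Business logic: run the math in ordinary code, where it is exact.
result = TOOLS[choice["tool"]](*choice["args"])

# Call 2: have the model turn the exact result into a user-facing answer.
print(ask(f"The result of {choice['tool']}{tuple(choice['args'])} is {result}. Explain briefly."))
```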
inference. There's also examples of multi-model inference. For example, if you're doing AI
phone calling, like Bland is a great example of this. You might have seen their billboards
around where it's like, why are you still hiring humans? They have this AI phone calling
platform where you call it up and it can do customer support, it can do ordering a pizza,
that kind of stuff. And if you think about it,
doing a real-time phone call with AI
is three different parts.
You need to take what the person on the phone is saying
and turn it into words written down, so text.
You need to respond to that text
using a large language model.
And then you need to synthesize that text into speech.
So you could accomplish this with a single, like, omni-type model, but you could generally get much better performance and cost
characteristics if you actually compose three models together, one for transcription, one for
LLM, and one for speech synthesis. So either of these can be examples of compound AI.
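A high-level sketch of that three-model chain; the transcribe, respond, and speak functions are placeholders standing in for whatever STT, LLM, and TTS endpoints you actually deploy.

```python
# Compound AI for one voice turn: speech-to-text -> LLM -> text-to-speech.
# Each stage could be a separate, smaller, purpose-built model on its own hardware.

def transcribe(audio_bytes: bytes) -> str:
    # Placeholder: in practice, call a Whisper-style transcription endpoint.
    return "I'd like to order a pizza."

def respond(history: list[dict], user_text: str) -> str:
    # Placeholder: in practice, call an LLM endpoint with the conversation so far.
    return "Sure, what toppings would you like?"

def speak(text: str) -> bytes:
    # Placeholder: in practice, call a speech-synthesis endpoint returning audio.
    return text.encode("utf-8")

def handle_turn(history: list[dict], audio_in: bytes) -> bytes:
    user_text = transcribe(audio_in)                          # model 1: transcription
    history.append({"role": "user", "content": user_text})
    reply = respond(history, user_text)                       # model 2: LLM
    history.append({"role": "assistant", "content": reply})
    return speak(reply)                                       # model 3: speech synthesis

print(handle_turn([], b"<caller audio>"))
```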
Compound AI is a sort of framework of thought. I still haven't come up with exactly the right noun that I want to use for compound AI, for composing multiple models, multiple steps, or both into a single chain of inference with, you know, conditional execution logic, business logic, all that stuff baked in.
Yeah, and sometimes, like, correct me if I'm wrong,
it's not always necessarily chaining sort of Gen AI models together,
but it could actually be, like, a Gen AI model with traditional ML also plugged in. Like, I think AlphaCode, I believe, does something like that, where they use an LLM to generate, like, a million potential programming solutions to these programming problems, and then they use a second model, I don't think it's an LLM-based model, but they use something else essentially to filter those down to get to the solution that they believe to be the most correct. Absolutely. I think if you want to look at it at the very highest level, compound AI is
about adding specialization back into generative AI. You have these highly capable general models,
like an LLM can do a pretty good job at almost anything. But if you're building
these, you know, production-ready AI systems that are going to need to be fast, reliable, accurate,
then you kind of need to add specialization back in to have an edge. And it's really exciting to
work on because now we're building, you know, real applications out of AI, sort of AI-native applications,
rather than just wrappers around a single capable model.
Yeah. It's kind of interesting, this evolution, because historically, before we had these kind of
Gen AI models, where ML really excelled was something that was a really specific task.
You could essentially tune something to do fraud detection, spam detection, these types of things, classic classification, and it could
perform really well at that. And then the incredible thing that blew everybody away
with large language models was suddenly you had this super general thing that you could talk to,
it would answer whatever, it would generate some sort of response.
But in order to actually, I think, serve a lot of practical use cases, you need sort
of both approaches because the general model is just going to be like, if it doesn't have
a good answer, it might hallucinate something or like the quality is going to be not at
the level that you need.
So then you need to do something to essentially make it more specialized.
And it seems like at the moment
where we're seeing the best performance
is sort of these like hybrid systems
or this compound AI approach
that brings both of these worlds together
of the like specialization and the generalization.
Exactly.
You can almost think of the general models
as a sort of front end or user interface for these more specialized tools.
And, you know, that kind of system has these resources, like data storage and code execution and API calls. And that general language model is just the interface that
you're using to interact with all of these things in a new way.
Yeah. So when it comes to, I think combining models makes a lot of sense, but like, what about the performance?
Like already, you know, talking to a model can be relatively slow for a lot of applications.
And now we're talking about, like, chaining multiple models together. Or even, you know, among the large models that sort of compete against each other,
some of them are better at answering certain types of questions than others. Probably the best-case solution would be you'd put in a prompt and you would fan out to all the
models, they'd all answer, and then you'd have another model pick the best answer or something
like that. But that would be really expensive operationally and also expensive from probably a
cost perspective. So how do you get the performance out of these systems that can actually serve
like these use cases? Yeah. So, you know, on the quality side, what you're talking about is
sort of model routing. And there's a bunch of great teams working
on that. I can shout out my friends at Unify are an example of people who are working on that,
where if you have like a simple query, maybe it gets sent to a small model if you have a more complex one it gets sent to a large model
But the other half of performance is the latency side, right, and the cost,
where the difficulty of building with compound AI is that you have to iron out the performance bugs and the, you know, 10 milliseconds of network
latency here and the cold start time there and the, you know, queue backing up at this stage
in your graph. That's where having, you know, really strong model performance tooling and
really strong distributed infrastructure tooling has to come together to make these systems possible.
So in terms of inference, like what is the expensive part of inference?
Like why is that particularly like a slow operation?
Yeah, it depends. Let's, you know, for the sake of keeping this podcast under five hours, I'll talk a little bit just about large language models and other similar autoregressive models.
We talked about Whisper before.
It has very similar performance characteristics.
So there's kind of two phases of LLM inference.
There's the pre-fill phase and there's the token generation phase.
So the pre-fill phase is when the model
gets the prompt and needs to come up with the first token. And you're kind of parsing every
token within the prompt. This part is GPU compute bound. So in general, the teraflops that your GPU is capable of is going to limit the speed of prefill.
And then there's the autoregressive phase, the token generation phase,
where the whole previous input plus everything that's been generated
gets passed through the model over and over and over again.
And this phase is actually going to be bound by GPU memory bandwidth. So for example, an H100 GPU has 3.35 terabytes per second
of GPU memory bandwidth.
And that is the limiting factor in many cases
on how quickly it can execute a model like Llama 405B
where all of those weights have to be loaded
through memory over and over again.
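To make that concrete, here's a back-of-envelope version of the memory-bandwidth argument; it assumes FP16 weights and an eight-GPU node, and it ignores batching, KV cache traffic, and interconnect overhead, so treat it as a rough upper bound on single-request decode speed rather than a benchmark.

```python
# Rough roofline estimate: token generation streams the weights through GPU
# memory once per decoding step, so bandwidth / weight-bytes bounds the speed.
params = 405e9                     # Llama 3.1 405B parameters
bytes_per_param = 2                # FP16 weights
weight_bytes = params * bytes_per_param          # ~810 GB of weights

h100_bandwidth = 3.35e12           # 3.35 TB/s of memory bandwidth per H100
num_gpus = 8                       # the weights have to be sharded across a node anyway

tokens_per_second = (num_gpus * h100_bandwidth) / weight_bytes
print(f"upper bound: ~{tokens_per_second:.0f} tokens/sec for a single request")
```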
So again, if you have, you know, it depends, right?
So if you have a case where you're sending a very long prompt
and expecting a very short answer,
you could potentially be bound on GPU compute
and have a sort of prefill, time-to-first-token problem.
Right.
Or you could, you know, have a more traditional multi-turn chat, for example, where the LLM
is going to be doing a lot of generation,
then you have a memory
bandwidth problem. I'd say more often than not, we're running into memory bandwidth limits during token generation rather than compute limits during prefill,
but definitely both make
an appearance in just about every performance optimization puzzle.
And what are some of the techniques to try to optimize the inference cycle?
Absolutely.
So the range of techniques runs from ones that work really well,
basically all the time to ones that have some pretty meaningful trade-offs.
And it depends.
So it depends on your traffic pattern.
It also depends on what you're optimizing for.
Like you can just say, I want to optimize my model,
but that's not really what optimization is.
You kind of have to pick a goal and work in that direction
at the expense of other things.
You could optimize for latency.
You could optimize for throughput, for cost.
Assuming we're just trying to get the best possible latencies
at a reasonable throughput and cost threshold,
which comes from batching,
the first thing that basically always works
is going to be adding some kind of inference optimization engine.
So are you familiar with vLLM and TensorRT-LLM?
Yeah, I mean, at a sort of high level.
Yeah.
So for anyone who's listening who hasn't really heard of these,
basically you can take your model weights and you can take the Transformers library
for writing inference code in Python,
and you can just write some Python code and run it on your GPU,
and it'll produce tokens.
But there's also inference optimization frameworks like VLLM.
It's a very popular open source one.
There's also Tensor RT-LLM, which is also open sourced by NVIDIA.
And what these do is provide a set of sort of out-of-the-box optimizations that make
the model run faster with almost no downside. TensorRT-LLM in particular is what we use
very heavily at Baseten. It works by building an engine for your model inference. This engine is
a little limited in that it's built for a specific GPU and a specific model.
So once you kind of build the engine, you've got to rebuild it if you want to change anything.
And the engine can be a little large, which can increase your instance, your image size and your cold start times a bit. But once you actually get it up and running, it's really, really fast because it actually creates optimized CUDA instructions for every part of your model inference process.
It's basically almost like compiling your model inference code into these optimized CUDA instructions.
And that's something that other inference frameworks don't do.
So that's why TensorRT-LLM, in many cases, can have this really excellent performance.
So that's kind of step one, is you
want to just make sure you're using a serving engine.
You want to make sure you're using, of course,
appropriate hardware.
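For anyone who hasn't touched these engines, here's roughly what the vLLM path looks like, as a minimal sketch; the model name is just an example, and the TensorRT-LLM path differs because it compiles a per-GPU, per-model engine first.

```python
# Minimal vLLM serving sketch: load a model once, then run batched generation.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # example model; pick one that fits your GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain what a KV cache is in one sentence.",
    "Give me three names for a coffee shop.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```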
And then just one question on TensorRT.
So the thing that I've heard, or the criticism I've heard about it,
is that it's hard to use.
So what makes it hard to use?
It definitely has a steep learning curve.
And again, there's this engine portability problem where you need to create the engine,
but then you also need to package it up in a Docker image and go ahead and serve it.
I think that the team at NVIDIA is doing a lot to make it easier to use.
You know, we've seen things like input and output sequence lengths,
which previously had to be kind of set in stone during the engine building process, become more flexible, along with flexibility on batch sizes, that kind of stuff.
It's also complicated to use because it provides a very generous API
where you're able to twist a lot of knobs yourself and set it up.
But yeah, I mean, we definitely have also experienced
some of the pains of that developer experience.
And that's why at Baseten, we built something called our TensorRT-LLM Engine Builder,
where we create a YAML file and specify, you know,
the sequence lengths, model weights, GPU count,
GPU type, and various optimizations, quantization,
that kind of stuff you might want to put in.
And then it builds and deploys it automatically for you for, you know,
like supported models as well as fine tunes of those models.
And that's actually, you know, speaking of our, you know,
technically excellent sales people,
we've actually had some of them be deploying models, you know,
for demos and customers and stuff using this.
So there, you know,
while the TensorRT-LLM developer experience
can be a little rough under the hood sometimes,
for a lot of cases, it's possible to use this tooling
to make it much easier to build these engines.
OK.
All right, so basically, you use an inference engine to help do some of this automatically. I guess the best way to say it is it gives you optimization without a lot of downside.
What's next?
So what's next is you start looking at trade-offs. You've got four sorts of levers to play with.
You've got latency, you've got throughput,
you've got cost, and you've got quality.
So if latency, throughput, and quality
are absolutely critical,
and the only thing you can be flexible with is cost,
throw more GPUs at the problem.
I love that solution because, you know,
I'm one of the people you can get those GPUs from.
But in many cases, you don't want to do that.
You want to, you know, keep your costs down.
You know, that's why you're coming to a dedicated deployment
is so that you can kind of have control over your costs
and buy your tokens in bulk, like it's Costco or something.
So generally, you start
thinking about how to trade off latency and throughput versus quality. And you have to do
this very carefully because, you know, when I say trade off versus quality, hopefully these
optimizations that we're doing next have zero impact on the quality of your model output.
So it's not so much that you're intentionally losing quality, more that at every step you have to verify your output, run your same evals as you've
been running before, and make sure that your model output quality hasn't suffered through
this process.
Because the first and most important optimization that's out there is quantization.
And while we are, you know, constraining our discussion to large language models at the moment,
this does actually apply for basically any kind of model. You can quantize, say, like a stable
diffusion model and save on inference there as well. So quantization is the idea of changing the number format
that your model weights and certain other things
like activations, KV cache are expressed in.
By default, model weights are almost always
expressed in a 16-bit floating point number format.
So your model, let's say Llama 405B, has 405 billion parameters. Each one of these parameters is a 16-bit floating point number, and that's going to be two bytes of memory. So all these model weights are about 810 gigabytes of memory, because it's 405 billion times two bytes.
Yeah.
So as we discussed earlier, one of the big bottlenecks in inference is going to be your GPU memory bandwidth.
So if you could move less weight through that bandwidth, you could move a lot faster.
You could do your inference a lot faster.
And actually, at the beginning, we also talked about prefill and how that's compute bound. Well, with your GPU at a lower precision, every time you cut your precision in half, you double the number of flops that you have access to. So it's a linear improvement. You're not generally going to double your performance from quantization. In fact, it'll usually be pretty far from that. But in terms of just, like, the GPU resources that are
available, you would have double the resources if all of your numbers were half as big.
So that exists. It's called quantization. It's the idea of converting your data format from,
say, a 16-bit float to a 8-bit number format or even a 4-bit number format.
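The arithmetic behind that, as a quick sketch; these numbers are weights only and ignore the KV cache, activations, and runtime overhead.

```python
# Weights-only memory footprint of a 405B-parameter model at different precisions.
params = 405e9
for fmt, bytes_per_param in [("FP16", 2), ("FP8 / INT8", 1), ("FP4 / INT4", 0.5)]:
    gigabytes = params * bytes_per_param / 1e9
    print(f"{fmt:>11}: ~{gigabytes:,.0f} GB of weights")
# FP16 ~810 GB, 8-bit ~405 GB, 4-bit ~203 GB: halving the precision halves the
# bytes you stream through memory bandwidth on every decoding step.
```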
Now, the potential downside is a loss of quality because your number format now has a lower dynamic
range. So dynamic range is the sort of breadth of values that are able to be expressed.
And that's why we've found that the new FP8 format is very impactful here.
So rather than going to Int8, which is an 8-bit integer number, which has a pretty constrained dynamic range,
if you go to FP8, you get a great dynamic range because you're still doing floating point.
And this can help with keeping your model's perplexity gain under control
and keeping your output quality very high
during quantization, and allow you to quantize more than just the weights. And with the new Blackwell GPUs that are coming out on the sort of next-generation architecture, we're expecting really great results from FP4, which is a four-bit floating point quantization format. Currently, if
you want to go all the way down to four bits, which is popular, especially
for local inference, like running models on your MacBook or something, then you're
constrained to those integer formats, which do have that very, very limited dynamic range.
When you lose quality through quantization, how does that generally manifest itself?
So the best way to check for that is through something called perplexity gain.
So perplexity is a calculation
that you can do on a large language model
where rather than generating outputs,
you actually give it outputs.
And then you see how, for lack of a better word,
you see how surprised it is by those outputs.
You see the number of times where it says, hey, this is not the token that I would have generated here in this known good output.
And with your quantized model, you know, you want that to stay the same. Now, there's different ways of calculating perplexity.
So it's difficult to do absolute comparisons model to model, for example. But within the same model,
if you quantize it, you run your same perplexity check using the same algorithm and same data set
again, you generally want your perplexity to not really move at all. There's a small margin of
error, but you don't want to see a gain in perplexity where the model is saying like,
hey, before, this is the token I would have made and it's a good token, but now this is not, you know, what I would put out. That's how you see that, okay, maybe some of these model weights that you compressed, you got a little, oh, it's not compression, I shouldn't say compressed, but these model weights that you made a little less clear actually were very important and affecting the output. So you're watching the sort of measured quality through things like perplexity, or the observed quality through just,
you know, consistent customer use and seeing that the model output is satisfactory.
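A minimal sketch of that perplexity check with Hugging Face Transformers, assuming the original and quantized checkpoints can both be loaded the same way; the model names and the eval text are placeholders, and in practice you'd run a proper held-out dataset.

```python
# Compare perplexity of a base vs. quantized checkpoint on the same held-out text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, texts: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids.to(model.device)
            # Cross-entropy of the model on known-good text; lower = less "surprised".
            losses.append(model(ids, labels=ids).loss.item())
    return math.exp(sum(losses) / len(losses))

texts = ["...a held-out sample from your own domain..."]          # placeholder eval set
print("base:     ", perplexity("org/base-model", texts))          # placeholder model names
print("quantized:", perplexity("org/base-model-fp8", texts))      # should barely move
```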
What about speculative decoding?
Absolutely. So that's kind of like the next step. So once you have your inference engine in place,
once you have your quantization, and by the way, TensorRT-LLM
can actually quantize your model for you during the engine building process, which is pretty cool.
Then you start to look at all of these more exotic approaches to generating more tokens.
And this is where it gets really exciting, but it also gets even more dangerous to your output
quality. So speculative decoding, there's a few different versions of this. There's one where
you can do it with a draft model. There's also a variant that we've used a lot called Medusa,
where you kind of fine-tune your model to have something called Medusa heads, which are able to
create these tokens. But overall, the concept of speculative decoding is, if you run, say, Llama 3.2 3B and you ask it a question, it's going to give you a pretty reasonable response.
It's a highly capable model, even though it's super small.
But it's not going to be as good a response as, say, Llama 405B, which is well over a hundred times bigger.
But most of the tokens are going to be the same.
At least a lot of the tokens are going to be the same.
It might start with, hi, my name is.
A lot of questions are going to get answers that might have 50% the same tokens.
So what if you could get all those tokens from the super small, cheap model and only
turn to the big model when the small model gets it wrong? So the small model in the speculative
decoding setup generates these draft tokens, and then it checks them with the big model.
Kind of in the same way that it's a lot easier to tell if a Sudoku is done correctly, you can just
scan it really quick and count up all the numbers versus solving
it yourself, it's a lot easier to verify that a token is correct. And by easier, I mean like a lot
less computationally intensive to verify that a token is correct than it is to generate that token
from scratch. So as long as you have a really good verification process, then you're going to get the same output after implementing speculative decoding as you did before. With Medusa, instead of having a draft model, you sort of fine-tune your base model so that rather than generating one token at a time, it generates three or four tokens at a time.
One thing that's really interesting with that is you have to fine-tune it with some sort of domain awareness, because otherwise your model might perform just fine at, say, math but then fall short at software and humanities or something.
So all of these approaches have potential downsides.
For example, with speculative decoding, if your draft model gets every single token wrong,
well, now you've actually made it slower because you're running the draft model,
validating every token and regenerating every token from the larger model.
But with careful implementation, there's definitely a lot of potential for upside with these approaches.
You can see absolutely massive improvements to your tokens per second.
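A toy sketch of the draft-and-verify loop under greedy decoding; draft_next and target_next_batch are placeholder callables standing in for a small and a large model that share a tokenizer, and real implementations use probabilistic acceptance plus a single batched verification pass.

```python
# Greedy speculative decoding sketch: a small draft model proposes k tokens,
# the big target model checks them, and we keep the longest agreeing prefix.
def greedy_speculative_decode(draft_next, target_next_batch, tokens, k=4, max_new=64):
    tokens = list(tokens)
    generated = 0
    while generated < max_new:
        # 1. Draft k tokens cheaply with the small model.
        draft_tokens, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            draft_tokens.append(t)
            ctx.append(t)

        # 2. Verify: what would the target model emit after each draft prefix?
        prefixes = [tokens + draft_tokens[:i] for i in range(k)]
        target_tokens = target_next_batch(prefixes)

        # 3. Keep the longest prefix where draft and target agree, then append
        #    the target model's own token at the first mismatch.
        n = 0
        while n < k and draft_tokens[n] == target_tokens[n]:
            n += 1
        accepted = draft_tokens[:n] + ([target_tokens[n]] if n < k else [])
        tokens.extend(accepted)
        generated += len(accepted)
    return tokens

# Toy usage with stand-in "models" over integer tokens, just to exercise the loop.
draft = lambda ctx: (ctx[-1] + 1) % 7                             # small model counts mod 7
target = lambda prefixes: [(p[-1] + 1) % 10 for p in prefixes]    # big model counts mod 10
print(greedy_speculative_decode(draft, target, [0], k=4, max_new=8))
```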
How much do you think some of these techniques are here to stay versus a moment in time because we haven't developed essentially the chips that can handle the performance that we need?
There are a lot of custom hardware manufacturers who are trying to take the fight to NVIDIA, which is interesting.
And NVIDIA itself, of course, is making new and more powerful graphics cards all the time for this.
I do think, though, that within any given hardware or any given model,
there's still room for these techniques.
So the model and the hardware establish a sort of baseline performance.
And then from there, you can layer in all this stuff.
So I think I will see like as new models come out,
as new hardware comes out,
these techniques or, you know,
all of this is very new applied research.
So newer and more efficient techniques
are coming out all the time.
But you'll see these approaches
or approaches like them applied over and over again
as these new models and new hardware come out.
Because the thing is, it's a lot of work
to do this kind of performance optimization
and to do it well.
You're not going to always see it like day one
for every new piece of hardware and every new model.
But I do think that over time,
these benefits are really going to compound, honestly.
If you look at TensorRT-LLM, for example, if you run a model on an A100, which is the sort of last-generation Ampere big GPU, versus an H100, which is the current generation, you get a bigger speedup than the baseline hardware specs would imply, because TensorRT-LLM is taking advantage of the architectural features of the H100 GPUs.
And then you get that benefit.
That's multiplied by the benefit you get from quantizing.
That's multiplied by the benefit you get from speculative decoding.
So as newer, more efficient models come out, as newer, more powerful GPUs come out,
you're going to see even more benefit
from using these approaches rather than less.
Right.
So, I mean, I think like most people,
you know, running a model,
you know, building some sort of AI-powered application
through RAG or something,
they probably don't necessarily need to be looking at the GPU level.
But if you are really trying to squeeze every bit of performance possible
out of the inference step, you probably do need to look at the hardware level.
So you mentioned a couple of different chips.
There's a lot of different GPUs, just like there's different CPUs.
How do those impact the performance on inference? And are there certain GPUs that make more sense for certain models, or does that matter at all? Absolutely, it makes a huge impact. And like you said, you know, when you're just starting out, it probably doesn't make sense to look at this GPU level. But it's really a question of where you want your competitive advantage to come from.
And if you're committed to a certain model or a certain inference pattern
for a long enough period of time to see the benefits,
that's where these optimizations can be really useful.
In terms of the exact GPUs and their various advantages and disadvantages,
so the current generation is Hopper, and that architecture currently is available on the H100 GPU.
There's also the H200 GPU, which is even bigger.
There's also a GH200, which is a sort of hybrid between the two.
And this is really great for running the biggest and latest models,
especially because the Hopper and Lovelace,
which is the previous architecture,
are the ones with support for FP8,
which is that really effective quantization format.
So within that, though,
the smallest memory that's available is 80 gigabytes, which is just complete overkill for a huge class of models, including a lot of image models, which tend to be 12 or fewer billion parameters, as well as, of course, audio transcription models.
You know, the largest Whisper model is only a couple billion, if that.
So you also have the L4 GPU.
That's the previous generation Lovelace,
which is, you know, it's a 24 gigabyte GPU.
It has much lower memory bandwidth.
So it's not great for large language models in many cases,
but it's great for some of these image models,
some of these audio models,
some of these speech synthesis models.
You have, you know, the previous generation Ampere cards, which have an A10, you have an A100.
Both are great lower cost alternatives to the L4, the H100.
Sorry, no, A10s cost more than L4s.
But A10s have higher memory bandwidth, so they're good for small language models.
You have got like older GPUs like T4s, which are great, you know, for just low cost workloads of small models.
And what we see often is people kind of graduating from one GPU to the next.
You know, maybe you've trained a small model that does some specific modality, some new thing, and it runs just fine
on the T4. Well, eventually, you know, rather than getting your 30th T4s as traffic is picking up,
you can start to get better performance and better cost characteristics by consolidating to A10s or
L4s. So, you know, it's not just about the model. Obviously, the model matters a lot because your GPU needs to be big enough to hold your model and your KV cache and your batch inference.
But, you know, we also see people kind of upgrade over time.
And one thing we've done to facilitate that is with H100 GPUs, you can actually split them up through a process called MIG. And no, MIG, unfortunately,
is not a cool fighter plane.
It stands for Multi-Instance GPU.
With that, you can
basically cut an H100
in half, and each half
has 40 gigabytes of
VRAM. It's got like three-sevenths
of the compute. It's got half of the
VRAM bandwidth.
And these MIG'd H100s offer really great performance characteristics for these smaller models. Generally, anything under like 12 to 15 billion parameters can fit on one. They offer you that FP8 access, and then they also have really good cost characteristics
because you're only paying for half a GPU.
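A rough way to sanity-check which of the cards mentioned above a given model fits on, counting only the weights plus a fudge factor for KV cache and activations; the 1.2x overhead and the memory figures are ballpark assumptions, not a sizing rule.

```python
# Will these weights fit? Weights-only memory times a rough overhead factor,
# compared against the GPU memory sizes discussed in this conversation.
GPUS_GB = {"T4": 16, "A10": 24, "L4": 24, "H100 MIG half": 40, "A100/H100 80GB": 80}

def fits(params_billions: float, bytes_per_param: float, overhead: float = 1.2):
    need_gb = params_billions * bytes_per_param * overhead    # params in billions -> GB
    return need_gb, [name for name, gb in GPUS_GB.items() if gb >= need_gb]

for params_b, fmt, bpp in [(8, "FP16", 2), (8, "FP8", 1), (70, "FP8", 1)]:
    need, options = fits(params_b, bpp)
    print(f"{params_b}B @ {fmt}: ~{need:.0f} GB -> fits on: {options or 'multi-GPU needed'}")
```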
Yeah.
So I think we've covered a lot on performance
from essentially choosing a model
to choosing essentially an inference engine
to really tuning the inference cycle through quantization
and some of these other techniques to picking a GPU.
Is there anything else on inference optimization
you think that we should touch on?
Yeah, I mean, there's really the whole network layer.
There's the whole infrastructure layer that's very important.
If you do all of this work to squeeze every second of time to first token out of
your model by optimizing pre-fill, and then you take your GPU and you stick it in, you
know, US East 1 in Virginia, and your user is in Australia, then all that model performance
work is right out of the window because of the network latency.
So there's really like two completely separate domains that have to be considered together in
highly optimal model serving infrastructure because it's both this
model performance layer and it's also the infrastructure performance layer.
So obviously there's locating your GPUs reasonably near
to your user. That's important.
With compound AI, it's minimizing the network overhead in the multiple hops that you're making from model to model.
So, yeah, it's definitely that networking and infrastructure component where you have to consider, for example, cold starts. If you're getting a burst of traffic, you need time to scale up those GPUs and load in
these very large images and these very large sets of model weights. So being good at that part of it
is absolutely essential for doing this in production, because otherwise your, you know, really flashy performance metrics are not going to be what your user actually experiences. Yeah, absolutely. That's a really good call-out. Like, at the end of the day, unless you're running this, you know, essentially in a closed system on an edge device, most likely you're making some sort of network call.
And then, you know, if that call is, you know, 900 milliseconds, it doesn't really matter what the performance is like on the actual inference piece. Exactly. We've even seen stuff as simple as, you know, switching from bare calls with the requests library in Python to doing connection pooling over HTTP, saving, you know, that extra 10 or 20 milliseconds and making the difference between hitting the latency SLA and not hitting it. So, you know, there's a whole lot of work to be done at the infrastructure and the networking level to actually deliver this excellent performance to the end user.
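A concrete version of that requests-versus-pooling point: a Session reuses the underlying TCP/TLS connection instead of re-handshaking on every call. The URL and payloads are placeholders.

```python
# Connection pooling with requests.Session: reuse one TCP/TLS connection
# across calls instead of opening a new one per request.
import requests

URL = "https://model-endpoint.example.com/predict"   # placeholder endpoint

# Without pooling: each call pays connection setup (DNS, TCP, TLS handshake).
for payload in [{"prompt": "hi"}, {"prompt": "hello"}]:
    requests.post(URL, json=payload, timeout=30)

# With pooling: the Session keeps the connection alive between calls,
# often shaving tens of milliseconds per request.
with requests.Session() as session:
    for payload in [{"prompt": "hi"}, {"prompt": "hello"}]:
        session.post(URL, json=payload, timeout=30)
```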
Also, like, how much does the programming language choice factor in as well?
A lot of people's go-to when you're talking about AI and machine learning is Python,
but Python's notoriously not the most performant language in the world.
Is that something people need to also be considering?
In some cases, that's where the inference engine can help a lot.
For example, TensorRT-LLM is a C++ model server, basically. Well, okay, so there's TensorRT-LLM,
and there's also Triton, which is the Triton Inference Server. Those two generally go together. And we
actually built our own variation of the Triton Inference Server that's even more lightweight, but that's also built in C++.
So, yeah, there's a lot of C++ under the hood in these kind of stacks, and that's going to help a lot with the performance.
And, of course, the GPU inference part, that's going to be a problem for CUDA to handle.
And that's where, again, TensorRT-LLM helps out.
So, yeah, it's, you know, speaking as someone
who is really only comfortable programming in Python and JavaScript,
I'm able to, you know, leverage a lot of these tools
that people much smarter than I have put together
and stand on the shoulders of giants
and be able to, you know, write my little Python code
and not really worry about performance too much.
Awesome.
All right, well, let's switch gears a little bit
and go quick fire.
So if you could master one skill
that you don't have right now, what would it be?
So I would really love to be better at math,
which I realized I had a great opportunity to do in college
and totally squandered getting a lot of Bs in math
classes. But I was working on this transformer inference blog post a year ago with someone at
Baseten. He's an engineer there. And he was sort of explaining to me the actual math behind
figuring out like, how do we prove that this inference is bound by GPU memory bandwidth
rather than something else.
And just looking at these papers
and all of the Q, K, and V and stuff,
I like it when math has numbers in it, not letters.
So it would be really cool if I could be a lot better
at some of this math behind the inference.
But I'm learning it slowly but surely.
Yeah, never too late. What wastes the most time in the day?
Definitely uncertainty. You know, working at a startup, there's always a lot of stuff going on.
It's like, oh, what should I work on? What should I do? Is this right? Is this good enough? Should I
spend time on this or that? You know, uncertainty and just not pulling the trigger and making a
decision is where I probably waste the most time.
If you could invest in one company that's not the company you work for, who would it be?
There's this great company called Pylon.
They do customer support.
We're customers of them.
I'm friends with the founders.
They were actually in Chicago the other day. And I tried to tell them, like, hey, I'll trade you, you know, a couple hundred of my shares for a couple hundred of your shares,
and they didn't go for it.
But they're a great company.
They're working super hard.
They have a lot of customers, including us, who really love them.
And they're taking on some incumbents like Zendesk that I think are ripe for disruption.
So, yeah, big fan of Pylon.
Cool.
What tool or technology could you not live without?
Honestly, that sounds super simple.
It's Markdown support in Google Docs.
I was so happy when that first came out
because I use Google Docs for basically everything I do
just because everyone has it and it's super easy to work with.
And just not having to train the Markdown shortcuts out of my fingers to use it has been really nice. Which person influenced you the most in your career? I want to shout out, you know, a couple of people. There's this guy, Lee Robinson, who's also from Iowa. I was raised in Des Moines, Iowa. So I often say that I want Baseten to become Vercel
and I want Philip to become Lee.
So that's a pretty simple way of me expressing my career goals.
But I've also been influenced a lot by Patrick McKenzie,
who goes by Patio11 on a lot of internet spaces.
I've really appreciated his writing on startups and entrepreneurship and found a lot of wisdom there. So yeah, those are some from a very long list of influences.
of influences.
Yeah.
Good pick.
So,
and then finally five years from now,
will there be more people writing code or less?
Never bet against coders.
Yeah.
I mean,
I always,
I always feel like,
um,
you know, that, an engineer's job is to solve problems.
That's why you get paid the money that you get paid. So at the end of the day, that's not necessarily about sort of hands on keyboard.
It's about essentially solving problems.
So maybe the manifestation or the evolution of that is you're doing a lot more
prompting to generate code that you're sort of working with,
but you're still,
you know,
essentially solving problems and putting these things together.
Absolutely. I mean, with my job that spans all these different departments and functions, there's definitely days where I don't write code, but I still feel like I'm a software engineer. Even when I'm solving these other problems, just because it's more about
the mindset and the approach you have to solve the problem rather than the exact problem you're
solving. Yeah, I don't think the engineering mindset doesn't go away.
All right. Well, Philip, thanks so much for being here. I enjoyed this a lot.
Cheers. Yeah, thank you so much for having me. This was a really fun time.