Latent Space: The AI Engineer Podcast - Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang)
Episode Date: January 19, 2025
Sponsorships and applications for the AI Engineer Summit in NYC are live! (Speaker CFPs have closed) If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you.
Right after Christmas, the Chinese Whale Bros ended 2024 by dropping the last big model launch of the year: DeepSeek v3. Right now on LM Arena, DeepSeek v3 has a score of 1319, right under the full o1 model, Gemini 2, and 4o latest. This makes it the best open weights model in the world in January 2025.
There has been a big recent trend in Chinese labs releasing very large open weights models, with Tencent releasing Hunyuan-Large in November and Hailuo releasing MiniMax-Text this week, both over 400B in size. However, these extra-large language models are very difficult to serve.
Baseten was the first of the Inference neocloud startups to get DeepSeek V3 online, because of their H200 clusters, their close collaboration with the DeepSeek team and early support of SGLang, a relatively new vLLM alternative that is also used at frontier labs like xAI. Each H200 has 141 GB of VRAM with 4.8 TB per second of bandwidth, meaning that you can use 8 H200s in a node to inference DeepSeek v3 in FP8, taking into account KV cache needs.
We have been close to Baseten since Sarah Guo introduced Amir Haghighat to swyx, and they supported the very first Latent Space Demo Day in San Francisco, which was effectively the trial run for swyx and Alessio to work together! Since then, Philip Kiely also led a well attended workshop on TensorRT LLM at the 2024 World's Fair.
We worked with him to get two of their best representatives, Amir and Lead Model Performance Engineer Yineng Zhang, to discuss DeepSeek, SGLang, and everything they have learned running Mission Critical Inference workloads at scale for some of the largest AI products in the world.
The Three Pillars of Mission Critical Inference
We initially planned to focus the conversation on SGLang, but Amir and Yineng were quick to correct us that the choice of inference framework is only the simplest, first choice of 3 things you need for production inference at scale:
“I think it takes three things, and each of them individually is necessary but not sufficient:
* Performance at the model level: how fast are you running this one model running on a single GPU, let's say. The framework that you use there can matter. The techniques that you use there can matter. The MLA technique, for example, that Yineng mentioned, or the CUDA kernels that are being used. But there's also techniques being used at a higher level, things like speculative decoding with draft models or with Medusa heads. And these are implemented in the different frameworks, or you can even implement it yourself, but they're not necessarily tied to a single framework. But using speculative decoding gets you massive upside when it comes to being able to handle high throughput. But that's not enough. Invariably, that one model running on a single GPU, let's say, is going to get too much traffic that it cannot handle.
* Horizontal scaling at the cluster/region level: And at that point, you need to horizontally scale it. That's not an ML problem. That's not a PyTorch problem. That's an infrastructure problem. How quickly do you go from a single replica of that model to 5, to 10, to 100? And so that's the second pillar that is necessary for running these mission critical inference workloads. And what does it take to do that? Some people are like, Oh, you just need Kubernetes and Kubernetes has an autoscaler and that just works. That doesn't work for these kinds of mission critical inference workloads. And you end up catching yourself wanting to rebuild those infrastructure pieces from scratch, bit by bit. This has been our experience. And then going even a layer beyond that, Kubernetes runs in a single cluster. It's a single cluster tied to a single region. And when it comes to inference workloads and needing GPUs, more and more, you know, we're seeing that you cannot meet the demand inside of a single region, a single cloud's single region. In other words, a single model might want to horizontally scale up to 200 replicas, each of which is, let's say, 2 H100s or 4 H100s or even a full node, and you run into limits of the capacity inside of that one region. And what we had to build to get around that was the ability to have a single model have replicas across different regions. So, you know, there are models on Baseten today that have 50 replicas in GCP East and 80 replicas in AWS West and Oracle in London, etc.
* Developer experience for Compound AI Systems: The final one is wrapping the power of the first two pillars in a very good developer experience to be able to afford certain workflows like the ones that I mentioned, around multi step, multi model inference workloads, because more and more we're seeing that the market is moving towards those, that the needs are generally in these sort of more complex workflows.”
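The first pillar mentions speculative decoding with draft models. A minimal sketch of the greedy variant follows, using toy stand-in functions instead of real models (`target_next`, `draft_next`, and every constant are hypothetical). It illustrates why the accepted tokens are exactly what the large model would have produced on its own, while the large model is consulted fewer times than there are output tokens:

```python
def target_next(ctx):
    """Toy stand-in for the large model: deterministic next token."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx):
    """Toy stand-in for the small draft model: agrees with the target most of the time."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(prompt, n):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n, k=4):
    """Draft proposes k tokens; the target keeps the matching prefix and fixes the first miss."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n:
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # In a real system the checks below are one batched forward pass of the target.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch with the target's token
                break
        else:
            accepted.append(target_next(out + accepted))  # all matched: take one extra token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], target_passes


prompt = [3, 1, 4]
tokens, passes = speculative_decode(prompt, 20)
print(tokens == greedy_decode(prompt, 20), passes)
```

Every token that survives verification is one the target itself would have chosen, so the output cannot drift from plain greedy decoding. Sampling-based variants need a more careful acceptance rule that this sketch does not cover.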
We think they said it very well.
Show Notes
* Amir Haghighat, Co-Founder, Baseten
* Yineng Zhang, Lead Software Engineer, Model Performance, Baseten
Full YouTube Episode
Please like and subscribe!
Timestamps
* 00:00 Introduction and Latest AI Model Launch
* 00:11 DeepSeek v3: Specifications and Achievements
* 03:10 Latent Space Podcast: Special Guests Introduction
* 04:12 DeepSeek v3: Technical Insights
* 11:14 Quantization and Model Performance
* 16:19 MOE Models: Trends and Challenges
* 18:53 Baseten's Inference Service and Pricing
* 31:13 Optimization for DeepSeek
* 31:45 Three Pillars of Mission Critical Inference Workloads
* 32:39 Scaling Beyond Single GPU
* 33:09 Challenges with Kubernetes and Infrastructure
* 33:40 Multi-Region Scaling Solutions
* 35:34 SGLang: A New Framework
* 38:52 Key Techniques Behind SGLang
* 48:27 Speculative Decoding and Performance
* 49:54 Future of Fine-Tuning and RLHF
* 01:00:28 Baseten's V3 and Industry Trends
Baseten’s previous TensorRT LLM workshop:
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Welcome back. Right after Christmas, the Chinese whale bros ended 2024 by dropping the last big model launch of the year.
DeepSeek V3. This is a massive 671 billion parameter, fine-grained MoE model with 256 experts,
trained with native FP8 mixed precision training, multi-head latent attention from DeepSeek V2,
a new multi-token prediction objective, and 15 trillion tokens of data, including synthetic
reasoning data distilled from DeepSeek R1.
Right now, on the LM Arena Leaderboard, DeepSeek V3 is rated the seventh best model in the
world with a score of 1319, right under the full o1 model, Gemini 2, and 4o latest, and above
o1-mini, Grok 2, Gemini 1.5 Pro and Claude 3.5 Sonnet.
This makes it the best open weights model in the world in January 2025.
There has been a big recent trend in Chinese labs,
releasing very large open weights models,
with Tencent releasing Hunyuan-Large in November,
and Hailuo releasing MiniMax-Text this January,
both over 400B in size.
However, these extra large language models are very difficult to serve.
Baseten was the first of the inference neocloud startups
to get DeepSeek V3 online,
because of their H200 clusters,
their close collaboration with the DeepSeek team
and early support of SGLang,
a new vLLM alternative out of UC Berkeley
that is also used at frontier labs like xAI.
Each H200 has 141GB of VRAM
with 4.8 terabytes per second of bandwidth,
meaning that you can use 8 H200s in a node
to inference DeepSeek V3 in FP8,
taking into account KV cache needs.
We have been close to Baseten since Sarah Guo introduced Amir Haghighat to swyx,
and they supported the very first Latent Space Demo Day in San Francisco,
which was effectively the trial run for the podcast you're listening to right now.
Since then, Philip Kiely has also led a well-attended workshop on TensorRT-LLM
at the 2024 World's Fair.
We worked with him to get two of their best representatives,
Amir and lead model performance engineer Yineng Zhang,
to discuss DeepSeek, SGLang and everything they have learned
running mission-critical inference workloads at scale
for some of the largest AI products in the world.
Spoiler.
Amir thinks there are three pillars of mission-critical inference workloads,
and we spend quite some time discussing what you need for each of them.
In other news, invites are now rolling out for the second AI Engineer Summit
in New York City from February 20th to 22nd.
We are bringing back the surprisingly successful AI leadership track from World's Fair
and the AI Engineering track is now wholly focused on agents at work.
If you are building agents in 2025, this is the single best conference of the year.
We are curating all attendees and will sell out after we announce speakers this coming week
from DeepMind, Anthropic, OpenAI, Meta, Jane Street, Bloomberg, BlackRock,
LinkedIn, and more.
Look for more sponsor and attendee information at apply.ai.engineer and see you there.
Watch out and take care.
Hey, everyone. Welcome back to the Latent Space podcast, our first recording of 2025.
I'm Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host,
swyx, founder of Smol AI.
Hey, and today we are here with a special
double-guest episode with Amir, oh my God, I don't know your last... Haghighat?
That's close enough.
That is good.
That's good.
That's really good.
And Yineng Zhang from Baseten.
Welcome.
Thank you.
Amir, we've met before.
You're co-founder of Baseten, which is, you know, one of the leading sort of LLM inference
platforms.
I don't know how you, what do you consider yourself?
That sounds fine.
And Yineng, you are lead software engineer on the model performance team.
And you guys recently shipped DeepSeek V3 as one of the many models that you do host.
You also are very involved in SGLang.
And that was actually one of the reasons we were discussing an episode with you,
even before DeepSeek V3 dropped as a Christmas present to everybody.
So we can take this a number of directions,
but I think one thing we wanted to just get off the bat on was to start with DeepSeek
and then we'll work our way backwards back to SGLang.
But DeepSeek is kind of more recent.
Why are people so interested?
What's the history of like, I guess, like DeepSeek in general from your perspective?
Yeah, because DeepSeek V3, I think, is currently considered the leading open source
LLM based on the benchmark results and the Chatbot Arena results.
And it's so big, you know, it's a 671 billion parameter MoE.
And I think it's a game changer for open source AI.
So everyone is interested in this model.
Yeah.
One of the interesting things is like they are bootstrapped, like, you know, a very private small lab.
Like they have a lot fewer resources than others.
But it's also interesting that it's just open weights.
Like for some reason, the Chinese labs are much better than the American labs at sharing open weights.
And that's obviously beneficial for Baseten.
It is in your incentive to serve these models at all times.
Like, you know, like what are sort of the unique challenges that you face, like, offering something this large?
Yeah, I think because the model is very large.
And if we use something like the H100, we cannot serve this model.
Because, you know, even if we use eight H100 cards, that is 640 gigabytes of memory.
And the DeepSeek V3 model has 671 billion weights.
So even if you use FP8 precision,
you need, I think, 671 gigabytes for the weights.
And you also need extra memory for the KV cache.
So it's not possible to run that on a single H100 node.
That's why we chose H200 to run that model, or you use multi-node to run that model.
Yeah, it's very challenging.
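The memory argument above can be checked with back-of-the-envelope arithmetic. This is only a rough sketch: it counts weight bytes and ignores activations, runtime overhead, and the exact KV cache size, and the function names are made up for illustration. The 80 GB H100 and 141 GB H200 figures match the numbers quoted in the episode.

```python
def weights_gb(params_billion, bytes_per_param):
    """Weight footprint in GB: one billion parameters at 1 byte each is 1 GB."""
    return params_billion * bytes_per_param


def node_headroom_gb(gpu_mem_gb, gpus_per_node, params_billion, bytes_per_param):
    """Memory left on one node for KV cache and everything else after loading the weights."""
    return gpu_mem_gb * gpus_per_node - weights_gb(params_billion, bytes_per_param)


PARAMS_B = 671  # DeepSeek V3 parameter count, in billions

print("FP8 weights (GB): ", weights_gb(PARAMS_B, 1))
print("BF16 weights (GB):", weights_gb(PARAMS_B, 2))
print("8x H100 headroom in FP8 (GB):", node_headroom_gb(80, 8, PARAMS_B, 1))
print("8x H200 headroom in FP8 (GB):", node_headroom_gb(141, 8, PARAMS_B, 1))
```

A negative headroom means the weights alone do not fit on the node, which is the situation described for eight H100s.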
And I think another one is that DeepSeek V3's weights were released in FP8 precision.
So if you want to run it, you should support that kernel.
Because I think the default is BF16, which is even larger.
But if you want to run the FP8, you need to support the quantization.
So I think currently even TensorRT-LLM doesn't support the FP8.
So if we want to implement that feature, we should do some feature development for that.
The last challenging part is that, yeah, if you want to do some debugging or some performance benchmarking, it's very hard.
Why?
Because the model is so large and the loading time is so long.
Yeah.
So it makes it more complicated for the developer to do some debug.
Is it complicated or just slow?
I mean, you've just only mentioned loading time, but like...
Yeah, loading time is slow.
Yeah, you are right.
It's not complicated.
just go for more coffee.
Okay, okay.
Okay.
Can you maybe just give people a quick rundown of all the models you support on
Baseten and how it compares just on size? Just so, you know, people hear
671 gigabytes, but like, is that a lot more than other models?
And then you mentioned BF16, FP8.
What's kind of like the usual that you see?
And do you see any variation based on model size or anything like that?
I think at Baseten, something like Llama 70B
is more common.
I think Llama 3 released
the 405 billion weights,
but I think there are just a few users that use that.
So before DeepSeek V3,
I think we hadn't encountered that issue
of such large weights.
I think DeepSeek V3 is the first
model so big that we should use H200
or use H100 multi-node.
And was that because of
performance, or why did people not use the 405B Llama?
I think the, you know, what I hear from people is like, you know,
the performance gains of like the 405B at inference time are like not worth it, you know,
so the 70B is kind of the sweet spot.
Who are the people that use V3?
Is it people that were using maybe the Llama 70B model and just want better performance?
Is it people that are just experimenting?
I think that's kind of the question that people always have.
There's always a lot of excitement around open source models.
But then maybe the question is like, what are they,
really good for?
I can answer this.
Just observationally, the interest that we have seen, and some of this is running in
production, some of it is just at the interest level, is generally not coming from folks
who are trying to upgrade from a certain open source model to DeepSeek V3.
We're seeing it generally speaking from folks who are coming from Claude and are doing so
either because, and here I'm going to give you a list of reasons, and generally the reasons
are a combination, a certain combination of these.
In no particular order, either it is they're being rate limited or the price is too high,
or they have certain latency requirements or time to first token requirements for their use case
that Claude cannot hit.
Or they want to have full control over the model as opposed to running it behind an API
where the model underneath them can potentially change.
and a couple of other reasons, but generally it's a combination of those.
You mentioned the speed and some of these things.
Do customers want to change the hardware?
Also, like, you're referring to H200 as kind of like the default thing.
Like, do people come to you and it's like, yeah, I'd rather use a smaller system and can
have worse performance or how do you work with customers on that?
Generally, people come with certain requirements around the latency, the throughput, the cost.
Generally they're not coming in saying, I want this particular GPU SKU.
At least like as we go upmarket and we're talking to, you know, foundation model companies, for instance,
the things that are top of mind for them are those requirements, not a particular GPU SKU.
We're doing the different GPU SKUs not because we want to, you know, offer, oh, look, we have H200s, look at us.
We have MIG'd H100s, look at us.
It's not that.
That is really because those are the tools to achieve a certain kind of time to first token for
certain types of models or certain kind of throughput and scale or a certain kind of price per
million tokens or what have you, per million images depending on the modality.
And that's the reason that we're talking GPUs.
I wanted to pick up a little bit on this FP8 thing.
It seems like, you know, I think Noam Shazeer started talking about sort of training natively quantized.
And I think that's what DeepSeek seems to have done, at least as they said in their paper.
Is this a trend?
Like, is the community settling on one form, one sort of numeric format that everyone knows about?
Tell us more about what you're seeing here in terms of like the training trends, those sort of model trends and the, I guess...
So I think a lot of companies as well, like Together, they'll also release like quantized versions of the Llama models, right?
for like turbo or light inference,
you know, like just based on different levels of speed.
Like, do you do anything there in terms of quantizing the models that you serve?
I'll let Yineng cover the sort of patterns around, you know,
using FP8 in training.
But I want to draw one distinction that I think gets to the latter part of your question,
Shawn, which is that unlike companies like, you know,
you mentioned Together or Fireworks or Replicate,
Baseten doesn't provide a shared inference endpoint for the
popular open source models.
That is a product that we don't have on purpose.
Those work really well.
The shared open source,
shared inference endpoints for open source models work really well.
For situations where the user is saying,
hey, let me just call a certain popular model behind an API,
pay by the token.
That is not our average customer or the median customer.
Our customers generally have their own custom models,
very custom workflows,
strict requirements around latency and time to first token and can't deal with noisy neighbor
problems that like, oh, the API is slow because some other customer has been calling it a lot.
Other requirements around infrastructure flexibility, around regions for latency reasons,
for compliance reasons.
That's the side of the inference market that we capture.
And at Baseten, when you deploy a model, whether it's your own custom weights or an open source
model, you get dedicated inference, dedicated resources. And so when it comes to
the quantization question, where it matters is that we would never quantize the model behind,
you know, the user's back and say, look at us. There's a faster Llama 70B that has been,
you know, somehow quantized to be faster and cheaper. Our customers are coming to us with those
requirements that I mentioned. But in particular, when it comes to model quality, they have strict
requirements. They would not be okay with us touching the weights, if you will. We have done
things like speculative decoding in a couple of different ways, but all of those
methods guarantee the output is unchanged, as opposed to quantization. So when it comes to
quantization, we have built tooling that allows our users to quantize their models. For the ones
that we're working with more hands-on through our forward deployed engineering team, we're working with
them on evals as well to ensure that the quantized models are meeting
their requirements. However, this is all very much in conjunction with the engineers at our
customers, as opposed to us doing it behind the scenes. Yeah, FP8 training is very interesting.
And I think the DeepSeek team is the first one to use FP8 training for such a large model.
I think before that, maybe Lingyi, sorry, 01.AI, said they used it for Yi-Lightning. They used
FP8 training, and the others, I think most of them use BF16 training. And yeah, it's
a game changer.
And for us, the FP8 kernel has to be implemented for inference, and it uses blockwise
FP8.
And currently, even if you use something like CUDA cuBLAS, you cannot support that.
So you should use something like Triton to implement the kernel, or you should use something
like CUTLASS to implement that kernel.
So I think that's the challenging part.
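To make "blockwise FP8" concrete, here is a minimal pure-Python sketch of the idea: the matrix is cut into square blocks and each block gets its own scale, so a kernel has to apply a different scale per block instead of one per tensor. Several things here are assumptions made for illustration: the tiny block size, the helper names, and especially `fake_fp8`, which only imitates FP8's 3-bit mantissa rounding and ignores exponent range and subnormals. The value 448 is the largest finite number in the e4m3 format.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3


def fake_fp8(x):
    """Crude FP8 stand-in: keep 1 implicit + 3 explicit mantissa bits."""
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def quantize_blockwise(w, block):
    """Return the quantized matrix and one scale per (block x block) tile."""
    rows, cols = len(w), len(w[0])
    q = [[0.0] * cols for _ in range(rows)]
    scales = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            amax = max(abs(w[r][c]) for r, c in cells)
            scale = max(amax, 1e-12) / FP8_E4M3_MAX  # map the tile's largest value to 448
            scales[(r0 // block, c0 // block)] = scale
            for r, c in cells:
                q[r][c] = fake_fp8(w[r][c] / scale)
    return q, scales


def dequantize_blockwise(q, scales, block):
    """Multiply every element by the scale of the tile it belongs to."""
    return [[q[r][c] * scales[(r // block, c // block)] for c in range(len(q[0]))]
            for r in range(len(q))]


w = [[0.01, -0.02, 5.0, 7.5],
     [0.03, 0.015, -6.0, 2.5],
     [100.0, -250.0, 0.5, 0.25],
     [300.0, 75.0, -0.125, 0.75]]
q, scales = quantize_blockwise(w, block=2)
w_hat = dequantize_blockwise(q, scales, block=2)
print(len(scales), "scales for", len(w) * len(w[0]), "weights")
```

A production kernel fuses the per-block rescaling into the matrix multiply itself, which is the part that needs custom Triton or CUTLASS code as described above.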
My theory is that this will pick up in terms of the models that people release.
Like increasingly, it will not be a BF16.
There was a bit of quantization.
I was trying to look for the paper while you were speaking, but I couldn't find it.
But there's sort of this ablation of quantizations paper that was out last year
that showed that there is benefits to quantizing and sort of natively training all the way
until like six-bit, and then even smaller than that,
maybe going too far.
Yeah, I'm not sure if you know what people are talking about,
but there's an interesting trend for sure.
Yeah, I think even though they used FP8 quantization,
the benchmark results are very good,
on something like GSM8K.
The score is nearly 94.6.
It's so high, you know.
I think it's higher than every other open-source LLM,
even the Llama 405 billion.
So I'm going to move on a little bit in terms of like one other notable detail and then,
you know, we don't have to speak too much about DeepSeek because obviously we don't know
that much unless we work on the team.
But like, you know, another trend that they have is the fine-grained MoE.
I think that there is this question about whether or not MoEs will be more of a thing.
Basically, like this time last year, Mixtral was sort of kicking off a bit of an MoE trend with,
you know, 8x7B, 8x22B.
And then the rest of the year, no MoEs basically.
So is this like discovery of fine-grained MoEs going to be a relevant trend for this year?
Yeah, I think so.
Because as far as I know, some companies such as Baidu or ByteDance,
their internal dominant LLMs, they use the MoE architecture.
And their way, I think, is similar to the DeepSeek MoE model.
So I think after this new year, MoE inference optimization will be very essential and, yeah, so important.
At the same time, like, why haven't the big labs done it, right?
Like, I think, well, Llama 405B is dense.
I think Grok is also a dense model.
You can correct me if I'm wrong.
And, yeah, it's just a weird counter-trend.
I think this time last year, I was kind of writing my recap and I was like, all right, MoEs, like, seem like they're
going to be trending and then they did not trend. Anyway, so it's just a note that I would
flag out there. But I generally agree. Like, it seems like fine-grained MoE is working out
and I would definitely want to see more people adopting it. I went to Jeff Dean's
session at NeurIPS. And he also mentioned as well that I think one of the Gemini models, I think
Gemini [unclear], is an MoE, which I don't think we knew before that. Yeah. Yeah. So the reason why Llama
didn't open source an MoE model is, I think, that they tried to train
an MoE model, but they failed.
So that's why they didn't open-source an MoE model for the Llama series.
What are the causes of failure?
Like, why do MoEs fail?
Like, I think this is another thing that people are talking about, right?
Like, the failures of 3.5 Opus, the failures of GPT-5.
Like, you know, it's a thing that people are sort of rumoring.
Yeah, because if you want to train a model, for the training staff, you need some benchmark
or some, yeah, score.
But for the MoE model, the benchmark score
is even lower than the dense model.
So in that case,
they think the MoE model is worse than the dense model.
So they didn't release that MoE model.
Okay.
Well, then one more thing, I guess,
maybe more commercially relevant.
DeepSeek's API pricing is very competitive.
How do you decide pricing in this kind of landscape with open models?
Yeah, so it goes back to the use cases that we serve.
And again, going back to the fact that we don't have shared
inference endpoints for the different models. And so our pricing is never per token. Customers,
like I said, generally come with their own custom models or open source models, but with strict
requirements around certain things, latency and time to first token requirements or security
and compliance requirements or a particular scale that they're looking for without running into
noisy neighbor problems, things like that. And the way that we price has always been based on
consumption of resources. And that takes one of two shapes. One is the shape where
things are running inside of our infrastructure. And we're running, by the way, on top of multiple
different public clouds, many different regions within those. Then we charge them based on the
hardware resources that they're using. The second shape that it takes is that our customer
brings their own cloud. And we're seeing this more and more where a customer has big committed
resources inside of their AWS VPC or GCP or what have you.
And in that world, we also have a consumption model.
Of course, the price is very different because they're using their own resources, but we are
managing those resources for them.
An example of that that, you know, we're seeing more recently is the fact that we've had
to build multi-cloud capabilities for us so that we can have a single model horizontally
replicate across different regions and even different clouds.
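As a rough illustration of what "a single model horizontally replicated across regions and clouds" implies for a scheduler, the sketch below greedily spreads a replica target over regions with limited capacity. The function name, region labels, and capacity numbers are all hypothetical, and a real placement system would also weigh latency, cost, and compliance, none of which is modeled here.

```python
def place_replicas(wanted, capacity_by_region):
    """Fill regions in the given order until demand is met; report any shortfall."""
    placement = {}
    remaining = wanted
    for region, capacity in capacity_by_region.items():
        take = min(remaining, capacity)
        if take:
            placement[region] = take
        remaining -= take
    return placement, remaining  # remaining > 0 means no region had room for the rest


# Hypothetical free capacity, counted in replicas, per cloud region.
capacity = {"gcp-east": 50, "aws-west": 80, "oracle-london": 40}

placement, unmet = place_replicas(200, capacity)
print(placement, "unmet:", unmet)
```

With a 200-replica target, no single region in this toy capacity table is enough on its own, which is the limit a single-region cluster runs into.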
And more and more as we go up market, our customers have their own cloud commits.
They are also multi-cloud in order to be able to get good prices and good capacity.
And it's unreasonable to expect every one of them to build the same multi-cloud capabilities
that we have built.
And so then they take advantage of what we have built and use all of the different cloud
resources that they have as one holistic unit,
and have their models at inference time horizontally scale across those and even optionally overflow
to our cloud when they start running out of committed resources. All of that has a consumption
pricing model to it. Can we talk about what it takes to actually run your service? So we had episodes
with, you know, Replicate, Modal, Fireworks. We always like to ask this question. Obviously,
since you're not the model maker, all the secret sauce is in how you actually run the model. I know you
also have Truss, which is your more developer-led SDK.
Can you maybe quickly run people through, how do you go from taking the DeepSeek V3
weights to actually running it, what goes on behind the scenes?
And then we can talk about SGLang in depth a little more.
Yeah, totally.
So we have, like you said, we have Truss, which is our open source model packaging and
deployment library.
Truss works with different frameworks underneath it.
It has very native and deep support for
TensorRT-LLM. Somewhat as an accident of history, we happened to have access to TRT-LLM before
it was announced, contributed back to it, and we still do. We pushed it to its limits and had to go
beyond it in certain areas as well.
So for example, if you know the Triton Inference Server, we've had to build our own version
of that for performance and reliability reasons.
But we invested in it heavily because it tended to be, for the use cases that we were seeing
from our customers, the best framework to handle the latency and throughput requirements
that we were seeing. In particular, when it comes to the kernels that they come with, I'm yet to
see folks do better than what Nvidia can do when it comes to CUDA kernels. However, Truss is not
tied to TensorRT-LLM. For example, the DeepSeek example that you mentioned is working with SGLang,
which is really cool to see. And we will be investing more and more in SGLang, especially as
the developer experience is just so much better than TensorRT-LLM.
We've built a lot around TensorRT-LLM, productized it too, to make it easier to work with,
but still SGLang has been a joy to work with.
Another trend that is really promising,
and I learned this from the SGLang folks,
is that the TensorRT-LLM folks have promised to modularize a lot of TRT-LLM,
so that other frameworks, like SGLang, can grab certain parts of it
and build on top of it.
And so as a user, you don't have to go all in on one framework versus another.
You can really pick and choose based on the requirements that you have.
And that's really been our approach as well.
We have customers on vLLM and we have ones that are using TensorRT-LLM,
and we have a growing number that are using SGLang too.
It's not about really tying yourself to one versus another.
It's about using the best of the bunch,
depending on the requirements of the customer and for their inference workloads.
How did you think about designing the framework?
So Replicate also has Cog, which was kind of more tied to Docker.
What were maybe some of the design decisions that you had?
And how do you think that that's changing, especially as the models change and like the
runtimes change?
Yeah, totally.
So we started Truss, gosh, like four or five years ago.
And at the time, the sort of principle that we had
in mind was let's make sure that easy things are easy but hard things are possible.
And so an example of easy things being easy is that, you know, think of it as a very simple,
you know, you have a model.
What do you need to do to serve it?
Well, you need to load it up and then you need to write the code for the inference path.
And, you know, Truss actually had, you know, hooks for these two things.
And so you could just, you know, write two functions and voila, your model was being served,
at least as a single unit.
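The "two functions" shape described here can be sketched as a plain Python class with one hook that loads the model and one that runs inference. The method names `load` and `predict` are assumed for illustration and should be checked against the Truss documentation; the word-scoring "model" is a toy stand-in, not anything Baseten ships.

```python
class Model:
    """Toy model with the two hooks a serving framework would call."""

    def __init__(self):
        self._weights = None

    def load(self):
        """Called once at startup: load weights into memory (faked here)."""
        self._weights = {"good": 1.0, "bad": -1.0}

    def predict(self, model_input):
        """Called per request: the inference path."""
        if self._weights is None:
            raise RuntimeError("load() must run before predict()")
        words = model_input["text"].split()
        return {"score": sum(self._weights.get(word, 0.0) for word in words)}


model = Model()
model.load()
print(model.predict({"text": "good good bad"}))
```

Separating the slow one-time load from the per-request path is what lets a server keep the weights resident between calls.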
We can talk about the horizontal scaling part separately.
That's a whole different topic.
And so we did well when it comes to easy things being easy.
I think we struggled with the hard things being possible in the early days.
Hard things, example of hard things are cases where we're seeing where more and more
our customers have their own custom models, custom models that sometimes they fine-tuned,
sometimes they've pre-trained.
You know, we now have six or seven foundation model companies as customers who are
sophisticated enough to pre-trained their own models and they're trusting us with the inference layer.
That's not a situation of, hey, here's two functions, good luck.
I have to have much deeper integrations with them.
And so that's where we started rethinking some of the abstractions of trust over time to allow
for those custom use cases.
And that has been successful.
Another place where we didn't think about at first but became important was seeing more
and more use cases where the customer was saying, I can serve my models on Baystand using
trust fine, but my use case is not just call the model, get the response, and run with it.
I actually have a multi-step inference workload. So an example of that is the company
bland AI with their AI phone calls. To make an AI phone call happen, you need to transcribe
what the human said, a couple of LLM calls to figure out what to say back, and then text
to speech to actually have the end-to-end workflow work.
Now, you can have these three separate models, three separate deployments, but think about what
happens is that you have to call the first model, wait for the response, call the second model,
wait for the response.
All of that network back and forth is killing you.
The latency is becoming too high.
That's not something that we had designed for initially.
And so that's when we came out with trust chains, which is the devX for building these multi-step,
multi-model inference workloads, but doing so in a very very very important.
very low latency way.
So that instead of you orchestrating all of these calls and incurring all of that network
latency, you're actually making one call and these models are actually talking to each other.
They run independently on their own hardware, with their own auto-scaling behavior,
but the data from one to the next step is being actually streamed.
And that way, going back to the AI phone call use case, you can get sub-400 millisecond latency
AI phone calls that actually feel very realistic.
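The difference between orchestrating three calls yourself and streaming data from one step to the next can be sketched in a few lines. This is a minimal, single-process illustration using Python async generators, not the Truss Chains API; the stage names (`transcribe`, `respond`, `synthesize`) and the string "models" are hypothetical stand-ins. The point it shows is that the first piece of speech comes out before the later audio chunks have even been read.

```python
import asyncio
from typing import AsyncIterator, List


async def audio_source(n_chunks: int, log: List[str]) -> AsyncIterator[str]:
    """Hypothetical caller audio, delivered chunk by chunk."""
    for i in range(n_chunks):
        log.append(f"in:audio{i}")
        yield f"audio{i}"


async def transcribe(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in for speech-to-text: emits text as soon as a chunk arrives."""
    async for chunk in chunks:
        yield chunk.replace("audio", "text")


async def respond(texts: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in for the LLM step: streams a reply per piece of text."""
    async for text in texts:
        yield text.replace("text", "reply")


async def synthesize(replies: AsyncIterator[str], log: List[str]) -> AsyncIterator[str]:
    """Stand-in for text-to-speech: records when each output is produced."""
    async for reply in replies:
        speech = reply.replace("reply", "speech")
        log.append(f"out:{speech}")
        yield speech


async def run_chain(n_chunks: int, log: List[str]) -> List[str]:
    """One entry point; the stages pass data to each other as it is produced."""
    pipeline = synthesize(respond(transcribe(audio_source(n_chunks, log))), log)
    return [speech async for speech in pipeline]


if __name__ == "__main__":
    events: List[str] = []
    print(asyncio.run(run_chain(3, events)))
    print(events)  # inputs and outputs interleave instead of running stage by stage
```

In a real deployment each stage would be a separate model on its own hardware, so the streaming would happen over the network between services rather than inside one process.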
And those are all models hosted on Baseten? Or do you also do a chain?
Those have to be models hosted on Baseten.
If one of those steps is not hosted on Baseten, then you're still incurring a massive latency on the network side.
Yeah. And then just to maybe tie this into SGLang, how do you kind of think about the hidden magic, you know?
Should people know they use SGLang?
Like, should people care?
Especially for the people building the models, does it matter to them that
they use a certain model runtime, or do they not care,
and everything just goes through the Baseten platform the same?
Yeah.
Should we talk about it?
Yes, 100%.
We want to be the transparent provider.
I don't want to say, oh, just give us your model and voila magic and trust our magic.
I want that magic to be very transparent to our customers.
That has worked really well for us.
And you really need that, especially when you're onboarding, you know,
foundation model companies, and they're not going to just turn a blind eye on how things are run
underneath the hood. When it comes to customers caring about what's happening underneath,
they do, but more than caring about this framework versus that, they care about the final
output. In other words, is the quality the same, or has something changed underneath the
hood and the model isn't actually producing the same quality? How is the latency? And especially for
certain use cases, what is the time to first token? And is that sustained? What is the P95 of that?
What is the P99 of that? How well does it handle throughput? When you start getting a massive burst of
traffic, does it still sustain those P95s and P99s? How do I make sure that the security of the data
being sent into the model is guaranteed? How do I make sure compliance is guaranteed for HIPAA use cases?
And how do I make sure that the data remains within a certain geo for compliance reasons or for latency reasons?
And so those are the concerns that the customers are coming to us with.
Less so about, hey, here's my model,
make sure you run it with TensorRT-LLM or make sure you run it with SGLang.
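The P95 and P99 questions above come down to looking at the tail of the latency distribution rather than the average. A minimal sketch, assuming you have collected time-to-first-token samples in milliseconds; the function names and the nearest-rank percentile method are choices made for this illustration, not something from the conversation.

```python
import math
from typing import Dict, Sequence


def percentile(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def ttft_report(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize time-to-first-token samples at the median and the tail."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Made-up samples: most requests are fast, two are slow during a burst.
    bursty = [100.0] * 98 + [900.0, 1000.0]
    print(ttft_report(bursty))
```

With samples like the made-up ones in the usage example, the median and P95 look healthy while the P99 exposes the slow requests, which is why the tail percentiles are the ones to watch under bursty traffic.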
Can you maybe give us an overview of all the different frameworks that people want to use?
So you have SGLang, TensorRT-LLM, vLLM.
Those are kind of like maybe the open source research ones.
And then some of the other commercial companies are building some of their own stuff.
But what's the state of the art today?
Maybe like the top three most popular.
And then we can talk about why SGLang came to be and what makes it different and
some of the performance boosts that you get.
Okay.
Yeah.
I think for the common use case, maybe not the DeepSeek V3 case,
for the common use case, I think SGLang's performance is better than vLLM's and its usability
is better than TensorRT-LLM's.
So when users care about both performance and usability, I think they will choose SGLang.
And for the DeepSeek V3 case, we did a lot of optimization in SGLang. Starting with
DeepSeek V2, they proposed an attention variant named MLA, multi-head latent attention.
And I think SGLang is the only framework that supported that.
Maybe LightLLM and TensorRT-LLM also support it, but vLLM doesn't.
And also, in SGLang version 0.4, we added support for DP attention for DeepSeek.
And in the latest SGLang release, we also support the block-wise FP8 kernel.
And that kernel was later adopted and copied by vLLM.
So I think we have done a lot of optimization
for DeepSeek. That's why SGLang is the engine recommended by the DeepSeek team.
Maybe one thing to point out, and I think this is important, is that the framework that you
choose is part of the equation for running mission critical inference workloads,
but it's only a part of it. So maybe I can draw this out just based on the experience,
based on what I've seen in the market, as to what it takes to run mission critical inference
workloads in production. I think it takes three things. And each of them individually is necessary,
but not sufficient. One is performance at the model level. So in this case, how fast are you running
this one model running on a single GPU, let's say? The framework that you use there can matter.
The techniques that you use there can matter. The MLA technique, for example, that Yineng mentioned,
or the CUDA kernels that are being used. But there are also techniques being used at a higher
level, things like speculative decoding with draft models or with Medusa heads.
And these are implemented in the different frameworks, or you can even implement it yourself,
but they're not necessarily tied to a single framework.
But using speculative decoding gets you massive upside when it comes to being able to
handle high throughput.
But that's not enough.
Invariably, that one model running on a single GPU, let's say, is going to get too much
traffic that it cannot handle. And at that point, you need to horizontally scale it. That's not an
ML problem. That's not a PyTorch problem. That's an infrastructure problem. How quickly do you go from
a single replica of that model to five, to 10, to a hundred? And so that's the
second pillar that is necessary for running these mission-critical inference workloads. And what does
it take to do that? Some people are like, oh, you just need Kubernetes, and Kubernetes
has autoscaling, and that just works.
That doesn't work for these kinds of mission-critical inference workloads.
And you end up catching yourself wanting, bit by bit,
to rebuild those infrastructure pieces from scratch.
This has been our experience.
And then going even a layer beyond that, Kubernetes runs in a single cluster.
It's a single cluster, and it's tied to a single region.
And when it comes to inference workloads and needing
GPUs, more and more we're seeing that you cannot meet the demand inside of a single
region, a single cloud's single region. In other words, a single model might want to horizontally
scale up to 200 replicas, each of which is, let's say, 2 H100s or 4 H100s, or even a full node, and
you run into limits of the capacity inside of that one region. And what we had to build
to get around that was the ability to have a single model have replicas across different regions.
So, you know, there are models on Baseten today that have 50 replicas in GCP East and 80
replicas in AWS West and Oracle in London, etc. And that was a big investment that we had to make.
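To make the capacity problem concrete, here is a minimal sketch of spilling one model's replicas across regions when no single region can hold them all. The greedy, in-order fill and the region names and capacities are assumptions made for illustration; this is not a description of how Baseten's scheduler actually places replicas.

```python
from typing import Dict, List, Tuple


def place_replicas(wanted: int, regions: List[Tuple[str, int]]) -> Dict[str, int]:
    """Fill regions in preference order until `wanted` replicas are placed.

    `regions` is a list of (region_name, free_replica_slots) pairs.
    """
    placement: Dict[str, int] = {}
    remaining = wanted
    for name, free_slots in regions:
        if remaining == 0:
            break
        take = min(remaining, free_slots)
        if take > 0:
            placement[name] = take
            remaining -= take
    if remaining > 0:
        raise RuntimeError(f"short by {remaining} replicas across all regions")
    return placement


if __name__ == "__main__":
    # Hypothetical free capacity per region, in replicas.
    capacity = [("gcp-east", 50), ("aws-west", 80), ("oci-london", 100)]
    print(place_replicas(200, capacity))
```

A production placement policy would also weigh latency to the caller, data residency, and cost, which is part of why this ends up as an infrastructure investment rather than a configuration setting.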
The final one is wrapping the power of the first two pillars in a very good developer experience,
to be able to afford certain workflows like the ones that I mentioned around, you know, multi-step, multi-model inference workloads.
Because more and more we're seeing that the market is moving towards those, that the needs are generally in these sort of more complex workflows.
So these are the three pillars that it takes to run mission-critical inference workloads.
And the choice of the framework, the serving framework, is really a part of the first pillar.
And that's something that I'm seeing in the market: people who are somewhat new to it,
they're like, well, vLLM equals production.
That's what it takes to run inference workloads.
And in practice, that is not true.
And I wanted to call that out.
I agree with him, yeah, because I think open source libraries such as vLLM, SGLang, or TensorRT-LLM
only provide a library.
They don't provide a product solution.
Yeah, can we maybe talk about some of the things unique to SGLang?
So I've read through the paper.
It sounds like some of the main use cases are when you have very large batches, which makes sense
for your use case, and also kind of longer context.
You know, what was the decision behind creating the framework?
Which I think is around one year old.
I think the paper came out in December '23, something like that.
So it's still fairly new compared to some of the other ones.
And then maybe what were some things that you had to change, you know, as you built it, or any fun stories?
Yeah, yeah, yeah.
I think last year, or not last year, sorry, in 2023, maybe August, at that time,
Lianmin and Ying wanted to create SGLang, maybe for the front end, something like LLM programming.
They wanted to solve that problem.
And in January 2024, they added support for something like the radix cache.
It's a prefix caching technology.
I think SGLang was the first framework to support prefix caching.
And in February, they also added support for constrained decoding and jump-forward.
So at that time, it was known for the generation language, not the inference backend.
And in June or July 2024, we wanted to make SGLang a fully functional LLM inference engine,
equivalent to vLLM or TensorRT-LLM.
So at that time, we published a blog post comparing it with other frameworks, and its performance was amazing.
Yeah.
At that time, I think its throughput
was maybe three times that of vLLM.
So after that, vLLM also did some refactoring to make it faster.
And in September and December, we continued to release new versions of SGLang.
Yeah, we added support for some DeepSeek optimizations,
such as MLA optimization and DP attention optimization,
and we also support the zero-overhead CPU scheduler.
Also, we support something like the SGLang router for cache-aware load balancing.
Yeah, we delivered so many features.
We just build and ship.
And I think the reason Lianmin and Ying wanted to create a new framework
rather than use the existing solutions such as vLLM or TensorRT-LLM
is that, at that time, for vLLM, I think it was easy to use,
but its performance maybe was not good.
Some of the design, I think, was not okay.
Yeah, maybe the code was a little messy, and if you wanted to add some new feature
on top of that, it was a little hard.
And TensorRT-LLM, I think it's blazing fast.
Its performance is so good, but it's not easy to do secondary development.
If you want to add some new feature, it's a little hard.
They just thought about how they could create a new framework
that achieves good performance
and is also easy to develop and maintain.
So that's why they created the SGLang project.
Let's run through maybe the three main techniques behind SGLang.
So the first one is RadixAttention, which focuses on the KV cache.
And when you think about a model that is, you know, as large as DeepSeek V3,
having better KV cache reutilization
is especially great.
Can you just talk a bit about that performance impact?
Yeah, radix cache.
I think it's a prefix caching technology.
And it's the special case where the block size is one.
You know, vLLM or other frameworks
use something like a block size of 32,
and SGLang uses a block size of one.
I think if you use a block size of one,
you can make the cache hit rate higher
than in other frameworks.
I think that's the main benefit.
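The effect of block size on the cache hit rate can be shown with a small calculation. The sketch below assumes the common block-based scheme in which only whole blocks of cached KV entries can be reused, so the reusable prefix is the shared prefix rounded down to a multiple of the block size. The function names are hypothetical and this is not code from SGLang or vLLM.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_prefix_tokens(cached: Sequence[int], new: Sequence[int], block_size: int) -> int:
    """Tokens whose KV entries can be reused when only full blocks are cacheable."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    shared = common_prefix_len(cached, new)
    return (shared // block_size) * block_size


if __name__ == "__main__":
    cached_prompt = list(range(100))             # token ids of an earlier request
    new_prompt = list(range(70)) + [999] * 30    # shares the first 70 tokens
    for block_size in (32, 1):
        print(block_size, reusable_prefix_tokens(cached_prompt, new_prompt, block_size))
```

With a block size of one, every shared token counts as a hit; with larger blocks, the tail of the shared prefix that does not fill a whole block has to be recomputed.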
And for your case specifically,
how does that change when you have
like a Baseten-type use case
where you do not have a shared endpoint?
You know, is this less helpful
for GPU clouds that serve one model for many people
that have very different use cases,
versus when you have just one endpoint for one customer?
I'm sure there's a system prompt that a lot of requests share
and things like that.
Anything you want to mention there?
Yeah.
We've seen this be massively helpful for the reason that you mentioned.
There is a certain sort of finite number of prompts, or at least prompt prefixes, that are being used per customer.
And what we've seen is that prefix caching and the different techniques to make that better have been massively helpful.
However, we still had to build on top of that.
The example there is that you have a model with dozens of replicas, each of which has its own state of KV cache.
A new request comes in, and what we used to do back in the day was that that request would be randomly assigned to one of these replicas.
But the better way to do it is, knowing the state of the KV cache in these different replicas, to decide which one it should go to.
That's one of the parameters that you need to consider. There are other parameters to consider around the size of the queue at each of the replicas and the location of each replica, depending on how geo-aware you want to be.
But adding that additional consideration around KV cache-aware load balancing was something that we saw improve latency quite a bit for our customers.
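A toy version of cache-aware routing helps show the trade-off between prefix reuse and queue depth. The scoring rule below (matched prefix tokens minus a per-queued-request penalty), the `Replica` class, and the `queue_penalty` parameter are all assumptions made for illustration; neither Baseten's load balancer nor the SGLang router is claimed to work this way, and geo-awareness is left out.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Replica:
    name: str
    queue_len: int = 0
    cached_prefixes: List[Sequence[int]] = field(default_factory=list)

    def best_match(self, tokens: Sequence[int]) -> int:
        """Longest prefix of `tokens` that this replica already has in its KV cache."""
        return max((common_prefix_len(tokens, p) for p in self.cached_prefixes), default=0)


def pick_replica(tokens: Sequence[int], replicas: List[Replica], queue_penalty: float = 2.0) -> Replica:
    """Prefer replicas with a long cached prefix, but penalize long queues."""
    return max(replicas, key=lambda r: r.best_match(tokens) - queue_penalty * r.queue_len)


if __name__ == "__main__":
    system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
    warm = Replica("warm", queue_len=1, cached_prefixes=[system_prompt])
    cold = Replica("cold", queue_len=0)
    print(pick_replica(system_prompt + [42], [warm, cold]).name)
```

The penalty term is what keeps a warm replica from being overloaded: once its queue grows long enough, recomputing the prefix on an idle replica becomes the better choice.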
And then the second part, which was maybe the harder one to understand as a practitioner, is this idea of turning some of the decoding process into a finite state machine instead of something more open-ended,
especially when you're using structured outputs. Can you maybe explain what that means?
And I would love to learn too.
So maybe this is an opportunity for everybody to better understand how you think about going from a normal kind of token-by-token decoding to having a more,
I wouldn't say pre-compiled, but like pre-understanding of what the paths are going to be.
I think SGLang supports constrained decoding, and it also supports jump-forward.
And we use something like Outlines or XGrammar to convert the schema from JSON into an FSM, a state machine.
And we can use the state machine to control the output, so that the output is in something like JSON mode, or
it has to obey some rule.
So in that case, because the output has to obey some rule, you can skip some
tokens. Say you would normally decode four times, but the rule already determines those tokens,
so you can get those tokens in advance.
You can just use one prefill to replace the four decoding steps, for example.
So that's why you can jump forward.
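The jump-forward idea can be sketched with a tiny character-level state machine. Whenever the current state has exactly one legal next character, the output is already determined and no model call is needed for it. The FSM representation, the helper names, and the fixed `{"name": "..."}` shape are hypothetical and greatly simplified; real implementations work on compiled grammars and have to deal with tokenization.

```python
from typing import Dict, Tuple

FSM = Dict[int, Dict[str, int]]  # state -> {allowed character -> next state}


def new_state(fsm: FSM) -> int:
    """Create a state with no outgoing transitions and return its id."""
    state_id = len(fsm)
    fsm[state_id] = {}
    return state_id


def add_literal(fsm: FSM, start: int, text: str) -> int:
    """Add a chain of forced transitions that spells out `text`; return the end state."""
    state = start
    for ch in text:
        nxt = new_state(fsm)
        fsm[state][ch] = nxt
        state = nxt
    return state


def build_name_fsm() -> Tuple[FSM, int, int, int]:
    """FSM for the JSON shape {"name": "<lowercase letters>"}."""
    fsm: FSM = {}
    start = new_state(fsm)
    free = add_literal(fsm, start, '{"name": "')   # forced by the schema
    for ch in "abcdefghijklmnopqrstuvwxyz":
        fsm[free][ch] = free                       # here the model really chooses
    closed = new_state(fsm)
    fsm[free]['"'] = closed                        # or it closes the string
    add_literal(fsm, closed, "}")                  # forced again
    return fsm, start, free, closed


def jump_forward(fsm: FSM, state: int) -> Tuple[str, int]:
    """Follow transitions while exactly one character is legal; return (text, new state)."""
    forced = []
    while len(fsm[state]) == 1:
        (ch, nxt), = fsm[state].items()
        if nxt == state:  # guard against a forced self-loop
            break
        forced.append(ch)
        state = nxt
    return "".join(forced), state


if __name__ == "__main__":
    fsm, start, free, closed = build_name_fsm()
    print(jump_forward(fsm, start))   # the schema-determined prefix, no decoding steps
```

The text returned by `jump_forward` can be appended to the sequence in one go, which is the sense in which a single prefill replaces several decode steps.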
I guess the question is like, why doesn't everybody do that?
When I was reading, I was like, this just sounds better, both for accuracy,
you know, you're kind of constraining for structured output, and also
you can do faster decoding.
Are there downsides to it as well?
I think maintaining jump-forward is a little hard.
Yeah.
Later, we added support for something like CPU overlap.
I think in the overlap mode, we even make it compatible with jump-forward.
Because if you want to maintain jump-forward alongside other features, it becomes more complicated.
So I think we don't use it in the default setting.
We disable it by default.
But if you want to enable it, you can just use some arguments to enable that.
But it's a little hard to maintain, especially keeping it compatible with other optimization features.
Just as a side note, you mentioned XGrammar, which I had never heard about, and I looked up the GitHub repo.
It's actually from MLC, and we talked to TQ.
Yeah, yeah, yeah.
I think a while ago.
Any comparisons between XGrammar and Outlines?
Is there a trend
in this world, or is it mostly settled science?
To be honest, I prefer XGrammar, you know.
Okay, yeah, tell us.
Yeah, MLC AI was founded by Tianqi, and both Tianqi and XGrammar's author, Yixin Dong,
graduated from Shanghai Jiao Tong University.
And the creators of SGLang, Lianmin and Ying, also graduated from Shanghai Jiao Tong.
Oh my God. Is it the Berkeley of China?
Yeah, yeah, right. And I think XGrammar's performance is better than Outlines'.
And also, in TensorRT-LLM's latest release, TensorRT-LLM also integrated XGrammar as the backend for constrained decoding.
Okay, this is new to us. We had Rémi from Outlines speak at my past conference, but I wasn't even aware of XGrammar
being a thing. But yeah, I mean, structured output is something that a lot of people care about.
We had OpenAI talk about their structured output implementation, and there's a lot of interest
in making sure that there are no tradeoffs. I think there's a little bit of FUD around
how maybe the models are dumber when you use structured output instead of the sort of base
next-token generation, but I don't think it's that significant.
Yeah. We can talk about the last one, which I don't know if it's as relevant for Baseten,
which is the third technique of SGLang, API speculative execution,
which seems to be only for API-only models.
Oh, yeah, I think it's a front-end feature.
It's not the backend.
Yeah, something like when you have some control flow for the LLM task,
such as when you have one request to get a result,
and you then continue on to another call.
And for this case, you can use the SGLang front-end language
to describe the control flow.
It will make it easier to control that pattern.
Okay, awesome.
Tracing this human path, I'm pretty sure I know the answer,
but is there a reason that big projects like Grok, you know,
like xAI, also use SGLang?
Yeah, yeah, yeah, right.
Is it just the same people?
Yeah, yeah, yeah, right.
Lianmin and Ying are members of technical staff at xAI.
I mean, it makes sense.
I wonder, you know, what's the impetus for SGLang
to kind of break containment.
It seems like vLLM obviously,
it's one year older, it has more community pull.
I wonder how this will shake out.
I don't really know.
But you know, you said it's a library.
You say vLLM's a library,
but SGLang is much more comprehensive.
I mean, do people care?
Maybe it's like when you're serving models at scale,
then you start really prioritizing the sort of performance
that SGLang offers.
I think if you care about
the performance, maybe TensorRT-LLM is the best solution for now. Especially for
latency-sensitive scenarios, TensorRT-LLM is doing well.
But if you also want to implement some features by yourself or do some optimization by yourself,
if you want to customize the framework, I think SGLang is a good option.
And vLLM, I think vLLM's community support is very nice, because it's used by so many
users and it has so many GitHub stars.
And, you know, SGLang, when I joined the SGLang team in July, it had only 2,000 stars.
And right now, it has more than seven thousand stars.
Yeah, I think it's also growing so fast.
Anything on the roadmap for SGLang that people should look forward to or be aware of?
Yeah, yeah, yeah.
We post the roadmap in an issue, and we pin that issue.
Also, we have a biweekly meeting to
sync with the community about our progress, our plan, which features we want to implement
this quarter, something like that. And we also co-host some meetups. For example, the first
meetup we co-hosted with MLC LLM and FlashInfer. And we also, yeah, participate
in some hackathons, something like the CAMEL-AI hackathon. We gave a presentation about SGLang, yeah.
I just saw it now. It sounds like, actually, you know, we mentioned EAGLE and mentioned
Medusa. I think Amir mentioned Medusa, but EAGLE is also part of that
cabal of speculative decoding techniques. It looks like it's not yet supported.
Yeah, we already support it. And I think among the open source implementations,
something like vLLM, SGLang, and other frameworks, I think it's the SOTA performance.
And currently, even TensorRT-LLM only supports EAGLE 1, not EAGLE 2.
And one thing to note about speculative decoding and the different versions of it is that
the framework supporting it is one thing, but you will have to do the job of training
the draft models or the additional heads.
And a lot of the benefits will come from how good you are at the training aspect, in terms
of the data that you use to train the draft model, to essentially distill the target
model or mimic its behavior so that you can have a very high
rate of acceptance.
The kind of throughput improvement that you get is ultimately dependent on how good
of a job you do at training the draft model, in the case of
the draft-target model mechanism.
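The link between acceptance rate and throughput can be made concrete with the usual back-of-the-envelope estimate from the speculative decoding literature. The sketch assumes each drafted token is accepted independently with the same probability, which is a simplification, and it ignores the cost of running the draft model itself. Under that assumption, with acceptance rate a and k drafted tokens, one target-model pass yields (1 - a^(k+1)) / (1 - a) tokens on average. The function name is hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target-model forward pass.

    Assumes every drafted token is accepted independently with probability
    `acceptance_rate`; the target model always contributes one token itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Illustrative acceptance rates for a poorly and a well trained draft model.
    for rate in (0.5, 0.8, 0.95):
        print(rate, round(expected_tokens_per_step(rate, draft_len=4), 2))
```

A draft model that rarely matches the target model buys almost nothing, while a well-distilled one approaches the ceiling of `draft_len + 1` tokens per pass, which is why the training work matters as much as the framework flag.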
So that's another thing. It's like,
does the framework support it, and can you just
turn on speculative decoding with a flag? That's
not the case. There's more that goes into it.
One more thing on training. I also noticed that,
with OpenAI offering fine-tuning for o1 and all these things,
I think people are also very interested in sort of RL trainers, which is what you have here.
It looks like you're supporting Hugging Face TRL and OpenRLHF.
Do you think that this will become something that a lot of people are demanding?
The general field of RL for LLMs was relatively abandoned, I think, up until the end of last year,
basically.
Yeah, I think that's all.
I don't know.
It's like it's one of those things where maybe
people have to wait for a base model that has some layer looping or some other sort of friendly
architecture for reasoning instead of just pure RL on LLMs, because I think so far people have
not really exploited RLHF as much in the wild. I mean, correct me if I'm wrong. Yeah, so I can
give you some examples of where we've seen it work. Again, this is generally done by our customers
before they come to us for inference.
But there are examples like, in the healthcare world,
fine-tuning models for understanding medical jargon,
like for Whisper, for instance,
you know, a version of Whisper that can actually understand medical jargon.
That's a non-LLM use case.
In the LLM use case, staying in the healthcare space,
models that can do medical document extractions
and do a very good job compared to even, you know, the state of the art,
because of the data that the company had gathered through human-in-the-loop.
Are those going to go away?
Is the need for those going to go away because there's a model that can do reasoning and do a very good job at it?
I don't know.
My intuition says, yes.
Will it be cost effective?
That's the question that I have.
And in the short term, no, in the long term, maybe.
But I haven't seen the need for the more traditional fine-tuning actually go down.
In fact, we see that quite a bit right now in the market.
The question for us is, do we want to address that market, knowing that the entire market might go away one day?
And my general answer to that is, let's solve today's problems.
Even if they're not around in two years, you will learn a lot along the way by onboarding customers that have today's problems.
And you'll learn from them about tomorrow's problems, and you will build ahead for them.
What, what's, hang on?
Why do you think fine-tuning might go away?
Because, like you said, there's going to be models with complex reasoning capabilities
that can actually figure it out in a, you know, few-shot kind of way without needing a large
dataset to fine-tune the model with.
That's what some people are saying.
I really have trouble believing that that'll be the case.
I much more believe that it's just easier to change your prompts
rather than actually do full fine-tunes or even parameter-efficient fine-tunes.
Is there anything else that we haven't touched on that you wish people asked you more about,
because it is something that is very interesting from your point of view, from what you're seeing among your community?
Yeah.
When we released the DeepSeek V3 support, we had some community users, something like Cursor.
Do you know Cursor?
I think it's very popular.
Of course.
Of course.
I use them every day.
When I type "code ." inside of my IDE, my terminal, it actually opens Cursor instead of VS Code.
I feel very bad for VS Code.
And when we released the DeepSeek V3 support, some employees from the Cursor team were also very interested in our implementation and reached out and asked us some questions.
So I think SGLang is growing fast, and on features and optimization, we
iterate so fast. And I think there will be more users from different companies, from different
teams, using it. Honestly, I would go back to what I emphasized earlier, which was that I wish
more people asked about what it takes to run mission-critical inference workloads. Because I see
this in the market sometimes, that they're like, well, I can just use vLLM, and that puts my model
behind an API, and that is production. But really, it takes three pillars that all need to be
there. One is performance at the model level. That is where, you know, the frameworks that we talked about
today really help you, but you still have to guide them. Like when it comes to, you know,
speculative decoding, yes, they support it, but, you know, who's going to train or fine-tune the
draft model or the Medusa heads? Or who's going to ensure the reliability of the
vLLM server that you put in production? Like, you know, there are crashes.
How do you recover from those without affecting production traffic?
But by itself, that's not enough, because invariably that one model running on a set of
hardware is going to get too much traffic that it cannot handle, and at that point,
you need to horizontally scale it.
And that's not an ML problem.
That's not a PyTorch problem.
That is an infrastructure problem: to
ensure that you can horizontally scale up your model extremely fast to meet your P90, P99
latency requirements. And to ensure that you're not running out of capacity in the single region
where that model lives, you end up having to scale that model across different regions
and even across different clouds, to ensure that that model is not being starved of resources
in the one place that it lives. So that's an area of investment that
we started investing in some time ago, and it really paid off this past year.
And the third pillar is enablement of workflows,
workflows such as the sort of AI phone call example that I told you about,
the ones that require multi-step, multi-model inference,
but in a very low-latency way.
That's a third pillar that really allows the developers to be able to use the power of the first two pillars
and then be able to combine them when, especially you need multiple models for your workflows
and doing so in a reliable way, a repeatable way, and a low latency way.
And those are the three pillars, honestly, that we have been investing a lot in,
some of which we started investing in three years ago, and it really started paying off,
you know, a year ago.
So it takes quite a bit of building to get to the point where you're
truly running customers' mission-critical inference workloads.
What do I mean by mission-critical inference workloads?
Inference where if inference is slow or down, the main product of our customer is slow or down.
So they really care about it.
They have strict requirements around latency,
around being able to support large throughput,
about being able to do so in a way that other customers' usage doesn't affect the SLAs
that they are getting, dealing with noisy neighbor problems, and inference done in a compliant
way, whether it's HIPAA or certain SOC requirements, and also inference done in a geo-aware kind of way,
both for compliance reasons and also for latency reasons, since where you forward the traffic
has an impact on latency in situations where 50 milliseconds really matter, 100 milliseconds really
matter, and we're seeing more and more of those use cases.
Well, one way I would recommend doing that is kind of a manifesto type of thing.
I'm sure you know Heroku's 12-factor app.
I've seen that.
Yes, yes.
That's a good idea, actually.
Yeah.
Maybe even put it on a separate property from Baseten and just go, like, here's what we think, you know, a mission-critical AI application should be, and, you know, have some thought leadership there and flesh it out and, you know, see if the market takes it on as a mission.
And obviously, you will be best prepared to serve that market as well.
I've also seen this done very well with EnterpriseReady.io.
I think it used to be done by, I think it was called Gravitational or Replicated.
Yeah, yeah.
From Replicated.
Yeah, yeah.
These kinds of things, when you have a list of requirements where you're like, look, everybody needs this.
Okay, like, write them up and then, you know, put a little bit of marketing on it,
split it out from the main company brand.
And that tends to work very well.
Yeah.
Good idea.
Cool.
Well, thanks so much for your time.
I think this is a really good dive into both Baseten and SGLang,
and a little bit of DeepSeek V3, which people are very interested in.
I'm trying to talk to them as well, because obviously they're a fascinating lab.
But I think you guys are doing a lot to make it accessible for everyone.
So thank you so much.
And just to give Baseten some street cred, you know, they were one of the first sponsors for
Latent Space events.
And Amir brought 100 croissants to our Latent Space hackathon in 2023.
So yeah, I just want to bring that up. I always tell people that.
I saw Phil
at AWS re:Invent, and I told him that was one of the first events that we really did
and one of the turning points of this industry as far as community goes, in my mind. You know,
everybody was there. The croissants? No, no, not the croissants, the event itself. Yeah, entire
companies launched. Yeah, I mean, you know, like Nader from Brev was there, and
with Joseph from Roboflow they did like the prompt battle thing. Like Harrison was a judge, and
Jerry from LlamaIndex was there, like kind of everybody that is kind of now breaking out.
If you look at the graph that Jensen put on the screen at CES with some of the companies
that they work with, a lot of them were at that event.
So yeah, thanks for staying involved with us, Amir, and I'm sure we'll do more together.
And thank you guys.
Many more years to come for sure.
Thank you for taking the time today.
Good to see you both.
Let's see you, Sean.
