Latent Space: The AI Engineer Podcast - Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang)

Starting point is 00:00:00 Welcome back. Right after Christmas, the Chinese whale bros ended 2024 by dropping the last big model launch of the year: DeepSeek V3. This is a massive 671 billion parameter, fine-grained MoE model with 256 experts, trained with native FP8 mixed precision training, multi-head latent attention from DeepSeek V2, a new multi-token prediction objective, and 15 trillion tokens of data, including synthetic reasoning data distilled from DeepSeek R1. Right now, on the LM Arena Leaderboard, DeepSeek V3 is rated the seventh best model in the world with a score of 1319, right under the full o1 model, Gemini 2, and 4o latest, and above o1-mini, Grok 2, Gemini 1.5 Pro and Claude 3.5 Sonnet.

Starting point is 00:00:55 This makes it the best open weights model in the world in January 2025. There has been a big recent trend of Chinese labs releasing very large open weights models, with Tencent releasing Hunyuan-Large in November, and Hailuo releasing MiniMax-Text this January, both over 400B in size. However, these extra large language models are very difficult to serve. Baseten was the first of the inference neocloud startups

Starting point is 00:01:23 to get DeepSeek V3 online, because of their H200 clusters, their close collaboration with the DeepSeek team and early support of SGLang, a new vLLM alternative out of UC Berkeley that is also used at frontier labs like xAI. Each H200 has 141GB of VRAM with 4.8 terabytes per second of bandwidth,

Starting point is 00:01:48 meaning that you can use 8 H200s in a node to inference DeepSeek V3 in FP8, taking into account KV cache needs. We have been close to Baseten since Sarah Guo introduced Amir Haghighat to swyx, and they supported the very first Latent Space demo day in San Francisco, which was effectively the trial run for the podcast you're listening to right now. Since then, Philip Kiely has also led a well-attended workshop on TensorRT-LLM at the 2024 World's Fair.

Starting point is 00:02:21 We worked with him to get two of their best representatives, Amir and lead model performance engineer Yineng Zhang, to discuss DeepSeek, SGLang and everything they have learned running mission-critical inference workloads at scale for some of the largest AI products in the world. Spoiler: Amir thinks there are three pillars of mission-critical inference workloads, and we spend quite some time discussing what you need for each of them.

Starting point is 00:02:48 In other news, invites are now rolling out for the second AI Engineer Summit in New York City from February 20th to 22nd. We are bringing back the surprisingly successful AI Leadership track from World's Fair, and the AI Engineering track is now wholly focused on agents at work. If you are building agents in 2025, this is the single best conference of the year. We are curating all attendees and will sell out after we announce speakers this coming week from DeepMind, Anthropic, OpenAI, Meta, Jane Street, Bloomberg, BlackRock, LinkedIn, and more.

Starting point is 00:03:28 Look for more sponsor and attendee information at apply.a.ai.org and see you there. Watch out and take care. Hey, everyone. Welcome back to the Latent Space podcast, our first recording of 2025. I'm Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host, swyx, founder of Smol AI. Hey, and today we are here with a special double-guest episode with Amir, oh my God, I don't know your last... Haghighat? That's close enough.

Starting point is 00:04:00 That is good. That's good. That's really good. And Yineng Zhang from Baseten. Welcome. Thank you. Amir, we've met before. You're co-founder of Baseten, which is, you know, one of the leading sort of LLM inference

Starting point is 00:04:14 platforms. I don't know how you, what do you consider yourself? That sounds fine. And Yineng, you are lead software engineer on the model performance team. And you guys recently shipped DeepSeek V3 as one of the many models that you do host. You also are very involved in SGLang. And that was actually one of the reasons we were discussing an episode with you, even before DeepSeek V3 dropped as a Christmas present to everybody.

Starting point is 00:04:37 So we can take this a number of directions, but I think one thing we wanted to just get off the bat on was to start with DeepSeek and then we'll work our way backwards to SGLang. But DeepSeek is kind of more recent. Why are people so interested? What's the history of, I guess, DeepSeek in general from your perspective? Yeah, because DeepSeek V3, I think, is currently considered the leading open source LLM based on the benchmark results and the Chatbot Arena results.

Starting point is 00:05:09 And it's so big, you know, it's a 671 billion parameter MoE. And I think it's a game changer for open source AI. So everyone is interested in this model. Yeah. One of the interesting things is like they are bootstrapped, like, you know, a very private small lab. Like they have a lot fewer resources than others. But it's also interesting that it's just open weights. Like for some reason, the Chinese labs are much better than the American labs at sharing open weights.

Starting point is 00:05:43 And that's obviously beneficial for Baseten. It is in your incentive to serve these models at all times. Like, you know, what are sort of the unique challenges that you face offering something this large? Yeah, I think it's because the model is very large. And if we use something like H100, we cannot serve this model. Because, you know, even if we use eight H100 cards, that is 640 gigabytes of memory. And the DeepSeek V3 model has 671 billion weights. So even if you use FP8 precision,

Starting point is 00:06:19 you need, I think, 671 gigabytes for the weights. And you also need extra memory for the KV cache. So it's not possible to run that on a single H100 node. That's why we chose H200 to run that model, or you use multi-node to run that model. Yeah, it's very challenging. And I think another one is that for DeepSeek V3, the weights were released in FP8 precision. So if you want to run it, you should support that kernel. Because I think the default is BF16, and that's even larger.
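The sizing argument above can be checked with a little arithmetic. This is only a back-of-the-envelope sketch: it counts weights alone (no KV cache, activations, or framework overhead), uses decimal gigabytes, and takes 80 GB per H100 (implied by the 640 GB figure for eight cards) and 141 GB per H200 from the conversation. The helper names are made up for illustration.

```python
# Back-of-the-envelope check: do DeepSeek V3's weights fit in one 8-GPU node?
# Weights only; KV cache and runtime overhead need additional memory on top.
GB = 1e9  # decimal gigabytes, matching how the figures are quoted


def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB."""
    return n_params * bytes_per_param / GB


def node_gb(gpus: int, gb_per_gpu: float) -> float:
    """Total GPU memory in a node, in GB."""
    return gpus * gb_per_gpu


params = 671e9                 # 671 billion parameters
fp8 = weight_gb(params, 1)     # FP8: 1 byte per parameter
bf16 = weight_gb(params, 2)    # BF16: 2 bytes per parameter
h100_node = node_gb(8, 80)     # 8x H100
h200_node = node_gb(8, 141)    # 8x H200

for name, capacity in [("8x H100", h100_node), ("8x H200", h200_node)]:
    headroom = capacity - fp8
    verdict = "fits" if headroom > 0 else "does not fit"
    print(f"{name}: {capacity:.0f} GB total, FP8 weights {verdict} "
          f"(headroom {headroom:.0f} GB)")
```

Under these assumptions the FP8 weights exceed an 8x H100 node before any KV cache is allocated, while an 8x H200 node leaves headroom for it. In BF16 the weights alone would be twice as large.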

Starting point is 00:06:58 But if you want to run the FP8, you need to support the quantization. So I think currently even TensorRT-LLM doesn't support that FP8. So if we want to implement that feature, we should do some feature development for that. The last challenging part is that, yeah, if you want to do some debugging or some performance benchmarking, it's very hard. Why? Because the model is so large and the loading time is so long. Yeah. So it makes it more complicated for the developer to do some debugging.

Starting point is 00:07:36 Is it complicated or just slow? I mean, you've only mentioned loading time, but like... Yeah, loading time is slow. Yeah, you are right. It's not complicated. Just go for more coffee. Okay, okay. Okay.

Starting point is 00:07:48 Can you maybe just give people a quick rundown of all the models you support on Baseten, and how it compares just on size? Just so, you know, people hear 671 gigabytes, but like, is that a lot more than other models? And then you mentioned BF16, FP8. What's kind of like the usual that you see? And do you see any variation based on model size or anything like that? I think at Baseten, something like Llama 70B is more common.

Starting point is 00:08:16 I think Llama 3 released the 405 billion weights, but I think there are just a few users that use that. So before DeepSeek V3, I think we hadn't encountered that issue with weights that large. I think DeepSeek V3 is the first model so big that we have to use H200

Starting point is 00:08:41 or use H100 multi-node. And was that because of performance, or why did people not use the 405B Llama? I think, you know, what I hear from people is that the performance gains of the 405B at inference time are not worth it, you know, so the 70B is kind of the sweet spot. Who are the people that use V3? Are they people that were using maybe the Llama 70B model and just want better performance?

Starting point is 00:09:05 Are they people that are just experimenting? I think that's kind of the question that people always have. There's always a lot of excitement around open source models. But then maybe the question is like, what are they really good for? I can answer this. Just observationally, the interest that we have seen, and some of this is running in production, some of it is just at the interest level, is generally not coming from folks

Starting point is 00:09:27 who are trying to upgrade from a certain open source model to DeepSeek V3. We're seeing it, generally speaking, from folks who are coming from Claude and are doing so for, and I'm going to give you a list of reasons, and generally it's a certain combination of these. In no particular order: either they're being rate limited, or the price is too high, or they have certain latency requirements or time to first token requirements for their use case that Claude cannot hit. Or they want to have full control over the model as opposed to running it behind an API

Starting point is 00:10:09 where the model underneath them can potentially change. And a couple of other reasons, but generally it's a combination of those. You mentioned the speed and some of these things. Do customers want to change the hardware? Also, like, you're referring to H200 as kind of the default thing. Like, do people come to you and say, yeah, I'd rather use a smaller system and can have worse performance? Or how do you work with customers on that? Generally, people come with certain requirements around the latency, the throughput, the cost.

Starting point is 00:10:39 Generally they're not coming in saying, I want this particular GPU SKU. At least as we go upmarket and we're talking to, you know, foundation model companies, for instance, the things that are top of mind for them are those requirements, not a particular GPU SKU. We're doing the different GPU SKUs not because we want to, you know, offer, oh, look, we have H200s, look at us. We have MIG'd H100s, look at us. It's not that. That is really because those are the tools to achieve a certain kind of time to first token for certain types of models or a certain kind of throughput and scale or a certain kind of price per

Starting point is 00:11:21 million tokens or what have you, per million images, depending on the modality. And that's the reason that we're talking GPUs. I wanted to pick up a little bit on this FP8 thing. It seems like, you know, I think Noam Shazeer started talking about sort of training natively quantized. And I think that's what DeepSeek seems to have done, at least as they said in their paper. Is this a trend? Like, is the community settling on one form, one sort of numeric format that everyone knows about? Tell us more about what you're seeing here in terms of the training trends, those sort of model trends and the, I guess.

Starting point is 00:11:58 So I think a lot of companies as well, like Together, they also release quantized versions of the Llama models, right? For like turbo or light inference, you know, just based on different levels of speed. Like, do you do anything there in terms of quantizing the models that you serve? I'll let Yineng speak to the sort of patterns around, you know, using FP8 in training. But I want to draw one distinction that I think gets to the latter part of your question, Shawn, which is that unlike companies like, you know,

Starting point is 00:12:27 you mentioned Together or Fireworks or Replicate, Baseten doesn't provide a shared inference endpoint for the popular open source models. That is a product that we don't have on purpose. Those work really well. The shared inference endpoints for open source models work really well for situations where the user is saying,

Starting point is 00:12:49 hey, let me just call a certain popular model behind an API, pay by the token. That is not our average customer or the median customer. Our customers generally have their own custom models, very custom workflows, strict requirements around latency and time to first token and can't deal with noisy neighbor problems that like, oh, the API is slow because some other customer has been calling it a lot. Other requirements around infrastructure flexibility, around regions for latency reasons,

Starting point is 00:13:20 for compliance reasons. That's the side of the inference market that we capture. And at Baseten, when you deploy a model, whether it's your own custom weights or an open source model, you get dedicated inference, dedicated resources. And so when it comes to the quantization question, where it matters is that we would never quantize the model behind, you know, the user's back and say, look at us, there's a faster Llama 70B that has been, you know, somehow quantized to be faster and cheaper. Our customers are coming to us with those requirements that I mentioned. But in particular, when it comes to model quality, they have strict

Starting point is 00:13:59 requirements. They would not be okay with us touching the weights, if you will. We have done things like speculative decoding in a couple of different ways, but all of those methods guarantee the output is unchanged, as opposed to quantization. So when it comes to quantization, we have built tooling that allows our users to quantize their models. For the ones that we're working with more hands-on through our forward deployed engineering team, we're working with them on evals as well to ensure that the quantized models are meeting their requirements. However, this is all very much in conjunction with the engineers at our customers, as opposed to us doing it behind the scenes. Yeah, FP8 training is very interesting.

Starting point is 00:14:40 And I think the DeepSeek team is the first one to use FP8 training for a large model. I think before that, maybe Lingyi, sorry, 01.AI, said they used FP8 training for Yi-Lightning, and the others, I think most of them use BF16 training. And yeah, it's a game changer. And for us, the FP8 kernel has to be implemented for inference, because it uses blockwise FP8. And currently, even if you use something like CUDA cuBLAS, you cannot support that. So you should use something like Triton to implement the kernel, or you should use something

Starting point is 00:15:26 like CUTLASS to implement that kernel. So I think that's the challenging part. My theory is that this will pick up in terms of the models that people release. Like increasingly, it will not be BF16. There was a bit of quantization research. I was trying to look for the paper while you were speaking, but I couldn't find it. But there's sort of this ablation of quantizations paper that was out last year that showed that there are benefits to quantizing and sort of natively training all the way

Starting point is 00:15:53 until like six-bit, and then even smaller than that is maybe going too far. Yeah, I'm not sure if you know what people are talking about, but there's an interesting trend for sure. Yeah, I think even though they used FP8 quantization, the benchmark results are very good, on something like GSM8K. The score is nearly 94.6.

Starting point is 00:16:17 It's so high, you know. I think it's higher than every other open-source LLM, even the Llama 405 billion. So I'm going to move on a little bit in terms of one other notable detail, and then, you know, we don't have to speak too much about DeepSeek, because obviously we don't know that much unless we work on the team. But like, you know, another trend that they have is the fine-grained MoE. I think that there is this question about whether or not MoEs will be more of a thing.

Starting point is 00:16:47 Basically, like this time last year, Mixtral was sort of kicking off a bit of an MoE trend with, you know, 8x7B, 8x22B. And then the rest of the year, no MoEs basically. So is this discovery of fine-grained MoEs going to be a relevant trend for this year? Yeah, I think so. Because as far as I know, at some companies such as Baidu or ByteDance, their internal dominant LLMs use the MoE architecture. And their approach, I think, is similar to the DeepSeek MoE model.

Starting point is 00:17:22 So I think after this new year, MoE inference optimization will be very essential and, yeah, so important. At the same time, like, why haven't, you know, the big labs done it, right? Like, I think, well, Llama 405B is dense. I think Grok is also a dense model. You can correct me if I'm wrong. And, yeah, it's just a weird counter-trend. I think this time last year, I was kind of writing my recap and I was like, all right, MoEs seem like they're going to be trending, and then they did not trend. Anyway, so it's just a note that I would

Starting point is 00:17:56 flag out there. But I generally agree. Like, it seems like fine-grained MoE is working out and I would definitely want to see more people adopting it. I went to Jeff Dean's session at NeurIPS. And he also mentioned as well that, I think, Gemini is an MoE, which I don't think we knew before that. Yeah. Yeah. So the reason why Llama didn't open-source an MoE model is, I think, they tried to train an MoE model, but they failed. So that's why they didn't open-source an MoE model for the Llama series. What are the causes of failure?

Starting point is 00:18:31 Like, why do MoEs fail? Like, I think this is another thing that people are talking about, right? Like, the failures of 3.5 Opus, the failures of GPT-5. Like, you know, it's a thing that people are sort of rumoring. Yeah, because if you want to train a model, for the training stuff, you need some benchmark or some, yeah, score. But for the MoE model, the benchmark score is even lower than the dense model.

Starting point is 00:18:55 So in that case, they think the MoE model is worse than the dense model. So they didn't release that MoE model. Okay. Well, then one more thing, I guess, maybe more commercially relevant. DeepSeek's API pricing is very competitive. How do you decide pricing in this kind of landscape with open models?

Starting point is 00:19:15 Yeah, so it goes back to the use cases that we serve. And again, going back to the fact that we don't have shared inference endpoints for the different models. And so our pricing is never per token. Customers, like I said, generally come with their own custom models or open source models, but with strict requirements around certain things: latency and time to first token requirements, or security and compliance requirements, or a particular scale that they're looking for without running into noisy neighbor problems, things like that. And the way that we price has always been based on consumption of resources. And that takes one of two shapes. One is the shape where

Starting point is 00:19:58 things are running inside of our infrastructure. And we're running, by the way, on top of multiple different public clouds, many different regions within those. Then we charge them based on the hardware that the resources that they're using. The second shape that it takes is that our customer brings their own cloud. And we're seeing this more and more where a customer has big committed resources inside of their AWS VPC or GCP or what have you. And in that world, we also have a consumption model. Of course, the price is very different because they're using their own resources, but we are managing those resources for them.

Starting point is 00:20:38 An example of that that, you know, we're seeing more recently is the fact that we've had to build multi-cloud capabilities for ourselves so that we can have a single model horizontally replicate across different regions and even different clouds. And more and more as we go upmarket, our customers have their own cloud commits. They are also multi-cloud in order to be able to get good prices and good capacity. And it's unreasonable to expect every one of them to build the same multi-cloud capabilities that we have built. And so then they take advantage of what we have built and use all of the different cloud

Starting point is 00:21:13 resources that they have as a whole holistic unit, and have their models at inference time horizontally scale across those and even optionally overflow to our cloud when they start running out of committed resources. All of that has a consumption pricing model to it. Can we talk about what it takes to actually run your service? So we had episodes with, you know, Replicate, Modal, Fireworks. We always like to ask this question. Obviously, since you're not the model maker, all the secret sauce is in how you actually run the model. I know you also have Truss, which is your more developer-led SDK. Can you maybe quickly run people through, how do you go from taking the DeepSeek V3

Starting point is 00:21:54 weights to actually running it, what goes on behind the scenes? And then we can talk about SGLang in depth a little more. Yeah, totally. So we have, like you said, we have Truss, which is our open source model packaging and deployment library. Truss works with different frameworks underneath it. It has very native and deep support for TensorRT-LLM. Somewhat as an accident of history, we happened to have access to TRT-LLM before

Starting point is 00:22:21 it was announced, contributed back to it, and we still do pushed it to its limits and had to go beyond it in certain areas as well. So for example, if you know the Triton Inference Server, we've had to build our own version of that for performance and reliability reasons. But we invested in it heavily because it tended to be for the use cases that we were seeing from our customers, it tended to be the best framework to handle the latency and throughput requirements that we were seeing. In particular, when it comes to the kernels that they come with, I'm yet to see folks do better than what Nvidia can do when it comes to CUDA kernels. However, trust is not

Starting point is 00:23:02 tied to TensorRT-LLM. For example, the DeepSeek example that you mentioned is working with SGLang, which is really cool to see. And we will be investing more and more in SGLang, especially as the developer experience is just so much better than TensorRT-LLM. We've built a lot around TensorRT-LLM, productized it too, to make it easier to work with, but still SGLang has been a joy to work with. Another trend that is really promising, and I learned this from the SGLang folks, is that the TensorRT-LLM folks have promised to modularize a lot of TRT-LLM,

Starting point is 00:23:39 so that other frameworks, like SGLang, can grab certain parts of it and build on top of it. And so as a user, you don't have to go all in on one framework versus another. You can really pick and choose based on the requirements that you have. And that's really been our approach as well. We have customers on vLLM, and we have ones that are using TensorRT-LLM, and we have a growing number that are using SGLang too. It's not about really tying yourself to one versus another.

Starting point is 00:24:09 It's about using the best of the bunch, depending on the requirements of the customer and their inference workloads. How did you think about designing the framework? So Replicate also has Cog, which was kind of more tied to Docker. What were maybe some of the design decisions that you had? And how do you think that that's changing, especially as the models change and the runtimes change? Yeah, totally.

Starting point is 00:24:33 So we started Truss, gosh, like four or five years ago. And at the time, the sort of principle that we had in mind was, let's make sure that easy things are easy but hard things are possible. And so an example of easy things being easy is that, you know, think of it very simply: you have a model. What do you need to do to serve it? Well, you need to load it up and then you need to write the code for the inference path. And, you know, Truss actually had, you know, hooks for these two things.
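The "two hooks" idea can be sketched in plain Python. This is a minimal illustration of the load-then-predict shape described above, not Truss itself: it does not import the library, the class is a stand-in, and the "model" is a toy function that upper-cases its input.

```python
class Model:
    """Illustrative two-hook model: load once, then serve predictions."""

    def __init__(self):
        self._model = None

    def load(self):
        # Hook 1: runs once at startup. A real implementation would load
        # weights onto the GPU here; this stand-in is a trivial callable.
        self._model = lambda text: text.upper()

    def predict(self, model_input: dict) -> dict:
        # Hook 2: the inference path, called once per request.
        if self._model is None:
            raise RuntimeError("load() must be called before predict()")
        return {"output": self._model(model_input["prompt"])}


# Local usage, mimicking what a serving layer would do for a single replica.
model = Model()
model.load()
result = model.predict({"prompt": "hello"})
print(result)
```

Separating the two hooks means the expensive step (loading) happens once per replica, while the per-request path stays small. Horizontal scaling is a separate concern, as noted in the conversation.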

Starting point is 00:25:06 And so you could just, you know, write two functions and voila, your model was being served, at least as a single unit. We can talk about the horizontal scaling part separately. That's a whole different topic. And so we did well when it comes to easy things being easy. I think we struggled with the hard things being possible in the early days. Examples of hard things are cases where, we're seeing this more and more, our customers have their own custom models, custom models that sometimes they fine-tuned,

Starting point is 00:25:33 sometimes they've pre-trained. You know, we now have six or seven foundation model companies as customers who are sophisticated enough to pre-train their own models, and they're trusting us with the inference layer. That's not a situation of, hey, here's two functions, good luck. We have to have much deeper integrations with them. And so that's where we started rethinking some of the abstractions of Truss over time to allow for those custom use cases. And that has been successful.

Starting point is 00:26:01 Another place that we didn't think about at first but that became important was seeing more and more use cases where the customer was saying, I can serve my models on Baseten using Truss fine, but my use case is not just call the model, get the response, and run with it. I actually have a multi-step inference workload. So an example of that is the company Bland AI with their AI phone calls. To make an AI phone call happen, you need to transcribe what the human said, a couple of LLM calls to figure out what to say back, and then text to speech to actually have the end-to-end workflow work. Now, you can have these three separate models, three separate deployments, but think about what

Starting point is 00:26:43 happens is that you have to call the first model, wait for the response, call the second model, wait for the response. All of that network back and forth is killing you. The latency is becoming too high. That's not something that we had designed for initially. And so that's when we came out with Truss Chains, which is the DevX for building these multi-step, multi-model inference workloads, but doing so in a very,

Starting point is 00:27:09 very low latency way. So that instead of you orchestrating all of these calls and incurring all of that network latency, you're actually making one call and these models are actually talking to each other. They run independently on their own hardware, with their own auto-scaling behavior, but the data from one step to the next is actually being streamed. And that way, going back to the AI phone call use case, you can get sub-400 millisecond latency AI phone calls that actually feel very realistic. And those are all models hosted on Baseten? Or do you also do a change? Those have to be models hosted on Baseten.
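A rough latency model shows why the network back-and-forth matters. This is an illustrative sketch only: the step times, round-trip time, and internal hop cost below are made-up numbers, not measurements from any provider, and the model ignores the extra gain from streaming partial results between steps.

```python
def client_orchestrated_ms(step_ms, rtt_ms):
    """Client calls each model in turn: every step pays a full round trip."""
    return sum(step_ms) + rtt_ms * len(step_ms)


def chained_ms(step_ms, rtt_ms, hop_ms):
    """One client call; steps hand off to each other inside the provider."""
    return sum(step_ms) + rtt_ms + hop_ms * (len(step_ms) - 1)


# Hypothetical per-step compute times in ms: transcription, LLM, text to speech.
steps = [100, 200, 100]
rtt = 50   # hypothetical client <-> provider round trip, ms
hop = 5    # hypothetical model-to-model hop inside the provider, ms

separate = client_orchestrated_ms(steps, rtt)
chained = chained_ms(steps, rtt, hop)
print(f"client-orchestrated: {separate} ms, chained: {chained} ms, "
      f"saved: {separate - chained} ms")
```

The saving grows with the number of steps, since each additional client-orchestrated step adds a full round trip while each chained step adds only an internal hop.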

Starting point is 00:27:42 And if one of those steps is not hosted on Baseten, then you're still incurring a massive latency on the network side. Yeah. And then just to maybe tie this into SGLang, how do you kind of think about the hidden magic, you know? Should people know they use SGLang? Like, should people care? Like especially for the people building the models, you know, does it matter to them that they use a certain model runtime, or do they not care? Everything just goes through the Baseten platform the same. Yeah.

Starting point is 00:28:11 Should we talk about it? Yes, 100%. We want to be the transparent provider. I don't want to say, oh, just give us your model and voila magic and trust our magic. I want that magic to be very transparent to our customers. That has worked really well for us. And you really need that, especially when you're onboarding, you know, foundation model companies, and they're not going to just turn a blind eye on how things are run

Starting point is 00:28:37 underneath the hood. When it comes to customers caring about what's happening underneath, they do, but more than caring about this framework versus that, they care about the final output. In other words, is the quality the same, or has something changed underneath the hood and the model isn't actually producing the same quality? How is the latency? And especially for certain use cases, what is the time to first token? And is that sustained? What is the P95 of that? What is the P99 of that? How well does it handle throughput? When you start getting a massive burst of traffic, does it still sustain those P95s and P99s? How do I make sure that the security of the data being sent into the model is guaranteed? How do I make sure compliance is guaranteed for HIPAA use cases?

Starting point is 00:29:28 And how do I make sure that the data remains within a certain geo for compliance reasons or for latency reasons? And so those are the concerns that the customers are coming to us with. Less so about, hey, here's my model, make sure you run it with TRT-LLM or make sure you run it with SGLang. Can you maybe give us an overview of all the different frameworks that people want to use? So you have SGLang, TensorRT-LLM, vLLM. Those are kind of like maybe the open source research ones. And then some of the other commercial companies are building some of their own stuff.

Starting point is 00:30:04 But what's the state of the art today? Maybe like the top three most popular. And then we can talk about why SGLang came to be and what makes it different and some of the performance boosts that you get. Okay. Yeah. I think for the common use case, maybe not DeepSeek V3, for the common use case, I think SGLang's performance is better than vLLM and its usability

Starting point is 00:30:26 is better than TensorRT-LLM. So when users care about the performance and the usability, I think they will choose SGLang. And for the DeepSeek V3 case, we did a lot of optimization in SGLang. Something like DeepSeek V2, they proposed an attention variant named MLA, multi-head latent attention. And I think SGLang is the only framework that supports that. Maybe LightLLM and TensorRT-LLM also support it, but vLLM doesn't. And also in DeepSeek, sorry, in SGLang version 0.4, we also support DP attention for DeepSeek. And in the latest SGLang release, we also support the blockwise FP8 kernel.
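To make "blockwise FP8" concrete, here is a small pure-Python sketch of the scaling step: each tile of the weight matrix gets its own scale so that its largest value maps to the top of the FP8 range. It is an illustration under stated assumptions, not the SGLang or DeepSeek kernel. It assumes the E4M3 format (largest finite value 448), uses a tiny 2x2 block so the example is readable (real kernels use much larger tiles), and skips the actual cast to 8 bits, so no rounding error appears here.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def blockwise_scales(w, block):
    """One scale per (block x block) tile: tile max magnitude / FP8 max."""
    rows, cols = len(w), len(w[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(w[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            row_scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales


def scale_down(w, scales, block):
    """Divide each element by its tile's scale (a kernel would then cast to FP8)."""
    return [[w[r][c] / scales[r // block][c // block]
             for c in range(len(w[0]))] for r in range(len(w))]


def scale_up(q, scales, block):
    """Multiply back by the tile scale to recover the original magnitudes."""
    return [[q[r][c] * scales[r // block][c // block]
             for c in range(len(q[0]))] for r in range(len(q))]


# Toy 2x4 matrix: the right-hand tile holds much larger values than the left.
w = [[1.0, -2.0, 100.0, 50.0],
     [0.5, 4.0, -25.0, 10.0]]
scales = blockwise_scales(w, block=2)
q = scale_down(w, scales, block=2)
restored = scale_up(q, scales, block=2)
print("scales:", scales)
```

Because the small-valued tile gets its own scale, it still uses the full FP8 range instead of being squeezed by the outlier in the neighboring tile. That per-tile scale is also why a matmul kernel has to be written specifically for this layout.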

Starting point is 00:31:15 And that kernel was adopted and copied by vLLM later. So I think we have done a lot of optimization for DeepSeek. That's why SGLang is the recommended engine by the DeepSeek team. Maybe one thing to point out, and I think this is important, is that the framework that you choose is part of the equation for running mission critical inference workloads, but it's only a part of it. So maybe I can draw this out just based on the experience, based on what I've seen in the market, as to what it takes to run mission critical inference workloads in production. I think it takes three things. And each of them individually is necessary,

Starting point is 00:31:58 but not sufficient. One is performance at the model level. So in this case, how fast are you running this one model running on a single GPU, let's say? The framework that you use there can matter. The techniques that you use there can matter. The MLA technique, for example, that Yineng mentioned, or the CUDA kernels that are being used. But there's also techniques being used at a higher level, things like speculative decoding with draft models or with Medusa heads. And these are implemented in the different frameworks, or you can even implement it yourself, but they're not necessarily tied to a single framework. But using speculative decoding gets you massive upside when it comes to being able to

Starting point is 00:32:42 handle high throughput. But that's not enough. Invariably, that one model running on a single GPU, let's say, is going to get more traffic than it can handle. And at that point, you need to horizontally scale it. That's not an ML problem. That's not a PyTorch problem. That's an infrastructure problem. How quickly do you go from a single replica of that model to five, to 10, to a hundred? And so that's the second pillar that is necessary for running these mission-critical inference workloads. And what does it take to do that? Some people are like, oh, you just need Kubernetes. And Kubernetes,

Starting point is 00:33:20 it has autoscaling, and that just works. That doesn't work for these kinds of mission-critical inference workloads. And you end up catching yourself wanting to, bit by bit, rebuild those infrastructure pieces from scratch. This has been our experience. And then going even a layer beyond that, Kubernetes runs in a single cluster. It's a single cluster, and it's tied to a single region. And when it comes to inference workloads and needing

Starting point is 00:33:50 GPUs, more and more, we're seeing this, that you cannot meet the demand inside of a single region, a single cloud's single region. In other words, a single model might want to horizontally scale up to 200 replicas, each of which is, let's say, two H100s or four H100s, or even a full node, and you run into limits of the capacity inside of that one region. And what we had to build to get around that was the ability to have a single model have replicas across different regions. So, you know, there are models on Baseten today that have 50 replicas in GCP East and 80 replicas in AWS West and Oracle in London, etc. And that was a big investment that we had to make. The final one is wrapping the power of the first two pillars in a very good developer experience

Starting point is 00:34:44 to be able to support certain workflows like the ones that I mentioned around, you know, multi-step, multi-model inference workloads. Because more and more we're seeing that the market is moving towards those, that the needs are generally in these sort of more complex workflows. So these are the three pillars that it takes to run mission-critical inference workloads. And the choice of the framework, the serving framework, is really a part of the first pillar. And that's something that I'm seeing in the market, that people who are somewhat new to it, they're like, well, vLLM equals equals production. That's what it takes to run inference workloads. And in practice, that is not true.
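To make the second pillar concrete, here is a minimal sketch of spreading one model's replicas over several regions once a single region runs out of GPUs. It is an illustration only, not Baseten's scheduler: the `Region` class, the `place_replicas` function, the region names, and the free-GPU counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str        # hypothetical label, e.g. "gcp-east"
    free_gpus: int   # GPUs currently available in this region (made-up numbers)


def place_replicas(regions, replicas_wanted, gpus_per_replica):
    """Greedily place replicas, filling the region with the most free GPUs first.

    Returns (placement, unplaced): a dict of region name -> replica count,
    and the number of replicas that could not be placed anywhere.
    """
    placement = {}
    remaining = replicas_wanted
    for region in sorted(regions, key=lambda r: r.free_gpus, reverse=True):
        if remaining == 0:
            break
        fits = region.free_gpus // gpus_per_replica  # whole replicas that fit here
        take = min(fits, remaining)
        if take > 0:
            placement[region.name] = take
            remaining -= take
    return placement, remaining


regions = [
    Region("gcp-east", 400),
    Region("aws-west", 640),
    Region("oci-london", 160),
]
# 200 replicas of a full 8-GPU node each: no single region can hold them all.
placement, unplaced = place_replicas(regions, replicas_wanted=200, gpus_per_replica=8)
print(placement, unplaced)
```

A real system would also weigh latency to the caller, cost, and compliance constraints per region; the greedy rule here only shows why capacity alone already forces a multi-region design.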

Starting point is 00:35:25 And I wanted to call that out. I agree with him, yeah, because I think open source libraries such as vLLM, SGLang, or TensorRT-LLM, they only provide a library. They don't provide a product solution. Yeah, can we maybe talk about some of the SGLang unique things? So I've read through the paper. It sounds like some of the main use cases are when you have very large batches, which makes sense for your use case, and also kind of longer context.

Starting point is 00:35:53 You know, what was the decision behind creating the framework? Which I think is like around one year old. I think the paper came out in December '23, something like that. So it's still fairly new compared to some of the other ones. And then maybe what were some things that you had to change, you know, as you built it, or any fun stories? Yeah, yeah, yeah. I think last year, or not last year, sorry, in 2023, maybe August, at that time, Lianmin and Ying wanted to create SGLang, maybe for the front end, something like LLM programming.

Starting point is 00:36:25 They wanted to solve that problem. And in January 2024, they supported something like radix cache. It's a prefix cache technology. I think SGLang is the first framework that supported prefix caching. And in February, they also supported constrained decoding and jump-forward. So at that time, it was known as a generation language, not an inference backend. And in 2024, June or July, we wanted to make SGLang a fully functional LLM inference engine.

Starting point is 00:37:04 It's just equivalent to vLLM or TensorRT-LLM. So at that time, we published a blog comparing it with other frameworks, and its performance was amazing. Yeah. At that time, I think its throughput was three times that of vLLM. So after that, vLLM also did some refactoring to make it faster. And in September and December, we continued to release new versions of SGLang. Yeah, we support some DeepSeek optimizations,

Starting point is 00:37:38 such as MLA optimization, DP attention optimization, and we also support the zero-overhead CPU scheduler. Also, we support something like the SGLang router for cache-aware load balancing. Yeah, we deliver so many features. We just build and ship. And I think why Lianmin and Ying wanted to create a new framework rather than use the existing solutions such as vLLM or TensorRT-LLM is because at that time, you know, at that time for vLLM, I think it's easy to use,

Starting point is 00:38:12 but its performance, maybe it's not good. Some design, I think it's not okay. Yeah, maybe the code is a little messy, and if you want to extend some new feature on top of that, it's a little hard. And TensorRT-LLM, I think it's blazing fast. Its performance is so good, but it's not easy to do some secondary development. If you want to add some new feature, it's a little hard. They just thought about, how can we create a new framework?

Starting point is 00:38:42 It can achieve good performance. Also, it's easy to develop and maintain. So that's why they created the SGLang project. Let's run through maybe the three main techniques behind SGLang. So the first one is RadixAttention, which focuses on KV cache. And when you think about a model that is, you know, as large as DeepSeek V3, especially like having better KV cache reutilization, it's great.

Starting point is 00:39:08 Can you just talk a bit about that performance impact? Yeah, radix cache. I think it's the technology of prefix caching. And it's a special case, something like block size one. You know, for vLLM or for other frameworks, they use something like block size 32. And SGLang uses block size one. I think if you use block size one,

Starting point is 00:39:31 you can make the cache hit rate higher than other frameworks. I think that's the main benefit. And for your case specifically, how does that change when you have like a Baseten-type use case where you do not have a shared endpoint versus like, you know, is this less helpful

Starting point is 00:39:49 for GPU clouds that do one model for like many people that have like very different use cases versus like when you have just one endpoint for one customer? I'm sure you have a system prompt that a lot of models share and things like that. Anything you want to mention there? Yeah. We've seen this be massively helpful for the reason that you mentioned.
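The block-size point made above about the radix cache can be illustrated with a small sketch. This is not SGLang's radix tree or vLLM's block manager; it only assumes that a block-granular prefix cache can reuse whole matching blocks, while a block size of one can reuse every matching token. The helper names and toy token ids are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token-id lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_tokens(cached, request, block_size):
    """Tokens of `request` whose KV entries could be reused from `cached`.

    Assumption: only whole matching blocks are reusable, so the shared
    prefix length is rounded down to a multiple of block_size.
    """
    matched = common_prefix_len(cached, request)
    return (matched // block_size) * block_size


system_prompt = list(range(50))                # 50 shared tokens (toy ids)
cached = system_prompt + [1001, 1002]          # an earlier request
request = system_prompt + [2001, 2002, 2003]   # a new request, same system prompt

for block_size in (1, 32):
    print(block_size, reusable_tokens(cached, request, block_size))
```

With these toy numbers, block size one reuses all 50 shared tokens, while block size 32 reuses only the first 32, which is the sense in which a finer granularity raises the cache hit rate.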

Starting point is 00:40:10 There is a certain sort of finite number of prompts, or at least prompt prefixes, that are being used per customer. And what we've seen is that prefix caching and the different techniques to make that better have been massively helpful. However, we still had to build on top of that. The example there is that you have a model with dozens of replicas, each of which has its own KV cache state. A new request comes in, and what we used to do back in the day was that that request would be randomly assigned to one of these replicas. But the better way to do it is, knowing the state of the KV cache in these different replicas, to decide which one it should go to. That is one of the parameters that you need to consider; there are other parameters to consider around the size of the queue at each of the replicas and the location of each replica, depending on how geo-aware you want to be. But adding that additional consideration around KV-cache-aware load balancing was something that we saw improve latency quite a bit for our customers.
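The routing idea described here can be sketched as a scoring function over replicas. This is a toy illustration under stated assumptions, not Baseten's or the SGLang router's actual algorithm: the `Replica` class, the `score` formula, and the `queue_penalty` weight are all hypothetical, and geo-awareness is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    queue_len: int                                        # requests waiting on this replica
    cached_prefixes: list = field(default_factory=list)   # token-id lists already in KV cache


def shared_prefix_len(a, b):
    """Number of leading tokens that two token-id lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def score(replica, request_tokens, queue_penalty=8):
    """Higher is better: reward cached prefix tokens, penalize queued requests."""
    best_hit = max(
        (shared_prefix_len(p, request_tokens) for p in replica.cached_prefixes),
        default=0,
    )
    return best_hit - queue_penalty * replica.queue_len


def route(replicas, request_tokens, queue_penalty=8):
    """Pick the replica with the best cache-hit versus queue-length trade-off."""
    return max(replicas, key=lambda r: score(r, request_tokens, queue_penalty))


prompt = list(range(40))  # a shared 40-token prompt prefix (toy ids)
replicas = [
    Replica("a", queue_len=1, cached_prefixes=[prompt + [900]]),
    Replica("b", queue_len=0),                                     # idle, but cold cache
    Replica("c", queue_len=6, cached_prefixes=[prompt + [901, 902]]),  # warm, but busy
]
chosen = route(replicas, prompt + [7, 8])
print(chosen.name)
```

Random assignment would ignore both signals; the sketch shows how a warm cache on a lightly loaded replica can beat both an idle cold replica and a warm but overloaded one.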

Starting point is 00:41:23 And then the second part, which was maybe the harder one to understand as a practitioner, which is this idea of like turning some of the decoding process into a finite state machine instead of a more open-ended one. Especially when you're using structured outputs, can you maybe explain what that means? And I would love to learn too. So maybe this is an opportunity for everybody to better understand how you think about going from a normal kind of like token-by-token decoding to having a more, I wouldn't say pre-compiled, but like pre-understanding of what the paths are going to be. I think SGLang supports constrained decoding, and it also supports jump-forward. And we use something like Outlines or XGrammar to do something like convert the schema from JSON to the FSM, the state machine. And we can use the state machine to control the output, something like the output may be in JSON mode or something

Starting point is 00:42:22 like, it should obey some rule. So in that case, because the output should obey some rule, you can skip some tokens. Something like, you would decode four times, but you should obey that rule, so you can get those tokens in advance. You can just use one prefill to replace the four decodes, yeah, for example. So that's why you can jump forward. I guess the question is like, why doesn't everybody do that? When I was reading, I was like, this just sounds better, especially both for accuracy.

Starting point is 00:42:55 You know, you're kind of constraining for structured output as well. You can do faster decoding. Are there downsides to it as well? I think maintaining jump-forward is a little hard. Yeah. Later, we supported something like CPU overlap. I think in the overlap mode, we even make it compatible with jump-forward. Because if you want to maintain jump-forward with other features, it will be more complicated.
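The constrained decoding and jump-forward idea in this exchange can be sketched with a toy character-level state machine. This is not how Outlines, XGrammar, or SGLang implement it (they work on model tokens and compile real schemas); `build_fsm`, `decode`, and the `choose` callback are hypothetical names, and `choose` stands in for a model forward pass with masked sampling.

```python
import json


def build_fsm():
    """Toy FSM accepting exactly '{"ok": true}' or '{"ok": false}'.

    Represented as {state: {char: next_state}}; state 0 is the start state
    and any state without outgoing transitions is final.
    """
    fsm = {}
    next_id = [1]

    def chain(start, text):
        # Add one state per character of a fixed literal.
        state = start
        for ch in text:
            new_state = next_id[0]
            next_id[0] += 1
            fsm.setdefault(state, {})[ch] = new_state
            state = new_state
        return state

    branch = chain(0, '{"ok": ')              # forced prefix
    chain(chain(branch, "true"), "}")         # first alternative
    chain(chain(branch, "false"), "}")        # second alternative
    return fsm


def decode(fsm, choose):
    """Walk the FSM; only call `choose` (the model stand-in) at real choices."""
    state, out, model_calls = 0, [], 0
    while state in fsm:
        allowed = fsm[state]
        if len(allowed) == 1:
            ch = next(iter(allowed))          # only one legal continuation: jump forward
        else:
            ch = choose(sorted(allowed))      # several legal continuations: ask the model
            model_calls += 1
        out.append(ch)
        state = allowed[ch]
    return "".join(out), model_calls


text, model_calls = decode(build_fsm(), choose=lambda allowed: allowed[-1])
print(text, model_calls, json.loads(text))
```

In this toy, the forced characters are simply appended; in a real engine, as described above, a forced run would be fed through one prefill instead of several decode steps. The output is always valid JSON, and the stand-in model is consulted only at the single point where the grammar leaves a choice.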

Starting point is 00:43:21 So I think we don't use it as the default setting. We disable it by default. But if you want to enable it, you can just use some arguments to enable that. But it's a little hard to maintain, especially keeping it compatible with other optimization features. Just as a side note, you mentioned XGrammar, which I had never heard about, and I looked up the GitHub repo. It's actually from MLC, and we talked to TQ. Yeah, yeah, yeah. I think a while ago.

Starting point is 00:43:47 Any comparisons between XGrammar and Outlines? Is there a trend in this world, or is it mostly settled science? To be honest, I prefer XGrammar, you know. Okay, yeah, tell us. Yeah, MLC AI was founded by Tianqi. Both Tianqi and XGrammar's author, Yixin Dong, graduated from Shanghai Jiao Tong University. And the creators of SGLang, Lianmin and Ying, they also graduated from Shanghai Jiao Tong.

Starting point is 00:44:19 Oh my God. Is it the Berkeley of China? Yeah, yeah, right. And I think XGrammar's performance is better than Outlines. And also in TensorRT-LLM's latest release, TensorRT-LLM also integrates XGrammar as the backend for constrained decoding. Okay, this is new to us. We had Rémi from Outlines speak at my past conference, but I wasn't even aware of XGrammar being a thing. But yeah, I mean, structured output is something that a lot of people care about. We had OpenAI talk about their structured output implementation, and there's a lot of interest in making sure that there are no tradeoffs. I think there's a little bit of FUD around how maybe the models are dumber when you use structured output instead of the sort of base

Starting point is 00:45:09 sort of next-token generation, but I don't think it's that significant. Yeah. We can talk about the last one, which I don't know if it's as relevant for Baseten, which is the third technique of SGLang: API speculative execution, which seems to be only for API-only models. Oh, yeah, I think it's a front-end feature. It's not the backend. Yeah, something like you have some control flow for the LLM task, such as you have one request to get a result,

Starting point is 00:45:38 and you just continue to another call. And for this case, you can use the SGLang front-end language to describe the control flow. It will make it easier to control that pattern. Okay, awesome. Tracing this human path, I'm pretty sure I know the answer, but is there a reason that big projects like Grok, you know, like xAI, also use SGLang?

Starting point is 00:46:01 Yeah, yeah, yeah, right. Is it just the same people? Yeah, yeah, yeah, right. Lianmin and Ying are members of technical staff at xAI. I mean, it makes sense. I wonder, you know, what's the impetus for SGLang to kind of break containment. It seems like vLLM obviously has the,

Starting point is 00:46:21 it's one year older, it has more community pull. I wonder how this will shake out. I don't really know. But you know, you said it's a library. You say vLLM's a library, but SGLang is much more comprehensive. I mean, do people care? Maybe it's like when you're serving models at scale,

Starting point is 00:46:39 then you start really prioritizing the sort of performance that SGLang offers. I think if you care about the performance, maybe TensorRT-LLM is the best solution for now, especially for

Starting point is 00:46:53 the latency-sensitive scenarios, TensorRT-LLM is doing well. But if you also want to implement some feature by yourself

Starting point is 00:47:01 or do some optimization by yourself, you want to customize the framework, I think SGLang is a good option. And vLLM, I think vLLM's

Starting point is 00:47:11 community support is very nice because it is used by so many users and it has so many GitHub stars. And, you know, SGLang, when I joined the SGLang team in July, it had only 2,000 stars. And right now, it has more than 7,000 stars.

Starting point is 00:47:28 Yeah, I think it also grows so fast. Anything that people should look forward to that's on the roadmap for SGLang that people should be aware of? Yeah, yeah, yeah. We post the roadmap in an issue, and we pin that issue. Also, we have biweekly meetings to sync with the community about our progress, our plan, which features we want to implement in this quarter, something like that. And we also co-host some meetups, something like the first meetup we co-hosted with MLC-LLM and FlashInfer. And we also, yeah, participate

Starting point is 00:48:06 at some hackathons, something like the CAMEL-AI hackathon. We do presentations about SGLang, yeah. I just saw it now. It sounds like actually that there's, you know, we mentioned EAGLE and mentioned Medusa. I think Amir mentioned Medusa, but EAGLE is also part of that cabal of speculative decoding techniques. It looks like it's not yet supported. Yeah, we already support it. And I think among the open source implementations, something like vLLM, SGLang and other frameworks, I think it's the SOTA performance. And currently, even if you use TensorRT-LLM, it only supports EAGLE 1, not EAGLE 2. And one thing to note about the different versions of speculative decoding is that,

Starting point is 00:48:47 the framework supporting it is one thing, but you will have to do the job of training the draft models or the additional heads. And a lot of the benefits will come from how good you are at the training aspect, in terms of the data that you use to train the draft model, to essentially distill the target model or mimic its behavior so that you can have a very high rate of acceptance. The kind of throughput improvements that you get

Starting point is 00:49:22 are ultimately dependent on how good of a job you do at training the draft model, in the case of the draft-target model mechanism. So that's another thing. It's like, does the framework support it, so you can just turn on speculative decoding with a flag? That's

Starting point is 00:49:39 not the case. There's more that goes into it. One more thing on training. I also noticed that with OpenAI offering fine-tuning for o1 and all these things, I think people are also very interested in sort of RL trainers, which is what you have here. It looks like you're supporting Hugging Face TRL and OpenRLHF. Do you think that this will become something that a lot of people are demanding, like the sort of general field of RL for LLMs was relatively abandoned, I think, up till like the end of last year,

Starting point is 00:50:08 basically? Yeah, I think that's all. I don't know. It's like it's one of those things where like maybe people have to wait for a base model that has some layer looping or some other sort of friendly architecture for reasoning instead of just pure RL on LLMs, because I think so far people have not really exploited RLHF as much in the wild. I mean, correct me if I'm wrong. Yeah, so I can give you some examples of where we've seen it work. Again, this is generally done by our customers

Starting point is 00:50:43 before they come to us for inference. But there are examples like in the healthcare world, fine-tuning models for understanding medical jargon, like for Whisper, for instance, you know, a version of Whisper that can actually understand medical jargon. That's a non-LLM use case. In the LLM use case, staying in the healthcare space, models that can do medical document extractions

Starting point is 00:51:08 and do a very good job compared to even, you know, state of the art, because of the data that the company had gathered through human in the loop. Are those going to go away? Is the need for those going to go away because there's a model that can do reasoning and do a very good job at it? I don't know. My intuition says yes. Will it be cost effective? That's the question that I have.

Starting point is 00:51:31 And in the short term, no; in the long term, maybe. But I haven't seen the need for the more traditional fine-tuning actually go down. In fact, we see that quite a bit right now in the market. The question for us is, do we want to address that market, knowing that the entire market might go away one day? And my general answer to that is, let's solve today's problems. Even if they're not around in two years, you will learn a lot along the way by onboarding customers that have today's problems. And you'd learn from them about tomorrow's problems and you will build ahead for them. What, hang on?

Starting point is 00:52:12 Why do you think fine-tuning might go away? Because, like you said, there's going to be models with complex reasoning capabilities that can actually figure it out in a, you know, few-shot kind of way without needing a large dataset to fine-tune the model with. That's what some people are saying. I really have trouble believing that that'll be the case. I much more believe that it's just easier to change your prompts rather than actually do full fine-tunes or even parameter-efficient fine-tunes.

Starting point is 00:52:45 Is there anything else that we haven't touched on that you wish people asked you more about, because it is something that is very interesting from your point of view of what you're seeing among your community? Yeah. When we released the DeepSeek V3 support, we had some community users, something like Cursor. Do you know Cursor? I think it's very popular. Of course. Of course.

Starting point is 00:53:08 I use them every day. When I type "code ." inside of my IDE, my terminal, it actually opens Cursor instead of VS Code. I feel very bad for VS Code. And when we released the DeepSeek V3 support, some employees from the Cursor team were also very interested in our implementation and reached out and asked us some questions. So I think SGLang grows fast, and on the features and optimizations, we iterate so fast. And I think there will be more users from different companies, from different teams, to use it. Honestly, I would go back to what I emphasized earlier, which was that I wish more people asked about what it takes to run mission-critical inference workloads. Because I see

Starting point is 00:54:01 this in the market sometimes, that they're like, well, I can just use vLLM, and that puts my model behind an API, and that is production. But really, it takes three pillars that all need to be there. One is performance at the model level. That is where, you know, the frameworks that we talked about today really help you, but you still have to guide them. Like when it comes to, you know, speculative decoding, yes, they support it, but, you know, who's going to train or fine-tune the draft model or the Medusa heads? Or who's going to ensure the reliability of the vLLM server that you run in production? Like, you know, there are crashes. How do you recover from those without affecting production traffic?
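To see why the quality of the draft model matters so much for speculative decoding, here is a small worked calculation. It uses the standard simplifying assumption that each drafted token is accepted independently with the same probability; under that assumption the expected number of tokens produced per target-model pass is (1 - a^(k+1)) / (1 - a) for acceptance rate a and draft length k. The function name and the sample acceptance rates are illustrative, not measurements from any framework.

```python
def expected_tokens_per_target_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per target-model forward pass.

    Assumes each of the `draft_len` drafted tokens is accepted independently
    with probability `acceptance_rate`; one extra token always comes from
    the target model itself.
    """
    a, k = acceptance_rate, draft_len
    if a == 1.0:
        return float(k + 1)        # every draft token accepted, plus the bonus token
    return (1 - a ** (k + 1)) / (1 - a)


for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_target_pass(rate, draft_len=4)
    print(f"acceptance={rate:.1f} -> {tokens:.2f} tokens per target pass")
```

With a draft length of four, going from a 0.5 to a 0.9 acceptance rate raises the expected yield from about 1.94 to about 4.10 tokens per target pass, which is why turning on a flag is not enough: the gain depends on how well the draft model or extra heads were trained to mimic the target.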

Starting point is 00:54:45 But by itself, that's not enough, because invariably that one model running on a set of hardware is going to get more traffic than it can handle, and at that point, you need to horizontally scale it. And that's not an ML problem. That's not a PyTorch problem. That is an infrastructure problem, to ensure that you can horizontally scale up your model extremely fast to meet your P90, P99 latency requirements, and to ensure that you're not running out of capacity in a single region

Starting point is 00:55:19 where that model lives. You end up having to scale that model across different regions, and across different clouds even, to ensure that that model is not being starved of resources in the one place that it lives. So that's an area of investment that we started investing in some time ago, and it really paid off this past year. And the third pillar is enablement of workflows, workflows such as the sort of AI phone call example that I told you about, the ones that require multi-step, multi-model inference, but in a very low-latency way.

Starting point is 00:55:58 That's a third pillar that really allows the developers to be able to use the power of the first two pillars and then be able to combine them, especially when you need multiple models for your workflows, and doing so in a reliable way, a repeatable way, and a low-latency way. And those are the three pillars, honestly, that we have been investing a lot in, some of which we started investing in three years ago, and it really started paying off, you know, a year ago. So it takes quite a bit of building to get it to the point where you're

Starting point is 00:56:38 truly running folks' customers' mission-critical inference workloads. What do I mean by mission-critical inference workloads? Inference where, if inference is slow or down, the main product of our customer is slow or down. So they really care about it. They have strict requirements around latency, around being able to support large throughput, about being able to do so in a way that other customers' usage doesn't affect the SLAs that they are getting, dealing with noisy neighbor problems, and inference done in a compliant way, whether it's HIPAA or certain SOC requirements, and also inference done in a geo-aware kind of way,

Starting point is 00:57:17 both for compliance reasons and also for latency reasons, because where you forward the traffic has an impact on latency in situations where 50 milliseconds really matter, 100 milliseconds really matter, and we're seeing more and more of those use cases. Well, one way I would recommend doing that is kind of a manifesto type of thing. I'm sure you know Heroku's 12-factor app. I've seen that. Yes, yes. That's a good idea, actually.

Starting point is 00:57:42 Yeah. Maybe even put it on a separate property than Baseten and just go, like, here's what we think, you know, mission-critical AI applications should be, and, you know, have some thought leadership there and flesh it out and, you know, see if the market takes it on as a mission. And obviously, you will be best prepared to serve that market as well. I've also seen this done very well with EnterpriseReady.io. I think it used to be done by, I think it was called Gravitational or Replicated. Yeah, yeah. From Replicated. Yeah, yeah.

Starting point is 00:58:11 These kinds of things, when you have a list of requirements, when you're like, look, everybody needs this. Okay, like write them up and then, you know, put a little bit of marketing on it, split it out from the main company brand. And that tends to work very well. Yeah. Good idea. Cool. Well, thanks so much for your time.

Starting point is 00:58:26 I think this is a really good dive into both Baseten and SGLang, and a little bit of DeepSeek V3, which people are very interested in. I'm trying to talk to them as well because obviously they're a fascinating lab. But I think you guys are doing a lot to make it accessible for everyone. So thank you so much. And just to give Baseten some street cred, you know, they were one of the first sponsors for Latent Space events. And Amir brought 100 croissants to our Latent Space hackathon in 2023.

Starting point is 00:58:55 So yeah, just want to bring that up. I always tell... I saw Phil at AWS re:Invent and I told him that was one of the first events that we really did, and one of the turning points of this industry as far as community goes, in my mind. You know, everybody was there. The croissants? No, no, not the croissants, the event itself. Yeah, entire companies launched. Yeah, you were, I mean, you know, like Nader from Brev was there, and with Joseph from Roboflow they did like the prompt battle thing, like Harrison was a judge, and it's like Jerry from LlamaIndex was there, like kind of like everybody that is kind of now breaking out.

Starting point is 00:59:30 If you look at the graph that Jensen put on the screen at CES with some of the companies that they work with, a lot of them were at that event. So yeah, thanks for staying involved with us, Amir, and I'm sure we'll do more together. And thank you guys. Many more years to come for sure. Thank you for taking the time today. Good to see you both. Let's see you, Sean.
