a16z Podcast - Inferact: Building the Infrastructure That Runs Modern AI

Starting point is 00:00:00 Our goal is to make vLLM the world's inference engine, really push the capabilities on the open source front, and then build a universal inference layer. That means we'll have the runtime to power any new model on new hardware for new applications, be able to tailor that to extreme efficiency, and support all the AI workloads going forward. I fundamentally believe that open source,

Starting point is 00:00:25 especially how vLLM itself is structured, is critical to the AI infrastructure in the world. And what we want to do with Inferact is to support, maintain, steward, and push forward the open source ecosystem. It is only when vLLM wins, vLLM becomes a standard, and vLLM helps everybody achieve what they need to do,

Starting point is 00:00:48 then our company, in a sense, has the right meaning and is able to support everybody around it. What if the hardest problem in artificial intelligence isn't training smarter models, but simply keeping them running? For most of the history of computing, once a system was built, the hard part was over. You wrote the program, pressed run, and the machine behaved predictably. Even early machine learning followed that pattern. Inputs were standardized, workloads were regular, the computer did its job and stopped.

Starting point is 00:01:23 Large language models quietly broke that assumption. Every request is different. Prompts can be a sentence or an entire archive. Outputs can end instantly or stretch on indefinitely. Thousands of users can arrive at once, each making incompatible demands on the same hardware. And all of this has to happen in real time on GPUs that were never designed for this kind of unpredictability. Over the last few years, this problem has moved from obscure to essential. As models have grown larger, more diverse, and more deeply embedded into products,

Starting point is 00:01:53 the challenge of running AI systems has started to rival the challenge of building them. That's where the tension lies. The public story of AI progress is about better models and bigger breakthroughs, but underneath it is a quieter systems problem. How do you schedule chaotic requests efficiently? How do you manage memory when you don't know when a conversation is actually finished? And what changes when AI systems stop behaving like single-turn tools and start acting like agents that think, pause, and interact with the world over time?

Starting point is 00:02:21 This episode focuses on the hidden layer. We examine why inference, the act of running trained AI models, has become one of the most complex and important problems in modern computing, and why open source infrastructure is increasingly central to solving it. Matt Bornstein, general partner at Andreessen Horowitz, is joined by Simon Mo and Woosuk Kwon, co-founders of Inferact and creators of the open source inference engine vLLM. This is a conversation about the infrastructure beneath AI

Starting point is 00:02:48 and why it may matter more than the models themselves. We are here today with Simon Mo and Woosuk Kwon, lead contributors on the vLLM open source project and co-founders of Inferact, a new AI inference company. Super excited to have you guys on the show today. Thank you. Thank you so much for coming. We're going to talk a little bit about vLLM, the open source project. We're going to talk a lot about inference and what inference technology really is. And then we'll talk a little bit about Inferact, the new company. So to start, can you talk a little bit about where vLLM came from? What is it?

Starting point is 00:03:22 How did you start it? And why is it such an exciting project? Thank you for having us. The vLLM project actually started from Woosuk's prototype project at UC Berkeley during his PhD, and grew into today's open source project on GitHub, an inference runtime for everybody. Maybe Woosuk can talk a little bit about the PagedAttention paper. Oh yeah. So basically, I think it kind of started in 2022 when Meta released the OPT model as open source. I'm not actually sure how many people remember the model nowadays, but it was kind of one of the first open-weight large language models to reproduce GPT-3.

Starting point is 00:04:02 And our lab created a demo service to run the model and to, you know, demonstrate it for the broader audience. And yeah, it was working, but super slow. So I started a small side project to optimize that demo service. That was kind of the beginning. And initially, I was thinking that it may only take, like, a couple of weeks to optimize the service. But it turns out that it actually has a lot of open problems inside of it,

Starting point is 00:04:30 because this auto-regressive language model is pretty different. Actually, it was pretty different from other traditional ML workloads. And it was kind of a brand new problem, at least outside the frontier labs back in the day. I started to work on it, and it became a research project, and then we wrote a paper, and it even became an open source project, a pretty well-defined open source project,

Starting point is 00:04:54 as more and more people got interested in it. So, 2022, this is pre-GPT-4, obviously. Yeah, pre-ChatGPT. Yeah, pre-ChatGPT. Yeah. And you're thinking like, oh, I'll just work on this inference server. This should be a fairly straightforward problem. Like four years later, you're actually doing more work instead of less.

Starting point is 00:05:14 Exactly, exactly. Yeah. Why did you think this was a meaningful problem to work on at the time? Because, like, I would say most people in the world at that time saw GPT-3 as a curiosity in some sense. And OPT was kind of like a curiosity attached to a curiosity in a way. Like, what made you and your lab mates excited to work on this back then? I think I also started from curiosity.

Starting point is 00:05:35 I didn't really think it was the most important problem in the world back in the day. I just wanted to have hands-on experience with how this actually works. I mean, I think I was also impressed by the size of the model. The largest OPT model has 175 billion parameters. And that was the largest model available. So it was kind of meaningful for me. It was pretty rewarding to work on such a large model. This reminds me of when I was growing up, you know, we would build computers.

Starting point is 00:06:01 That was like the cool thing to do. And each step change in memory capacity was such a big deal. I was like, oh my God, this one has four megabytes of RAM. Oh my God, this one has 512 megabytes of RAM. Looking back, it's silly. But at the time, maybe it's because we're nerds, but you get emotionally excited about the numbers getting bigger on these systems. Right, right, right.

Starting point is 00:06:21 Yeah, I think that was clearly one of the main motivations. And so you started to say the technical problem is different for auto-regressive transformers compared to traditional machine learning. Do you mind explaining a little bit how that is? And even compare it to normal computing workloads for listeners who are engineers but may not be familiar with AI workloads. So basically, compared to the traditional workload, the clear difference is definitely GPUs, right? Now all the compute, or kind of most of the compute, is happening on the GPU. And we have to optimize for the GPU, which presumably has less memory

Starting point is 00:06:58 than the CPU, at least back in the day. Now GPUs have much larger memory, but typically still much smaller memory than the CPU, maybe. And, you know, all the computation happens on the GPU. So you have to write programs in a different language and with a different type of parallelism in mind. Yeah. So that's kind of a fundamental difference between the traditional compute-heavy workload and the deep learning workload, I would say. But within the deep learning workload, there's actually still a huge difference between the traditional deep learning workload and large language model inference. So for the traditional workload, I think the biggest characteristic is that it is pretty

Starting point is 00:07:39 static, which means, for example, for image models back in the day, like CNNs, you may have several images with different sizes. Then what we do is we resize them or crop them into the same size, and then we batch them, and then we put them through the model to run the inference at once. And because of this resizing and cropping, at the end they're all kind of compressed into the same-size tensor. And that actually makes things much simpler for the GPU to handle, right? All the shapes are pretty regular, static, and kind of well-defined.

Starting point is 00:08:20 But for a large language model, if you think about it, they're pretty dynamic. You know, your prompt can be just "hello," like a single word, or your prompt can be a bunch of documents spanning hundreds of pages. And this kind of dynamism exists inherently in the language model. And this takes things into a whole different world. We have to handle this dynamism as a first-class citizen. And yeah, back in the day, people didn't have a clear idea about how to handle it. And fortunately, we were one of the first to see the problem. That's very interesting. So regularizing a batch of inputs was, it sounds like, one of the first problems that you had to solve. It's actually more

Starting point is 00:09:09 about scheduling, and memory management as well. Yeah. So the problem we were solving before, in all the serving systems, was just what we call micro-batching. Leveraging first the CPU's fundamental vectorization in the early days before LLMs, and then early GPUs for vision models like ResNet,

Starting point is 00:09:30 was all about micro-batching. You put together four requests that arrive around the same time. But the change in the LLM world is that you always have requests continuously flowing in, and each request looks different. You just cannot really normalize them. So that's why

Starting point is 00:09:46 you have to have a notion of a step within the LLM engine, to process one token across all the requests at the same time, regardless of each request having different input lengths and output lengths. The output is also non-deterministic: the language model itself decides when it stops. Other machine learning serving, in the traditional sense, works very much like clockwork. Here it is very stochastic, it's always flowing, it is always continuous.
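
A minimal sketch of the step-based scheduling described here, assuming a toy engine in which every step produces one token for each running request and new requests are admitted as soon as a batch slot frees up. The names `Request`, `fake_decode_one_token`, and `run_engine` are hypothetical illustrations, not vLLM's API, and the random stop condition only stands in for the model deciding when to end its output.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_len: int        # prompt size differs per request (unused by the toy model)
    max_new_tokens: int    # hard cap; the "model" may stop earlier
    generated: int = 0
    finished: bool = False


def fake_decode_one_token(req, rng):
    """Stand-in for one forward pass: emit one token, then maybe stop."""
    req.generated += 1
    hit_eos = rng.random() < 0.2   # assumption: the model stops at a random point
    if hit_eos or req.generated >= req.max_new_tokens:
        req.finished = True


def run_engine(arrivals, max_batch, seed=0):
    """arrivals maps a step index to the list of requests arriving at that step."""
    rng = random.Random(seed)
    waiting = deque()
    running = []
    completed = []   # (step at which the request finished, request id)
    step = 0
    while waiting or running or any(s >= step for s in arrivals):
        waiting.extend(arrivals.get(step, []))
        # Admit new requests into free slots at every step, instead of
        # waiting for the whole batch to drain as in static micro-batching.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One step = one new token for every running request.
        for req in running:
            fake_decode_one_token(req, rng)
        completed.extend((step, r.request_id) for r in running if r.finished)
        running = [r for r in running if not r.finished]
        step += 1
    return completed


requests = [Request("a", 1, 3), Request("b", 500, 8), Request("c", 20, 5)]
done = run_engine({0: requests[:2], 2: requests[2:]}, max_batch=2)
print(done)
```

Requests finish at different steps and the third request joins mid-flight, which is the behavior the static resize-and-batch approach for image models cannot express.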

Starting point is 00:10:15 That's why scheduling is the first problem to solve, and then memory management, which is where PagedAttention comes in, is the second problem to solve. So when did you get involved in the project, Simon? Well, I got involved around 2023. Woosuk first put out a call in the Sky Lab Slack channel to say, hey, we need someone to work with us on this PagedAttention paper and kernel. Actually, surprisingly, I was on spring break and I was like, look, someone else can do this. Let me just play with GPT-4 the entire week. So I ended up just playing with prompt engineering. So I actually didn't end up joining Woosuk.

Starting point is 00:10:50 And so this is what a vacation looks like in Ion Stoica's lab: playing with models for a week. And he's playing with kernels. Yeah, exactly. So he was playing with kernels. I was trying to do more prompt engineering and explore different kinds of early agentic workflows. And then over the summer,

Starting point is 00:11:06 especially around August and September, we really got to work together. Actually, this is where you come in: we got to work together on our very first vLLM meetup, at a16z. I had the experience of managing open source projects before, and I was also deeply interested in actually building a serving platform into a fully open source project. And this is where I started to get involved,

Starting point is 00:11:34 wrote my first lines of code, built out the CI system, built out the performance benchmarking systems, and have very much worked with Woosuk ever since. I had forgotten about that. So this was the very first vLLM meetup, right? Yeah. It was in this office. In this office. On this exact floor. I think we were anticipating just 10, 20, maybe 50 people showing up. And then the registration was way over the anticipated capacity. People are extremely interested in this technology. I remember that very well because

Starting point is 00:12:06 we run events here for ourselves. And it's always very hard to get people to show up. We're always scrambling. And instead, I got a call from our security team saying, too many people have been approved for this vLLM meetup. We need to scale it back. This isn't safe. I'm like, oh, okay. I don't think we ever scaled it back. So don't tell security. It was quite crowded. The pizza ran out in, like, the first 10 minutes. But this is a big deal, right? Because this is not a consumer app that you were building. This is pulling from systems engineers, for the most part, who want to learn about how to serve LLMs and contribute. So it's actually a big deal to get, I think, so much interest from such a narrow, sophisticated

Starting point is 00:12:45 group of people who don't like meeting other humans in real life that often either, at least speaking for myself. So can you talk a little bit more about the community behind vLLM? Like, how big is it now? How did it come together? And how do you guys manage it as it's gotten big? Yeah. So in the beginning, of course, it was just a few grad students working on it. But over time, we adopted this very open-minded, develop-in-the-open kind of mindset. So as of now, we're looking at 50 or more regular full-time contributors who open up GitHub every single day to work on vLLM.

Starting point is 00:13:22 We crossed the 2,000 contributor bar on GitHub, one of the fastest growing top open source projects as ranked by GitHub itself. And this is really a diverse community. So there are folks like Woosuk and me, the team from UC Berkeley from our grad student days, as well as Meta and Red Hat putting their weight behind this open source project. And then, of course, the people who are making the models, the Mistral and Qwen teams, and really anyone who's making an open-weight model, are participating in our community.

Starting point is 00:13:57 And then on the hardware side, Nvidia, AMD, Google, AWS, Intel, they all have their own participation and are able to support the ecosystem. So everyone using vLLM has the ability to choose among different silicon for accelerated computing. That's very interesting, though, which I think is a property that many successful open source projects have, which is that people aren't all contributing for the same reason. Some people, I'm sure, just love the technology. But it sounds like you're saying the model providers actually have incentives to contribute to the project because they want their models to run well.

Starting point is 00:14:29 The silicon providers want it to run well on their silicon. The infra providers want to have first dibs on running it so they can sell infra, that kind of thing. Yeah, this is kind of the classic case of solving the M-times-N problem, so that as a model provider, you don't need to talk to everybody. And as a hardware provider, you can just plug into this one system and then magically you'll work for all the models out there in the world. And then for applications that are using vLLM, as well as infrastructure

Starting point is 00:14:56 built with vLLM, having a common ground where everybody can participate and then innovate together is way easier and cheaper, in fact, in the end, to deploy. What's your philosophy for managing a pool of contributors this large? Do you tell them what to do? Do they choose themselves? How do you maintain high code quality? It's a constant iteration, month over month, year after year. So for this, I have to go back to my previous open source project. I was working on a project called Ray, and then later at Anyscale, where I learned this community-driven approach: have clear requirements, have a clear roadmap, have clear milestones being set.

Starting point is 00:15:39 So we kind of try to borrow that, but also really study the really successful open source projects out there. I went all the way back to Linux, and then studied Kubernetes, studied Postgres. How are these communities operating together? So in vLLM we have kind of a special model: like any normal engineering organization, we set clear team scope, but also clear objectives, results, and milestones for the different technical features we want to push forward and build. So this is where we set forward our vision every quarter, but we also invite the community to contribute.

Starting point is 00:16:17 So we're saying, great, we're working on these. We also need help on these items that we don't have anyone actively working on. If you are brand new and want to engage with us or engage with the community, here's what you can work on. And then additionally, we keep an extremely open mind toward all the GitHub pull requests that people just open, where we're asking, oh, is this a good request? Is this a good feature? And then there are request-for-comment processes as well. So it's kind of a blend of all the lessons learned from previous open source projects.

Starting point is 00:16:48 And then code-quality-wise, there are code reviews, but also a lot of constant refactoring iterations. Yeah, yeah, I do a lot of refactoring. Every six months, kind of, yeah. And actually, one thing to add is, you know, we do in-person meetups, like every two months. And we're expanding globally, actually, sometimes in Europe, sometimes in other places in Asia. Yeah. And from the first meetup at a16z, we learned that it's actually super, super useful to meet those collaborators and users in person. And, yeah, we are continuing to do that.

Starting point is 00:17:24 It's funny. It's another one of these lessons where, you know, Silicon Valley engineers, we've gotten so high up the abstraction stack that we're relearning lessons from a thousand years ago. Say, no, it turns out in-person communication is high bandwidth and doesn't suffer from consistency problems. So around the time you guys did that first meetup, we also made a grant to the project through the academic lab. I think it was a small amount of money, but it was actually the very first open source grant that we made. So it's super fun and kind of gratifying for us to see that the money was actually put to good use and the project grew massively. And then we even had a chance to invest in the related company later. However, I did hear a rumor that at the time we made the grant, you guys put a portion of the money into Nvidia stock.

Starting point is 00:18:11 Can you confirm or deny? No, not him. Someone else on the recipient list. So you probably turned our tiny grant into 10 times as much money in your portfolio. Well, as for the funding for vLLM: a lot of the funding for vLLM is what we set aside for project development, testing, and everything around operating this project.

Starting point is 00:18:36 And, you know, we're actually super grateful for the first grant. It actually kicked off a culture, and nowadays you could even call it a tradition, of people really opening up to sponsor open source projects in a quite significant way, because vLLM's CI bill, for example, is more than 100K a month. That could be tiny for some folks. And it's ever growing over time. We're at a burn of a million dollars

Starting point is 00:19:04 a year, a million dollars a year. For an academic project, it's actually very good. Yeah, because we want to make sure every single commit is well tested. And this is something that people are going to deploy on not thousands, but potentially millions of GPUs across the world in different environments. So we want to make sure it's well-tested. It is reliable.

Starting point is 00:19:23 And this requirement, this infrastructure, right now all comes from contributions and sponsorship, and from everybody chipping in to help on this project. And now, of course, we also run meetups, and sometimes expenses associated with meetups directly leverage the grants that you all provided. Yeah, I mean, it makes sense. For us and for other corporate sponsors of vLLM,

Starting point is 00:19:50 it benefits the whole ecosystem, right? So I think it makes a lot of sense. Let's talk more about the technical aspects of the problem, if that's okay with you guys. Do you mind starting by just defining exactly what an inference server or an inference engine is? Sure. So an inference engine,

Starting point is 00:20:06 it takes an already trained model. This can be a very small model, like 1B. It could be a very big model, like DeepSeek or Kimi K2. It runs it on an accelerated computing device. And its job is to fully utilize the computing device to be able to generate text and images and videos, essentially, but this all gets tokenized into individual tokens. So the goal of an inference engine is to run the model at highly efficient speed to make sure that we can produce maximum

Starting point is 00:20:46 output at the highest efficiency. And just from a high level, can you explain some of the architecture, how a typical inference engine works? What are just the few most important components that people would be interested to learn more about? Maybe go through the life of a request. Like if I say hello, what would happen in vLLM? Yeah.

Starting point is 00:21:03 Yeah, so basically there's a kind of traditional API server. It gets the request, and once the model generates the output, it streams back the tokens one by one. Yeah. So there's definitely a traditional API server layer. And inside of it, we typically have something called a tokenizer, right, to transform the inputs into tokens, basically some integers, the list of integers that the language model can consume. And inside of it, we have basically

Starting point is 00:21:32 an engine, what we call the engine. And that includes a scheduler, which decides how to batch the incoming requests. And we have a memory manager to manage something called the KV cache, which is kind of the core part of the transformer for LLMs. And we definitely have some kind of worker. This is a very generic term, but it basically initializes the model, runs the model, gets the output, and does all the pre-processing for the input and post-processing for the model output. Yeah.
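
To make the memory manager component a bit more concrete, here is a toy sketch in the spirit of paging: the KV cache is split into fixed-size blocks, and each request holds a table of the physical blocks it owns. The class `BlockManager`, its methods, and the block sizes are hypothetical illustrations chosen for this example, not vLLM's actual implementation.

```python
class BlockManager:
    """Toy KV-cache allocator: fixed-size blocks plus per-request block tables."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size              # tokens stored per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # request_id -> physical block ids
        self.num_tokens = {}                      # request_id -> tokens stored so far

    def append_tokens(self, request_id, n):
        """Reserve room for n more tokens; return False if memory is exhausted."""
        have = self.num_tokens.get(request_id, 0)
        table = self.block_tables.get(request_id, [])
        blocks_required = -(-(have + n) // self.block_size)   # ceiling division
        needed = blocks_required - len(table)
        if needed > len(self.free_blocks):
            return False   # a scheduler would have to wait or preempt here
        new_blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.block_tables[request_id] = table + new_blocks
        self.num_tokens[request_id] = have + n
        return True

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


manager = BlockManager(num_blocks=4, block_size=16)
manager.append_tokens("hello", 1)        # a one-word prompt takes a single block
manager.append_tokens("documents", 40)   # a longer prompt takes three blocks
print(manager.block_tables, len(manager.free_blocks))
```

Because blocks are allocated on demand as tokens are generated, a short request and a long request can share the same pool without reserving worst-case memory up front.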

Starting point is 00:22:09 So, yeah, that's basically it. I mean, in a sense, it's not a crazy new architecture, but each part is highly optimized and specialized for this LLM inference workload. Do you think running inference is getting easier or harder over time? Yeah, I think it is definitely getting much more difficult over time. Actually, honestly, maybe one and a half years ago, I wasn't thinking of inference as a hard problem at all, to be very honest. But now things have changed. The trend has changed so far.

Starting point is 00:22:45 So I think there are kind of three factors. One is scale. Another is diversity. And the last one is kind of agents. So for scale, you know, the models are definitely getting larger. Right now we have Kimi K2 with more than a trillion parameters. But I believe we will see multi-trillion parameter open source models this year. And I think that's still clearly a trend, that people will be training larger models.

Starting point is 00:23:19 And it's definitely much more challenging to deal with such a model compared to the early days of LLMs, where we only dealt with small Llama models. And with larger models, presumably you need more nodes working concurrently, and you have more memory to manage that may or may not fit in each chip's available memory. Could you describe some of the challenges from scale? Yeah, for these kinds of large models, we definitely need to shard, you know, distribute the model across multiple GPUs, multiple nodes, right? And then there's definitely a problem of how to shard, how to distribute

Starting point is 00:23:54 this model, right? There are actually many dimensions we can use to shard the model, and they have different trade-offs. Trade-offs, for example, in terms of how much communication we have to pay for to shard the model in this way. And also, there's a trade-off in terms of load balancing. If I shard along this dimension, then how significant is the load imbalance? So these all need to be taken into account in the final performance estimation to get the best performance.
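
As a back-of-the-envelope illustration of why sharding becomes necessary and why the sharding dimension affects load balance, here is a small sketch. The figures are assumptions chosen for the arithmetic (2 bytes per parameter, 80 GB of memory per GPU, a made-up count of 60 equally sized units), not numbers from the conversation beyond the trillion-parameter scale, and the helper names are hypothetical.

```python
import math


def min_gpus_for_weights(num_params, bytes_per_param, gpu_mem_gb):
    """Lower bound on GPUs needed just to hold the weights (no KV cache, no activations)."""
    weight_gb = num_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_mem_gb)


def load_imbalance(num_units, num_shards):
    """Ratio of the busiest shard's work to the average shard's work.

    num_units is whatever is being split along the chosen dimension
    (layers, attention heads, experts); 1.0 means perfectly balanced.
    """
    busiest = math.ceil(num_units / num_shards)
    average = num_units / num_shards
    return busiest / average


# One trillion parameters at an assumed 2 bytes each is 2,000 GB of weights.
print(min_gpus_for_weights(1e12, 2, 80))
# Splitting 64 equal units over 8 shards is balanced; 60 units over 8 is not.
print(load_imbalance(64, 8), round(load_imbalance(60, 8), 3))
```

Communication cost, the other trade-off mentioned above, depends on the interconnect and the chosen parallelism scheme and is left out of this sketch.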

Starting point is 00:24:25 And yeah, it could become a bigger and bigger problem as the models get larger. And what about just cluster scale? I mean, Simon, how many nodes is vLLM running on at any given time? Right now we're looking at, and this is through a very small subset sample of our usage statistics that we use to figure out what features to deprecate, just literally from this one signal, we're looking at 400K to 500K GPUs running vLLM 24/7. That's quite a big scale, thinking about the global deployment of GPU footprints, and we definitely believe there's a lot more out there. And of course, this is a wide

Starting point is 00:25:07 diversity of different kinds of GPUs and GPU architectures, as well as model architectures being deployed. We're not seeing a one-size-fits-all, where people are using it for just one singular use case. I see. And this is sort of your second point, about diversity making it a harder problem over time. Yeah, the chip diversity, hardware diversity, is definitely one factor. And also models are getting more diverse, you know. If I think about, for example, Nvidia, a year ago I think they only released a few series of open source models, but now they're releasing many open source models

Starting point is 00:25:42 every month in different domains, right? Some on video, some on robotics, some on language. And yeah, this kind of open sourcing trend is expanding, and people are training many different kinds of models

Starting point is 00:25:57 in many different domains and releasing them every month. So there's model diversity. And even just for text models, they're all transformers, but their detailed architectures are still very diverse. And we see they're even diverging. Say, DeepSeek 3.2 was using sparse attention,

Starting point is 00:26:18 something called sparse attention. But say, Qwen and Kimi, they're kind of exploring linear attention, which is a different attention mechanism. And they have different ways to manage the memory. So yeah, this model architecture divergence is also getting more significant. And so is it up to you as, you know,

Starting point is 00:26:40 meaning vLLM, to implement all of these, you know, implement sparse attention, for instance, so that it's available for the models to use? Yeah, definitely. We basically leverage the open source community. Because we collaborate with these model vendors,

Starting point is 00:26:54 we often get help from these model vendors. They basically provide some kernels, or at least reference implementations, of these new kinds of operations. And yeah, our job is often to leverage this collaboration and make them more mature and also available for more diverse environments. I remember early on in open source models, there was some standardization.

Starting point is 00:27:21 Like everyone was kind of using Llama. I think everyone was using sort of the same tokenizer and the same input format and, you know, end-of-stream token and stuff like that. Is that still the case, or is it different for each provider now? Yeah, it has diverged quite a bit over the last few years, maybe the last couple of years. One thing is that the model architecture itself has changed a lot, you know, especially on the attention side. And also, even for input and output processing, different labs have their own ways of, you know,

Starting point is 00:27:55 how to form the conversation and how to form the tool calls, for example, for their own models. So yeah, this has been diverging quite a bit for the last couple of years. I see. Okay. So scale of models, diversity of models and hardware deployment scenarios. And then agents were the third thing you mentioned as making it harder over time. Yeah, yeah.

Starting point is 00:28:16 You know, for agents, beyond just the inference engine, we also need to set up a whole new environment, a whole new infrastructure, to support all the tool calling and to support all the multi-agent things. Yeah, that's becoming a kind of new, emerging challenge for inference as a whole. Do you think this means there will be more state managed in the inference layer over time? Before, the paradigm has been text in, text out, and just a single request and response. But as we evolve into the year and the decade of agents, we're seeing multi-turn conversations

Starting point is 00:28:58 turning into hundreds and thousands of turns. And these turns also involve external tool use, like interacting with a sandbox, performing web searches, running Python scripts or any programming language, and being able to have this kind of long iterative process where the LLM is involved, but external environment interaction is involved too. And this really kicked off a huge wave of co-optimizing agentic architecture with inference architecture. So just to give an example, it is very important for vLLM to understand whether or not the conversation is still happening. If the conversation is no longer happening, we can remove the KV cache. That is the persistent state associated with each text completion stream.

Starting point is 00:29:52 But in agentic use cases, you actually don't know whether the agent will think it is finished or will wait. The interaction previously was just a human typing in the text box. But now it becomes external environment interaction. It could be one second just for a single script to finish. It could be 10 seconds for a search or a complex analysis to finish. And then it could also be minutes or hours, if there are humans in the loop. Now, with that uncertainty, we actually don't even know when the request is going to come back. And then the uniformity of the cache access pattern and eviction pattern got kind of, the patterns got

Starting point is 00:30:34 pretty disrupted by the new paradigm. I see. I see. And so you have to be much smarter about how you manage the cache, as one example of that. Yeah. Gotcha. Gotcha.
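A minimal sketch of the cache problem described above, assuming a toy pool of KV cache blocks and a hypothetical `expected_wait` hint for how long each conversation will pause before its next turn. The names `CacheEntry`, `KVCachePool`, and `expected_wait` are illustrative assumptions and do not describe vLLM's actual data structures or eviction policy. When the pool is full, the sketch evicts the conversation expected to return last, rather than the least recently used one.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    conversation_id: str
    blocks: int            # KV cache blocks held by this conversation
    last_used: float       # time of the last request, in seconds
    expected_wait: float   # hypothetical hint: seconds until the next turn


class KVCachePool:
    """Toy pool that evicts the conversation expected to come back last."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.entries = {}

    def used_blocks(self):
        return sum(e.blocks for e in self.entries.values())

    def pick_victim(self):
        # Latest expected return time goes first; ties fall back to LRU.
        return max(
            self.entries.values(),
            key=lambda e: (e.last_used + e.expected_wait, -e.last_used),
        )

    def admit(self, entry):
        """Insert an entry, evicting others if needed. Returns evicted ids."""
        if entry.blocks > self.capacity_blocks:
            raise ValueError("entry is larger than the whole pool")
        self.entries.pop(entry.conversation_id, None)
        evicted = []
        while self.used_blocks() + entry.blocks > self.capacity_blocks:
            victim = self.pick_victim()
            del self.entries[victim.conversation_id]
            evicted.append(victim.conversation_id)
        self.entries[entry.conversation_id] = entry
        return evicted


pool = KVCachePool(capacity_blocks=10)
pool.admit(CacheEntry("script", 4, last_used=0.0, expected_wait=1.0))
pool.admit(CacheEntry("human", 4, last_used=1.0, expected_wait=3600.0))
# The pool is nearly full. Plain LRU would evict "script", which is about to
# return; the hint-aware policy evicts "human", which is waiting on a person.
evicted = pool.admit(CacheEntry("search", 4, last_used=2.0, expected_wait=10.0))
```

In practice the hard part is the one raised in the conversation: the wait time is not known in advance, so any such hint would itself be an estimate.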

Starting point is 00:30:44 Which is one of the, like, unsolvable problems in computer science. Cache invalidation. Yeah. So exactly. So I can see how that would get harder over time. I think I know the answer to this. But are you guys big believers in open-source AI compared to closed source? And can you just explain how you think about that?

Starting point is 00:31:00 We're definitely big believers in open source. What we believe is diversity will triumph over a single sort of anything at all. So that means we believe in diversity in models, diversity in chip architecture. Fundamentally, this is because the world is complex. For any application, you're going to need to find and tailor the right sort of model architecture to the right chip architecture for that exact use case. And the best way to promote diversity and improve that is through open source. Because with open source, everybody knows what everybody else is up to and is able to make their opinionated take based off the common ground. And finally, if you look at computer science, operating systems, cluster managers, databases, every single systems field got better when

Starting point is 00:31:51 they started to have a common standard and everybody deviates a little bit, innovates on top of each other, versus following a single line of trend that is proprietary and under single-source control. I see. That's very interesting. So you're almost saying OpenAI will tune their stack very tightly for their use case, which is ChatGPT or whatever other apps they're running. For an enterprise or another tech company, if I want that same level of tuning, I can't just use off-the-shelf closed-source models because I don't sort of have control of the whole stack, and the different participants in the stack kind of aren't paying attention.

Starting point is 00:32:27 Yeah, of course, one part is data, one part is the model architecture itself, which will impact the performance. And then just on the model architecture itself, right? How smart do you want the model to be? Do you want the model to be able to handle million-token contexts, or is a shorter context totally fine, right? And then you also need to specialize the model to your exact compute architecture. What chip are you using?

Starting point is 00:32:51 For example, for Nvidia, the model you design for an H100 chip is very different from a B200 chip. And then it is very different for a GB200 NVL72 system. And then compared to, for example, the model architecture you design for TPU, then again, that is also drastically different. And then using it for vision models, video generation, and for reasoning, math, coding, in the end, we'll all kind of look at the vertical stack integration. We're like, wow, they're so different from each other. That makes sense. Can you just share any stories about live vLLM deployments that you thought were particularly interesting or important? I have a few.

Starting point is 00:33:32 One is, I think around 2024, we learned that Amazon is running vLLM to power Rufus, their assistant bot, which was really surprising to all of us, because, of course, we believe vLLM can be deployed at scale, but seeing this at such a massive scale, like, a global e-commerce company deploying this as a front-page feature. That means when everybody is opening the Amazon app and clicking the bot's suggestion, or even entering a search query, it is going through vLLM. And this is kind of the first sort of magical experience in a way. One of the first experiences was, wow, my purchase is going through vLLM right now.

Starting point is 00:34:15 It's kind of exciting, but also scary. You were, like, PhD students at the time. Yeah. And also, across not just Amazon, but LinkedIn and every major deployment of vLLM, we're surprised to find out they're always the first adopters of cutting-edge features. So one of the examples I've seen of a deployment of vLLM

Starting point is 00:34:34 within Character AI was when we first made n-gram speculation for speculative decoding available as just a single PR, pull request, in vLLM, not even merged. And then while we were still iterating on that feature, I heard from someone at Character AI saying, oh, actually, we already rolled it out to hundreds of GPUs at scale given just your first iteration of this feature.

Starting point is 00:34:57 So pretty much everybody is staying on the cutting edge of vLLM, and we're quite excited about that. Okay, should we talk about the company then? Inferact. What is Inferact and why did you guys decide to start the company? So Inferact was created by the creators and maintainers of the vLLM project. Our goal is to make vLLM the world's inference engine, really push the capabilities on the open-source front, and then build a universal inference layer. That means we'll have the runtime to power any new model on new hardware for new applications, be able to tailor that to extreme efficiency, and support all the AI workloads going forward.

Starting point is 00:35:41 And implicit in what you just said is that you're devoting a lot of resources, I think, to the open-source project. I guess, is that right? And can you expand on that? Yeah, I fundamentally believe that open source, especially how vLLM itself is structured, is critical to the AI infrastructure. And what we want to do with Inferact is to support, maintain, steward, and push forward the open-source ecosystem. It is only when vLLM wins, vLLM becomes a standard, and vLLM helps

Starting point is 00:36:15 everybody to achieve what they need to do, that our company, in a sense, has the right meaning and is able to support everybody around it. So open source is definitely the number one and, in fact, the only priority of our company right now. You're not supposed to tell your investors that, by the way. We do believe that the open-source project is also kind of a secret weapon, in the sense that by having this community all work together on this open source, we have execution beyond what any single entity can have.

Starting point is 00:36:49 This is the thing we heard over and over again, that people just tell us, we just cannot keep up with vLLM. So that's why we're using vLLM. We have our internal team. We have our internal fork. We have our internal inference engine. But open source moves so fast that the only way to stay ahead is adopting it. And that's why we want to make it happen. And in fact, this is exactly why we're staying all in on open source.

Starting point is 00:37:13 That's awesome. We mentioned Ion Stoica before, obviously one of the founders of Databricks. He was, I think, PhD advisor to both of you at Berkeley, and he's going to be involved in Inferact, too. Can you talk a little bit about how he's going to be involved in this company? And even more importantly, what have you guys learned from him as, you know, his students, about startups and, you know, distributed systems and all this stuff? Sure, yeah. Yeah, you're exactly right. Ion is advisor to both of us.

Starting point is 00:37:39 I have actually worked with Ion since 2017, since I was an undergrad, working on my first open-source project for serving, and then worked with him at Anyscale on my second open-source project for serving. You're just addicted to, like, Berkeley-based open-source AI serving companies. So for this company and vLLM, Ion is quite involved. As a company, he will be a co-founder, and then as an open-source project, he has been advising this project since its inception. Ion knows open-source projects, academic projects, industry research trends, and all. So from what we're working together on, Ion really helps us with both clearly understanding all the lessons learned about bringing open source through

Starting point is 00:38:26 the final miles of adoption in companies and enterprises, as well as what is actually happening in the research world. The Sky Computing Lab over the last few years has produced amazing infrastructure and new research ideas, and Ion continues to explore new frontiers on that front. And we're quite excited to hear that and also innovate on the open source together. Yeah, and he also helps with recruiting a lot. And, you know, he is involved in all of our hiring process. He basically teaches us how to tell, you know, talent,

Starting point is 00:38:59 where to find talent. These are all amazingly helpful. So on that topic, what are some of the big problems you need to solve now, and what type of people are you hiring to help you? Definitely, you know, inference at scale is kind of one of the biggest challenges, I think, not only for us, but in the field overall. So we are trying to hire more very experienced ML infra engineers overall to figure out, you know, for example, what would be the best way to utilize the GB200 or GB300 NVL72 rack entirely for the giant open-source models. Still, I think it's an open problem.

Starting point is 00:39:41 There are definitely some endeavors in academia and industry, but I think there is some room for improvement. So yeah, that's some of our focus at the moment. Here's my pitch from a computer science point of view. It's pretty rare that people ask me this question. That is, if you're working at a vertically integrated company that has an end product for, let's say, chatbots or assistants, you are working on a vertical slice of the problem. At Inferact, you will be working on an abstraction, a horizontal layer. And this is similar to operating systems, databases,

Starting point is 00:40:22 and different kinds of abstractions that people have built over the years. Operating systems abstracted CPU and memory; databases and file systems abstracted storage devices and networking. For accelerated computing, there's a brand new physical device, and inference and vLLM abstract a large part of it for inference-specific work. Of course, there's training,

Starting point is 00:40:44 but our singular focus is on inference. And this necessitates a layer, a software layer, that abstracts away GPUs and other accelerated computing devices for models. And this is as important, from my point of view, as the abstractions built for OSes and for databases, which are both fields we were really passionate about when we were PhD students, too. So that's why ML systems is fundamentally a new field of systems

Starting point is 00:41:14 research and systems deployment. So here at Inferact, you will be working on this layer that is not a vertical slice, but a fundamental runtime impacting all the future generations of software

Starting point is 00:41:30 that will run on accelerated computing devices. And your work will span from working with different models, to working with different applications, as well as understanding the pros and cons of different chips as well as their whole integrated data center systems, to be able to figure out, oh,

Starting point is 00:41:49 actually for these, we should build the abstraction in this way. And we'll constantly remove abstractions, break abstractions, and build them over and over again, just like how operating systems got innovated over time and databases got innovated over time, with the new information we have at hand. So you will come here to have the constant exercise of building an actual, widely deployed production system that will be at the frontier of inference. And this is what you call the universal inference layer. Yeah. It's purposely vague in a way, but what we really focus on is going from PagedAttention,

Starting point is 00:42:30 going from the serving system to the whole runtime you need for intelligence. Woosuk, Simon, thank you so much for being here today. Thrilled to have you on the podcast, of course. And we're thrilled to be, you know, working together in the company. It feels like it's been a few years that we've already been working together. But yeah, great to have you here. And congratulations on getting off to a great start. Thank you for having us. Yeah. Thank you. Thanks for listening to the a16z Podcast. If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com slash a16z. We've got more great conversations coming your way.

Starting point is 00:43:08 See you next time. As a reminder, the content here is for informational purposes only. It should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com forward slash disclosures.
