The a16z Show - Inferact: Building the Infrastructure That Runs Modern AI

Episode Date: January 22, 2026

Inferact is a new AI infrastructure company founded by the creators and core maintainers of vLLM. Its mission is to build a universal, open-source inference layer that makes large AI models faster, cheaper, and more reliable to run across any hardware, model architecture, or deployment environment. Together, they broke down how modern AI models are actually run in production, why “inference” has quietly become one of the hardest problems in AI infrastructure, and how the open-source project vLLM emerged to solve it. The conversation also looked at why the vLLM team started Inferact and their vision for a universal inference layer that can run any model, on any chip, efficiently.

Follow Matt Bornstein on X: https://twitter.com/BornsteinMatt
Follow Simon Mo on X: https://twitter.com/simon_mo_
Follow Woosuk Kwon on X: https://twitter.com/woosuk_k
Follow vLLM on X: https://twitter.com/vllm_project
Follow our host: https://twitter.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Transcript
Starting point is 00:00:00 Our goal is to make vLLM the world's inference engine, really push the capabilities on the open source front, and then build the universal inference layer. That means we'll have the runtime to power any new model on new hardware for new applications, be able to tailor that to extreme efficiency, and support all the AI workflows going forward. I fundamentally believe that open source,
Starting point is 00:00:25 especially how vLLM itself is structured, is critical to the AI infrastructure in the world. And what we want to do with Inferact is to support, maintain, steward, and push forward the open source ecosystem. It is only when vLLM wins, vLLM becomes a standard, and vLLM helps everybody to achieve what they need to do,
Starting point is 00:00:48 that our company, in a sense, has the right meaning and is able to support everybody around it. What if the hardest problem in artificial intelligence isn't training smarter models, but simply keeping them running? For most of the history of computing, once a system was built, the hard part was over. You wrote the program, pressed run, and the machine behaved predictably. Even early machine learning followed that pattern. Inputs were standardized, workloads were regular, the computer did its job and stopped.
Starting point is 00:01:23 Large language models quietly broke that assumption. Every request is different. Prompts can be a sentence or an entire archive. Outputs can end instantly or stretch on indefinitely. Thousands of users can arrive at once, each making incompatible demands on the same hardware. And all of this has to happen in real time on GPUs that were never designed for this kind of unpredictability. Over the last few years, this problem has moved from obscure to essential. As models have grown larger, more diverse, and more deeply embedded into products,
Starting point is 00:01:53 the challenge of running AI systems has started to rival the challenge of building them. That's where the tension lies. The public story of AI progress is about better models and bigger breakthroughs. But underneath it is a quieter systems problem. How do you schedule chaotic requests efficiently? How do you manage memory when you don't know when a conversation is actually finished? And what changes when AI systems stop behaving like single-turn tools and start acting like agents that think, pause, and interact with the world over time?
Starting point is 00:02:21 This episode focuses on the hidden layer. We examine why inference, the act of running trained AI models, has become one of the most complex and important problems in modern computing and why open source infrastructure is increasingly central to solving it. Matt Bornstein, general partner at Andreessen Horowitz, is joined by Simon Mo and Woosuk Kwon, co-founders of Inferact and creators of the open source inference engine vLLM. This is a conversation about the infrastructure beneath AI
Starting point is 00:02:48 and why it may matter more than the models themselves. We are here today with Simon Mo and Woosuk Kwon, lead contributors on the vLLM open source project and co-founders of Inferact, a new AI inference company. Super excited to have you guys on the show today. Thank you. Thank you so much for coming.
Starting point is 00:03:09 We're going to talk a little bit about vLLM, the open source project. We're going to talk a lot about inference and what inference technology really is. And then we'll talk a little bit about Inferact, the new company. So to start, can you talk a little bit about where vLLM came from? What is it? How did you start it? And why is it such an exciting project? Thank you for having us.
Starting point is 00:03:26 The vLLM project actually started from Woosuk's prototype project at UC Berkeley during his PhD and grew into today's open source project on GitHub, an inference runtime for everybody. Maybe Woosuk can talk a little bit about the PagedAttention paper. Oh, yeah. So basically, I think it kind of started in 2022 when Meta released the OPT model as open source. I'm not actually sure how many people remember the model nowadays, but it was one of the first open-weight large language models to reproduce GPT-3. And our lab created a demo service to run the model and to, you know, demonstrate it for the broader audience. And yeah, it was working, but super slow. So I started a small side project to optimize that demo service. That was kind of the beginning.
Starting point is 00:04:19 And initially I was thinking that it may only take like a couple weeks to optimize the service and be done. But it turns out that it actually has a lot of open problems inside it. Because this auto-regressive language model is pretty different. Actually, it was pretty different from other traditional ML workloads. And it was kind of brand new, at least outside the frontier labs back in the day. So I started to work on it. And it became a research project and we wrote a paper. And it even became an open source project, a pretty well-defined open-source project,
Starting point is 00:04:55 as more and more people got interested in it. So, 2022, this is pre-GPT4, obviously. Yeah, pre-ChatGPT. Yeah, pre-ChatGPT. Yeah. And you're thinking like, oh, I'll just like work on this inference server. This should be a fairly straightforward problem. Like four years later, actually you're like doing more work instead of less.
Starting point is 00:05:14 Exactly, exactly. Yeah. Why did you think this is a meaningful problem to work on at the time? Because, like, I would say most people in the world at that time saw GPT3 as a curiosity in some sense. And OPT was kind of like a curiosity attached to a curiosity in a way. Like, what made you and your lab mate sort of excited to work on this back then? I think I also started from curiosity. I didn't really think it's the most important problem in the world back in the day.
Starting point is 00:05:39 I just wanted to have hands-on experience of how this actually works. I mean, I think I was also impressed by the size of the model. The largest OPT model has 175 billion parameters. And that was the largest model available. So it was kind of meaningful for me. Like it was pretty rewarding to work on such a large model. This reminds me of when I was growing up, you know, we would build computers. That was like the cool thing to do.
Starting point is 00:06:02 And each step change in memory capacity was such a big deal. I was like, oh my God, this one has four megabytes of RAM. Oh my God. This one has 512 megabytes of RAM. Looking back, it's silly. But at the time, maybe it's because we're like nerds, but you get emotionally excited about numbers getting bigger on these systems.
Starting point is 00:06:20 Right, right, right. Yeah, I think that was one of the main motivations, clearly. And so you started to say the technical problem is different for auto-regressive transformers compared to traditional machine learning. Do you mind explaining a little bit, you know, how that is? And even compare it to normal computing workloads for, you know, listeners who are engineers but may not be familiar with AI workloads. So basically compared to the traditional workload, you know, the clear difference is definitely GPUs, right? Now all of the compute, or most of the compute, is happening on the GPU.
Starting point is 00:06:53 And we have to optimize for the GPU, which typically had less memory than the CPU, at least back in the day. Now GPUs have much larger memory, but typically still much smaller memory than the CPU, maybe. And, you know, all the computation happens on the GPU. So you have to write programs in a different language and with a different type of parallelism in mind. Yeah. So that's kind of a fundamental difference between the traditional compute-heavy workload and the deep learning workload, I would say.
Starting point is 00:07:25 But within the deep learning workload, there's actually still a huge difference between the traditional deep learning workload and large language model inference. So for the traditional workload, I think the biggest characteristic is that it is pretty static, which means, for example, for image models,
Starting point is 00:07:44 like back in the day, like CNNs, what people do is, you know, we may have several images with different sizes. Then what we do is we resize them or crop them into the same size, and then we batch them, and then we put the batch through the model to run the inference at once. And this is basically, yeah, because of this resizing and cropping, they all kind of, at the end,
Starting point is 00:08:08 they're kind of compressed into the same size tensor, and that actually makes things much simpler for the GPU to handle, right? All the shapes are pretty regular, static, and well-defined. But for a large language model, if you think about it, they're pretty dynamic. Your prompt can be either like hello, like a single word, or your prompt can be a bunch of documents spanning hundreds of pages. And this kind of dynamism exists inherently in the language model. And this puts things in a whole different world.
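To put a rough number on the contrast Woosuk draws here, the sketch below pads every prompt in a batch to the longest one, the way resized images share one tensor shape, and measures how many token slots are wasted. The function name and the example lengths are illustrative assumptions, not figures from the conversation.

```python
def padding_waste(prompt_lengths):
    """Fraction of token slots wasted if every prompt in a batch
    is padded to the length of the longest prompt."""
    longest = max(prompt_lengths)
    total_slots = longest * len(prompt_lengths)
    useful_slots = sum(prompt_lengths)
    return 1 - useful_slots / total_slots


# Image-style batch: everything was resized to the same shape, so nothing is wasted.
uniform_waste = padding_waste([224, 224, 224, 224])

# LLM-style batch: "hello" next to a document spanning hundreds of pages.
mixed_waste = padding_waste([1, 12, 300, 50_000])

print(f"uniform batch waste: {uniform_waste:.1%}")
print(f"mixed batch waste:   {mixed_waste:.1%}")
```

With lengths this uneven, most of a padded batch would be filler, which is one way to see why static batching does not carry over to language models.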
Starting point is 00:08:44 We have to handle this dynamism as a first-class citizen. And, yeah, back in the day, people didn't have a clear idea about how to handle it. And, yeah, fortunately, we were one of the first to see the problem. That's very interesting. So regularizing a batch of inputs was, it sounds like, one of the first problems that you had to solve. It's actually more about scheduling and memory management as well. Yeah, yeah.
Starting point is 00:09:13 Yeah. Yeah. So the problem we were solving before, in all the serving systems, was about what we call micro-batching, first to leverage the CPU's fundamental vectorization in the early days before LLMs. And then on early GPUs, for vision models like ResNet, it was all about micro-batching. You put together four requests that arrive around the same time. But the change in the LLM world is you always have requests that are continuously filling and coming
Starting point is 00:09:41 in, and then each request looks different. You just cannot really normalize them. So that's why you have to have a notion of a step within the LLM engine to process one token across all the requests at the same time, regardless of each request having different input lengths and output lengths. The output is also non-deterministic. The language model itself will decide when it stops.
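The "step" Simon describes can be sketched as a toy scheduling loop: on every step, newly arrived requests are admitted into the running batch, each running request produces one token, and finished requests leave immediately. This is a minimal illustration under stated assumptions, not vLLM's scheduler: the names `Request` and `run_engine` are hypothetical, prompt processing is ignored, and `max_new_tokens` stands in for the model deciding on its own when to stop.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    max_new_tokens: int  # stand-in for "the model decides when it stops"
    generated: int = 0


def run_engine(arrivals, max_batch=4):
    """arrivals maps a step number to the requests arriving at that step.
    Returns a dict mapping request id to the step at which it finished."""
    waiting = deque()
    running = []
    finished = {}
    pending = sum(len(reqs) for reqs in arrivals.values())
    step = 0
    while pending or waiting or running:
        # New requests can show up at any step, not only at batch boundaries.
        for req in arrivals.get(step, []):
            waiting.append(req)
            pending -= 1
        # Fill free batch slots from the waiting queue.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One step: every running request produces exactly one token.
        still_running = []
        for req in running:
            req.generated += 1
            if req.generated >= req.max_new_tokens:
                finished[req.rid] = step  # leaves the batch right away
            else:
                still_running.append(req)
        running = still_running
        step += 1
    return finished


finish_steps = run_engine(
    {0: [Request(0, 3), Request(1, 1)], 1: [Request(2, 2)]},
    max_batch=2,
)
print(finish_steps)
```

In this run, request 1 finishes after a single step and frees its slot, so request 2 can join at step 1 instead of waiting for request 0 to complete, which is the behavior a fixed micro-batch cannot offer.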
Starting point is 00:10:04 In the traditional sense of other machine learning serving, it works very much like clockwork. And here it is very stochastic, it's always flowing, it is always continuous. That's why scheduling is the first problem to solve. And then memory management, that's where PagedAttention comes in, is the second problem to solve. So when did you get involved in the project, Simon?
Starting point is 00:10:25 Well, I got involved around 2023. First, Woosuk issued a call in the Sky Lab Slack channel to say, hey, we need someone to work with us on this PagedAttention paper and kernel. Actually, surprisingly, I was on spring break, and I was like, look, someone else can do this. And then I just played with GPT for the entire week. So I just ended up playing with prompt engineering. So I actually didn't end up joining Woosuk.
Starting point is 00:10:50 And so this is what a vacation looks like in Ion Stoica's lab, playing with models for a week. And he's playing with kernels. Yeah, exactly. So he's playing with kernels. I was trying to build more prompt engineering and explore different kinds of early agentic workflows. And then over the summer, and especially around August and September,
Starting point is 00:11:09 we really got to work together. Actually, this is where you come in. We got to work together on our very first vLLM meetup, at a16z, where I had the experience of managing an open source project before, and was also deeply interested in actually building a serving platform and making it a fully open source project. And this is where I started to get involved, wrote my first lines of code,
Starting point is 00:11:35 and sort of built up the CI system, and built out the performance benchmarking systems, and have very much worked with Woosuk ever since. I had forgotten about that. So this was the very first vLLM meetup, right? Yeah, it was in this office. In this office. On the exact floor.
Starting point is 00:11:51 I think we were anticipating just 10, 20, maybe 50 people showing up, and then the registration was way over the anticipated capacity. People are extremely interested in this technology. I remember that very well because we run events here ourselves. And it's always very hard to get people to show up. We're always scrambling. And instead, I got a call from our security team saying, too many people have been approved for this vLLM meetup. We need to scale it back. This isn't safe. I'm like, oh, okay. Probably don't tell. I don't think we
Starting point is 00:12:23 ever scaled it back. So don't tell security. It was quite crowded. The pizza ran out in like the first 10 minutes. But this is a big deal, right? Because this is not like a consumer app, right, that you were building. This is pulling from systems engineers, right? For the most part, people who want to learn about how to serve LLMs and contribute to it. So it's actually a big deal to get, I think, so much interest from such a narrow, sophisticated group of people who don't like meeting other humans in real life that often either, you know, at least speaking for myself. So can you talk a little bit more about the community behind vLLM?
Starting point is 00:12:54 Like, how big is it now? How did it come together? And how do you guys manage it as it's gotten big? Yeah. So in the beginning, of course, it was just a few grad students working on it. And then over time, we started developing very much an open kind of mindset. So as of now, we're looking at 50 or more regular full-time contributors
Starting point is 00:13:18 who open up GitHub every single day to work on vLLM, and well past 2,000 contributors on GitHub, one of the fastest growing top open-source projects as ranked by GitHub itself. And this is really a diverse community. So there are folks like Woosuk and I, sort of the team from UC Berkeley from grad student days, as well as Meta and Red Hat putting their weight behind this open source project, and then, of course, people who are making the models, the Mistral and Qwen teams. And of course anyone who's making open-weight models is participating in
Starting point is 00:13:56 our community. And then on the hardware side, NVIDIA, AMD, Google, AWS, Intel, they all have their own participation and are able to support the ecosystem, so everyone using vLLM has the ability to choose among different silicon for accelerated computing. That's very interesting, though, which I think is a property that many successful open source projects have, which is that people aren't all contributing for the same reason. Some people, I'm sure, just love the technology. But it sounds like you're saying the model providers actually have incentives to contribute to the project because they want their models to run well.
Starting point is 00:14:29 The silicon providers want it to run well on their silicon. The infra providers want to have first dibs on running it so they can sell infra, that kind of thing. Yeah, this is kind of a classic case of solving the M-times-N problem, so that as a model provider, you don't need to talk to everybody. And as a hardware provider, you can just go into this one system, and then magically, you'll work for all the models out there in the world. And then for applications that are using vLLM, as well as infrastructure building with vLLM, having a common ground where everybody can participate and then innovate together
Starting point is 00:15:02 is way easier and cheaper, in fact, in the end to deploy. What's your philosophy for managing a pool of contributors this large? Do you tell them what to do? Do they choose themselves? How do you maintain high code quality? It's a constant sort of iteration, month over month, year after year. So for this, I have to go back to my previous open source project. I was working on a project called Ray and then later at Anyscale, where I learned this community-driven approach: have clear requirements, have clear rules, have clear milestones being set. So we kind of try to borrow that, but also really study the really successful open source projects out there. I went all the way back to NNX and then to study Kubernetes,
Starting point is 00:15:49 study Postgres. How are these communities operating together? So in vLLM we have kind of a special model: like any normal engineering organization, we set clear team scope, but also clear objectives, results, and milestones for the different technical features we want to push forward and build. So this is where we set forward our vision every quarter.
Starting point is 00:16:14 But we also invite the community to contribute. So we're saying, great, we're working on these. We also need help on these items that we don't have anyone actively working on. If you are brand new and want to engage with us or engage with the community, here's what you can work on. And additionally, we keep an extremely open mind to all the GitHub pull requests that people just open up, where we're asking, oh, is this a good request? Is this a good feature?
Starting point is 00:16:39 And then there's also a request for comments process. So it's kind of a blend of all the lessons learned from previous open source projects. And then code quality-wise, code reviews, but also a lot of constant refactoring iterations. Yeah, yeah. I do a lot of refactoring, like, every six months, kind of, yeah. And actually, one thing to add is, you know,
Starting point is 00:17:00 we do in-person meetups, you know, like every two months, and we're expanding globally, actually, sometimes in Europe, sometimes in other places in Asia. Yeah. And yeah, from the first meetup at a16z, we learned that it's actually super, super useful to meet, you know, those collaborators and users in person. And yeah, we are continuing to do that. It's funny.
Starting point is 00:17:25 It's another one of these lessons where, you know, as Silicon Valley engineers, we've gotten so high up the abstraction stack that we're relearning, you know, lessons from a thousand years ago, saying, oh, it turns out in-person communication is high bandwidth and doesn't suffer from consistency problems. So around the time you guys did that first meetup, we also made grant funding to the project through the academic lab. I think it was a small amount of money, but it was actually the very first open source grant that we made. So it's super, you know, just fun and kind of gratifying for us to see that the money was actually put to good use and the project grew massively. And then we even had a chance
Starting point is 00:18:02 to invest in the related company later. However, I did hear a rumor that at the time that we made the grant funding, you guys put a portion of the money into Nvidia stock. Can you confirm or deny? No, I didn't. Not him. So it was someone else on the recipient list. So you probably turned our tiny grant into 10 times as much money before it was.
Starting point is 00:18:22 Oh, so, on the funding for vLLM. A lot of the funding for vLLM is what we set aside for project development, testing, and everything around operating this project. And, you know, we're actually super grateful for the first grant. It actually kicked off a culture, and nowadays you could even say a tradition, of people really opening up to sponsor open source projects in a quite significant way. Because running vLLM, our CI bill, for example, is more than 100K a month.
Starting point is 00:18:56 That could be tiny for some folks. And it's growing over time. We're at a burn of a million dollars a year. For an academic project, it's actually very... Yeah, because we want to make sure every single commit is well tested. And then this is something that people are going to deploy on not thousands, but potentially millions of GPUs across the world in different environments. So we want to make sure it's well tested.
Starting point is 00:19:22 It is reliable. And this infrastructure right now all comes from contributions and sponsorships, from everybody chipping in to help on this project. And now, of course, we also run meetups, and sometimes expenses associated with meetups directly leverage the grants that you all provided. Yeah, I mean, it makes sense. You know, for us and for other corporate sponsors of vLLM, it, you know, it benefits the whole ecosystem, right?
Starting point is 00:19:52 So I think it makes a lot of sense. Let's talk more about the technical aspects of the problem, if that's okay with you guys. Do you mind starting by just defining exactly what an inference server or an inference engine is? Sure. So an inference engine takes an already-trained model.
Starting point is 00:20:10 So this can be a very small model like Qwen 1B. It could be a very big model like DeepSeek or Kimi K2. It runs it on an accelerated computing device. And its job is to fully utilize the computing device to be able to generate text and images and videos, essentially, but this all gets tokenized into individual tokens. So the goal of an inference engine is to run the model at high speed and make sure that we can produce maximum output at the highest
Starting point is 00:20:47 efficiency. And just from a high level, can you explain some of the architecture, how a typical inference engine works? What are just the few most important components that people would be interested to learn more about? Maybe we go through the life of a request. Like if I say hello, what would happen in vLLM? Yeah.
Starting point is 00:21:03 Yeah, so basically there's a kind of traditional API server. It, you know, gets the request, and once the model generates output, it streams back the tokens one by one. Yeah, so there's definitely a traditional API server layer. And inside it, we typically have something called a tokenizer, right, to transform the inputs into tokens, basically some integers, a list of integers
Starting point is 00:21:31 that the language model can consume. And then there is what we call the engine, and that includes a scheduler, which decides how to batch the incoming requests. And we have a memory manager to manage something called the KV cache, which is kind of the core part of the transformer for LLMs.
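As a rough picture of what a KV cache memory manager might do, the sketch below hands out fixed-size blocks on demand and keeps a per-request block table, loosely in the spirit of the PagedAttention idea mentioned earlier in the conversation. It is a simplified illustration under assumptions, not vLLM's actual implementation; the class name `PagedKVCache`, its methods, and the block size are all hypothetical.

```python
class PagedKVCache:
    """Toy block-based KV cache bookkeeping (no real tensors involved)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, rid):
        """Reserve room for one more token of request `rid`."""
        count = self.num_tokens.get(rid, 0)
        table = self.block_tables.setdefault(rid, [])
        # Allocate a new block only when the existing ones are full.
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[rid] = count + 1

    def free(self, rid):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(rid, []))
        self.num_tokens.pop(rid, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")   # 3 tokens need 2 blocks of size 2
cache.append_token("b")       # 1 token needs 1 block
print(len(cache.block_tables["a"]), len(cache.free_blocks))
cache.free("a")
print(len(cache.free_blocks))
```

Because blocks are allocated only as a request grows, memory does not have to be reserved up front for an output length that nobody knows in advance.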
Starting point is 00:21:51 And we definitely have some kind of worker. This is a very generic term, but it basically initializes the model, runs the model, gets the output, and does all the pre-processing for the input and, you know, post-processing for the model output. Yeah, so, yeah, that's basically it. I mean, in a sense, it's not like a crazy new architecture,
Starting point is 00:22:14 but each one is basically highly optimized and specialized for this LLM inference workflow. Do you think running inference is getting easier or harder over time? Yeah, definitely, I think it is getting much more difficult over time. Actually, honestly, maybe one and a half years ago, I wasn't thinking inference was a hard problem at all, to be very honest. But now things have changed. The trend has changed so far. So I think there are kind of three factors.
Starting point is 00:22:47 One is scale. Another is diversity. And the last one is agents. So for scale, you know, the models are definitely getting larger. And, you know, right now we have Kimi K2 with more than a trillion parameters. But I think we will see multi-trillion-parameter open source models this year. And I think that's still clearly a trend, that people will be training larger models. And, you know, it's definitely much more challenging to deal with such a model compared to, you know, the early days of LLMs, where we only dealt with small Llama models.
Starting point is 00:23:28 And with larger models, presumably you need more nodes working concurrently, and you have more memory to manage that may or may not fit in each, you know, chip's available memory. Could you describe some of the challenges from scale? Yeah, for these kinds of large models, we definitely need to shard, you know, distribute the model across multiple GPUs, multiple nodes, right? And yeah, then there's definitely a problem of how to shard, how to distribute this model, right? There are actually many dimensions we can use to shard the model,
Starting point is 00:23:59 and they have different trade-offs. And, yeah, trade-offs, for example, in terms of how much communication we have to pay to shard the model in this way. And also, there's a trade-off in terms of load balancing. If I shard along this dimension, then how significant is the load imbalance? So these all need to be taken into account in the final performance estimation to get the best performance. And yeah, it could be becoming more and more
Starting point is 00:24:28 of a problem as the models get larger. And what about just cluster scale? I mean, I think, Simon, how many nodes is vLLM running on at any given time? Right now, we're looking at, and this is through a very small sample of our usage statistics that we use to figure out
Starting point is 00:24:49 what features to deprecate. Just literally from this one signal, we're looking at 400K to 500K GPUs running vLLM 24/7. And that's quite a big scale, thinking about the global deployment footprint of GPUs, and we definitely believe there's a lot more out there. And of course, this is a wide diversity of different kinds of GPUs, GPU architectures, as well as model architectures being deployed. We're not seeing a one-size-fits-all,
Starting point is 00:25:16 where people are using it for just one singular use case. I see. And this is sort of your point. Your second point was about diversity making inference a harder problem over time. Yeah, the chip diversity, hardware diversity, is definitely one factor. And also models are getting diverse, you know. If you think about, for example, NVIDIA, like a year ago, I think they only released a few series of open source models. But now they're releasing many open source models like every month in different domains, right? So some on video,
Starting point is 00:25:46 some on robotics, some on language. And yeah, this kind of open sourcing trend is expanding, and people are training many different kinds of models in many different domains and releasing them like every month. So there's model diversity, and even just for text models, they're all transformers, but their detailed architectures are still very diverse, and we see they're even diverging. Like say, DeepSeek 3.2 was using something called sparse attention, but say, Qwen and Kimi, they're exploring linear attention, which is a different kind of attention mechanism, and they have different ways to manage the memory. So, yeah, this model
Starting point is 00:26:33 architecture divergence is also getting more significant. And so is it up to you, you know, meaning vLLM, to implement all of these, you know, implement sparse attention, for instance, so that it's available for the models to use? Yeah, definitely. We basically leverage the open source community. Like, you know, because we collaborate with these model vendors, we often get help from them. They basically provide some kernels or at least reference implementations of, you know, these new kinds of operations.
Starting point is 00:27:05 And yeah, our job is often basically to leverage this collaboration and make it more mature and also available for more diverse environments. I remember early on in open source models, there was some standardization. Like everyone was kind of using Llama. I think everyone was using sort of the same tokenizer and the same input format and, you know, end-of-stream token and stuff like that. Is that still the case, or is it different for each provider now? It is, yeah, it has diverged quite a bit over the last few years, maybe the last couple of years. Yeah. One thing is that, yeah, the model architecture itself has changed a lot, you know,
Starting point is 00:27:45 especially on the attention side. And also even for input and output processing, because different labs have their own ways of, you know, how to form the conversation and how to form the tool calls, for example, for their own models. So, yeah, this has been diverging quite a bit for the last couple of years. I see.
Starting point is 00:28:07 Okay. So scale of models, diversity of models and hardware deployment scenarios. And then agents were the third thing you mentioned as sort of making it harder over time. Yeah, yeah. You know, for agents, we definitely need a kind of different, I mean, beyond just the inference engine,
Starting point is 00:28:23 we also need to set up a whole new environment, a whole new infrastructure, to support all the tool calling and all the multi-agent things. That part is becoming a new, emerging challenge for inference as a whole. Do you think this means there will be more state managed in the
Starting point is 00:28:44 inference layer over time? Before, the paradigm was text in, text out, just a single request and response. But as we evolve into the year and the decade of agents, we're seeing multi-turn conversations turning into hundreds and thousands of turns. And these turns also involve external tool use, like interacting with a sandbox, performing web searches, or running Python scripts or any programming language, so you have this kind of long iterative process where the LLM is involved but external environment interaction is involved too. And this really kicked off a huge wave of co-optimizing agentic architecture with inference architecture. So just to give an example,
Starting point is 00:29:36 it is very important for vLLM to understand whether or not the conversation is still happening. If the conversation is no longer happening, we can remove the KV cache, that is, the persistent state associated with each text completion stream. But in agentic use cases, you actually don't know whether the agent thinks it is finished or is still waiting on an interaction. Previously, the interaction was just a human typing in a text box. But now it is an external environment interaction. It could be one second for a single script to finish. It could be 10 seconds for a search or a complex analysis to finish. And it could also be minutes or hours if there are humans in the loop.
Starting point is 00:30:21 Now, with that uncertainty, we actually don't even know when the request is going to come back. And the uniformity of the cache access pattern and eviction pattern got pretty disrupted by the new paradigm. I see. And so you have to be much smarter about how you manage the cache. As one example of that, yeah. Gotcha, which is one of the unsolvable problems in computer science: cache invalidation. Yeah, exactly.
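[Editor's illustration] The eviction problem described above can be made concrete with a toy sketch. This is not vLLM's implementation or API. Every name here (Session, HintedKVCache, expected_resume) is hypothetical, and the policy is an assumption chosen for illustration: when the cache is full, evict the session that is expected to stay idle the longest, using a rough guess of when each agent will come back (about a second for a script, about ten seconds for a search, possibly an hour for a human in the loop).

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical record for one conversation's cached KV state."""
    session_id: str
    blocks: int              # KV cache blocks held by this session
    last_used: float         # time of the last request, in seconds
    expected_resume: float   # rough guess at when the next request arrives


class HintedKVCache:
    """Toy cache that evicts the session expected to stay idle the longest."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.sessions: dict[str, Session] = {}

    def used(self) -> int:
        return sum(s.blocks for s in self.sessions.values())

    def put(self, session: Session) -> list[str]:
        """Insert or refresh a session; return the ids evicted to make room."""
        self.sessions.pop(session.session_id, None)
        evicted = []
        while self.sessions and self.used() + session.blocks > self.capacity:
            # Farthest expected resume goes first; older last_used breaks ties.
            victim = max(
                self.sessions.values(),
                key=lambda s: (s.expected_resume, -s.last_used),
            )
            del self.sessions[victim.session_id]
            evicted.append(victim.session_id)
        self.sessions[session.session_id] = session
        return evicted


cache = HintedKVCache(capacity_blocks=100)
now = 0.0
cache.put(Session("script", 40, now, now + 1.0))      # script: back in ~1 s
cache.put(Session("search", 40, now, now + 10.0))     # search: back in ~10 s
cache.put(Session("human", 20, now, now + 3600.0))    # human: maybe an hour
evicted = cache.put(Session("new", 30, now, now + 1.0))
print(evicted)  # ['human', 'search']
```

In this sketch the human-in-the-loop session is dropped first and the one-second script session survives. The hard part the speakers point to is that real resume times are unknown, so a wrong guess forces the evicted state to be recomputed when the request returns.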
Starting point is 00:30:50 So I can see how that would get harder over time. I think I know the answer to this, but are you guys big believers in open source AI compared to closed source? And can you just explain how you think about that? We're definitely big believers in open source. What we believe is that diversity will triumph over a single anything. So that means we believe in diversity in models,
Starting point is 00:31:15 diversity in chip architecture. Fundamentally, this is because the world is complex. For any application, you're going to need to find and tailor the right model architecture to the right chip architecture for that exact use case. And the best way to promote and improve diversity is through open source. Because with open source, everybody knows what everybody else is up to and is able to make their own opinionated take based on the common ground.
Starting point is 00:31:41 And finally, if you look at computer science, at operating systems, cluster managers, databases, every single systems field gets better when it starts to have a common standard and everybody deviates a little bit and innovates on top of each other, versus following a single line of trend that is proprietary and controlled by a single source. I see. That's very interesting. So you're almost saying OpenAI will tune their stack very tightly for their use case, which is ChatGPT or whatever other apps they're running. For an enterprise or another tech company,
Starting point is 00:32:15 if I want that same level of tuning, I can't just use off-the-shelf closed-source models, because I don't have control of the whole stack. And the different participants in the stack kind of aren't paying attention. Yeah, of course, one part is data. One part is the model architecture itself, which will impact the performance. And then just on the model
Starting point is 00:32:34 architecture itself, right? How smart do you want the model to be? Do you want the model to be able to handle million-token contexts, or is a shorter context totally fine? And then you also need to specialize that model to your exact compute architecture. What chip are you using? For example, for Nvidia, the model you design for an H100 chip is very different from one for a B200 chip, and that is very different again from one for a GB200 NVL72 system. And compared to, for example, the model architecture you design for TPU, that is also drastically different. And then using it for vision models, video generation, and for reasoning, math, coding, in the end, when we look at the vertical stack integration,
Starting point is 00:33:19 we're like, wow, they're so different from each other. That makes sense. Can you share any stories about live vLLM deployments that you thought were particularly interesting or important? I have a few. One is, I think around 2024, we learned that Amazon is running vLLM to power their Rufus assistant bot, which was really surprising to all of us. Of course, we believe vLLM can be deployed at scale, but seeing it at this massive scale, a global
Starting point is 00:33:54 e-commerce company deploying this as a front-page feature, means that when anybody opens the Amazon app and clicks the bot suggestion or even enters a search query, it is going through vLLM. And this was kind of the first magical experience, in a way. One of the first experiences was, wow, my purchase is going through vLLM right now. It's kind of exciting, but also scary. You were, like, PhD students at the time. And across not just Amazon and LinkedIn but every major deployment of vLLM,
Starting point is 00:34:26 we're surprised to find out they're always the first adopters of cutting-edge features. One example of a vLLM deployment within Character AI was when we first made n-gram speculation for speculative decoding available as just a single PR, a pull request in vLLM, not even merged. And while we were still iterating on that feature, I heard from someone at Character AI saying, oh, actually, we already rolled it out to hundreds of GPUs at scale, given just your first iteration of this feature. So pretty much everybody is staying on the cutting edge
Starting point is 00:35:00 of vLLM and we're quite excited about that. Yeah. Okay, should we talk about the company then? Inferact. What is Inferact and why did you guys decide to start the company? So Inferact was created by the creators and maintainers of the vLLM project. Our goal is to make vLLM the world's inference engine, really push the capabilities on the open source front, and then build the universal inference layer.
Starting point is 00:35:29 That means we'll have the runtime to power any new model on new hardware for new applications, be able to tailor that to extreme efficiency, and support all the AI workloads going forward. And implicit in what you just said is that you're devoting a lot of resources, I think, to the open source project. Is that right? And can you expand on that? Yeah, I fundamentally believe that open source, especially how vLLM itself is structured, is critical to
Starting point is 00:35:59 the AI infrastructure. And what we want to do with Inferact is to support, maintain, steward, and push forward the open source ecosystem. It is only when vLLM becomes a standard and vLLM helps everybody achieve what they need to do that our company, in a sense, has the right meaning and is able to support everybody around it. So open source is definitely number one and, in fact, sometimes the only priority of our company right now. You're not supposed to tell your investors that, by the way. We do believe that the open source project is also kind of a secret weapon, in the sense that with this community all working together on open source, we have execution beyond what any single entity can have. This is the thing we heard over and over again,
Starting point is 00:36:51 that people just tell us, we just cannot keep up with vLLM. That's why we're using vLLM. We have our internal team. We have our internal fork. We have our internal inference engine. But open source moves so fast that the only way to stay ahead is adopting it. And that's what we want to make happen. And in fact, this is exactly why we're staying all in on open source.
Starting point is 00:37:13 That's awesome. We mentioned Ion Stoica before, obviously one of the founders of Databricks. He was, I think, PhD advisor to both of you at Berkeley. And he's going to be involved in Inferact too. Can you talk a little bit about how he's going to be involved in this company? And even more importantly, what have you guys learned from him, as his students, about startups and distributed systems and all this stuff? Sure. Yeah.
Starting point is 00:37:35 Yeah, you're exactly right. Ion was advisor to both of us. I have actually worked with Ion since 2017, since I was an undergrad working on my first open source project for serving, and then worked with him at Anyscale on my second open source project for serving. You're just addicted to, like, Berkeley-based open source AI serving companies. So in this company and in vLLM, Ion is quite involved. As a company, he will be a co-founder. And as an open source project, he has been advising this project since its inception. Ion knows open source projects, academic projects, industry research trends, and all of that.
Starting point is 00:38:13 So from working together, Ion really helps us clearly understand both the lessons learned about bringing open source through the final miles of adoption in companies and enterprises, and what is actually happening in the research world. The Sky Computing Lab over the last few years has produced amazing infrastructure and new research ideas, and Ion continues to explore new frontiers on that front, and we're quite excited to hear about that and also innovate on the open source together. Yeah, and he also helps with recruiting a lot. He is involved in all of our hiring processes.
Starting point is 00:38:55 He basically teaches us how to identify talent and where to find talent. These are all amazingly helpful. So on that topic, what are some of the big problems you need to solve now, and what type of people are you hiring to help? Definitely, inference at scale is one of the biggest challenges, I think, not only for us but in the field overall. So we are trying to hire more very experienced ML infra engineers to work on, for example, what would be the best way to utilize a GB200 or GB300 NVL72 rack entirely for a giant open source model.
Starting point is 00:39:39 Still, I think it's an open problem. There are definitely some endeavors in academia and industry, but I think there is some room for improvement. So, yeah, that's some of our focus at the moment. Here's my pitch from a computer science point of view. It's pretty rare that people ask me this question. That is, if you're working at a vertically integrated company that has an end product, let's say for chatbots or assistants, you are working on a vertical slice of the problem.
Starting point is 00:40:12 At Inferact, you will be working on an abstraction, a horizontal layer. And this is similar to operating systems, databases, and the different kinds of abstractions that people have built over the years. Operating systems abstracted CPU and memory; databases and file systems abstracted storage devices and networking. For accelerated computing, there's a brand new physical device, and inference and vLLM abstract a large part of it for inference-specific work. Of course, there's training, but our singular focus is on inference,
Starting point is 00:40:48 and this necessitates a layer, a software layer, that abstracts away GPUs and other accelerated computing devices for models. And this is as important, from my point of view, as the abstractions the community built for OSes and for databases, which are both fields we were really passionate about when we were PhD students, too. So that's why MLSys is fundamentally a new field of systems research and systems deployment. So you, here at Inferact, will be working on this layer that is not a vertical slice but a fundamental runtime impacting all the future generations of software that will run on accelerated computing devices. And your work will span from working with different models and working with different applications, to understanding the pros and cons of different chips,
Starting point is 00:41:45 as well as their whole integrated data center systems, to be able to figure out, oh, actually, for these, we should build the abstraction in this way. And we'll constantly remove abstractions, break abstractions, and build them over and over again, just like how operating systems got innovated over time and databases got innovated over time, with the new information we have in hand. So you will come here to have that
Starting point is 00:42:09 constant exercise of building an actual widely deployed production system that will be at the frontier of inference. And this is what you call the universal inference layer. Yeah. It's purposely vague in a way, but what we really focus on is going from PagedAttention, going from the serving system, to the whole runtime you need for intelligence. Woosuk, Simon, thank you so much for being here today. Thrilled to have you on the podcast, of course.
Starting point is 00:42:43 And we're thrilled to be working together on the company. It feels like it's been a few years that we've already been working together. But yeah, great to have you here. And congratulations on getting off to a great start. Thank you for having us. Yeah. Thank you.
Starting point is 00:42:57 Thanks for listening to the a16z podcast. If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com/a16z. We've got more great conversations coming your way. See you next time. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z
Starting point is 00:43:25 and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
