The a16z Show - Inferact: Building the Infrastructure That Runs Modern AI
Episode Date: January 22, 2026Inferact is a new AI infrastructure company founded by the creators and core maintainers of vLLM. Its mission is to build a universal, open-source inference layer that makes large AI models faster, ch...eaper, and more reliable to run across any hardware, model architecture, or deployment environment. Together, they broke down how modern AI models are actually run in production, why “inference” has quietly become one of the hardest problems in AI infrastructure, and how the open-source project vLLM emerged to solve it. The conversation also looked at why the vLLM team started Inferact and their vision for a universal inference layer that can run any model, on any chip, efficiently.Follow Matt Bornstein on X: https://twitter.com/BornsteinMattFollow Simon Mo on X: https://twitter.com/simon_mo_Follow Woosuk Kwon on X: https://twitter.com/woosuk_kFollow vLLM on X: https://twitter.com/vllm_project Stay Updated:Find a16z on YouTube: YouTubeFind a16z on XFind a16z on LinkedInListen to the a16z Show on SpotifyListen to the a16z Show on Apple PodcastsFollow our host: https://twitter.com/eriktorenberg Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.
Transcript
Discussion (0)
Our goal is to make VOLM the world's inference engine,
really push the capabilities on the open source front,
and then build the universal inference layer.
That means we'll have the runtime to power any new model
on new hardware for new application,
be able to tailor that to extreme efficiency
and support all the AI workflow going forward.
I fundamentally believe that open source,
especially how VLM itself is structured,
is critical to the AI infrastructure in the world.
And what we want to do with Infraq
is to support, maintain, steward,
and push forward the open source ecosystem.
It is only that VOLM win.
VLM becomes a standard
and VLM help everybody to achieve what they need to do,
then our company in the sense have the right meaning
and to be able to support everybody around it.
What if the hardest problem in artificial intelligence
isn't training smarter models, but simply keeping them running.
For most of the history of computing, once a system was built, the hard part was over.
He wrote the program, pressed run, and the machine behaved predictably.
Even early machine learning followed that pattern.
Inputs were standardized, workloads were regular, the computer did its job and stopped.
Large language models quietly broke that assumption.
Every request is different.
Prompts can be a sentence or an entire archive.
Outputs can end instantly or stretch on indefinitely.
Thousands of users can arrive at once, each making incompatible demands on the same hardware.
And all of this has to happen in real time on GPUs that were never designed for this kind of unpredictability.
Over the last few years, this problem has moved from obscure to essential.
As models have grown larger, more diverse, and more deeply embedded into products,
the challenge of running AI systems has started to rival the challenge of building them.
That's where the tension lies.
A public story of AI progress is about better models and bigger breakthroughs.
But underneath it is a quieter systems problem.
How do you schedule chaotic requests efficiently?
How do you manage memory when you don't know when a conversation is actually finished?
And what changes when AI systems stop behaving like single-turned tools
and start acting like agents that think, pause, and interact with the world over time?
This episode focuses on the hidden layer.
We examine why inference, the act of running trained AI models,
has become one of the most complex and important problems in modern computing
and why open source infrastructure is increasingly central to solving it.
Matt Bornstein, general partner at Andresen Horowitz,
is joined by Simon Moe and Rousa Kwan,
co-founders of Infraact and creators of the open source inference engine VLM.
This is a conversation about the infrastructure beneath AI
and why it may matter more than the models themselves.
We are here today with Simon Moe and Wusa Kwan,
lead contributors on the VLM open source project,
and co-founders of Infraact,
a new AI inference company.
Super excited to have you guys on the show today.
Thank you.
Thank you so much for coming.
We're going to talk a little bit about BLLM, the Open Source Project.
We're going to talk a lot about inference and what inference technology really is.
And then we'll talk a little bit about infract the new company.
So to start, can you talk a little bit about where BLLM came from?
What is it?
How did you start it?
And why is it such an exciting project?
Thank you for having us.
VLM project started from actually Wistook's a prototype project at UC Berkeley
doing his PhDs and grow into today's open source project on GitHub for inference around time for everybody.
Maybe Wilson can talk a little bit about the page attention paper.
Oh, yeah.
So basically, I think it kind of started in 2022 when Mata released the OPT model as open source.
I'm not actually sure how many people actually like remember the model nowadays, but it was kind
of the one of the first like open weight larger language models to reproduce GPD3.
and our lab tried created a demo service to run the model and to you know demonstrate it for the broader audience and yeah like it was working but super slow so i started a small side project to optimize that demo service that was kind of at the beginning
And initially I was thinking that it may only take like a couple weeks to optimize the service and to when.
But it turns out that it actually has a lot of open problems inside the net.
Because this auto-regressive language model is pretty different.
Actually, it was pretty different from other traditional like ML workloads.
And it wasn't actually, it was kind of like a brand new.
And this like outside this Frontier Labs back in the day, I started to work on it.
And it became a research project and we wrote a paper.
And it even became like an open source project, pretty well-defined open-source project,
as more and more people got interested in it.
So, 2022, this is pre-GPT4, obviously.
Yeah, pre-ChatGPT.
Yeah, pre-ChatGPT.
Yeah.
And you're thinking like, oh, I'll just like work on this inference server.
This should be a fairly straightforward problem.
Like four years later, actually you're like doing more work instead of less.
Exactly, exactly.
Yeah.
Why did you think this is a meaningful problem to work on at the time?
Because, like, I would say most people in the world at that time saw GPT3 as a curiosity in some sense.
And OPT was kind of like a curiosity attached to a curiosity in a way.
Like, what made you and your lab mate sort of excited to work on this back then?
I think I also started from curiosity.
I didn't really think it's the most important problem in the world back in the day.
I just wanted to have a hands-on experience on how this actually works.
I mean, I think I'm also impressed by the size of the model.
The OPT largest model has 175 billion parameters.
And that was the largest model available.
So it's kind of like a meaningful for me.
Like it was kind of pretty rewarding to work on such a large model.
This reminds me of when like when I was like growing up, you know, we would build like computers.
That was like the cool thing to do.
And each step change in like memory capacity was such a big.
I was like, oh my God, this one has four megabytes of RAMs.
Oh my God.
This one has 512 megabytes of RAM.
Looking back, it's silly.
But at the time, that was actually like maybe it's because we're like nerds.
But like it gets, you get like emotionally excited about like.
numbers getting bigger on these systems.
Right, right, right. Yeah, I think that was like one of the main motivation clearly.
And so you started to say the sort of technical problem is different for auto-rogressive
transformers compared to traditional machine learning. Do you mind explaining a little bit,
you know, how that is? And even compare just to normal kind of computing workloads for,
you know, listeners who, you know, our engineers may not be familiar with AI workload.
So basically compared to the traditional workload, you know, the clear difference is definitely like
GPUs, right?
Now all the computers or kind of most of the computer happening on GPU.
And we have to optimize for the, which presumably have last memory than CPU, at least back in the day.
Now the GPUs are much, has much larger memory, but typically it has much smaller memory than CPU, maybe still.
And like, you know, like all the computations happens on GPU.
So you have to write program in a different language and a different type of parallelism in mind.
Yeah.
So that's kind of like a fundamental different.
from the traditional compute happy workload
versus deep learning workload, I would say.
But within the dim learning workload,
there's actually still a huge difference
between the kind of traditional
dim learning workload versus like larger language model inference.
So for traditional workload,
I think the biggest kind of characteristic is that
it is pretty static,
which means, for example, like for image models,
like back in the day, like CNNs,
what people do is, you know,
we may have several images with different sizes.
Then what we do is we resize them or crop them into the same size,
and then we batch them,
and then we put it to the model to run the inference at once.
And this is basically, yeah, because of this resizing and propping,
like they all kind of, at the end,
they're kind of compressing to the same size tensor,
and that actually makes things much simpler for the GPU to handle, right?
all the shapes are pretty regular, static, and it's kind of like well-defined.
But for a large language model, if you think about it, they're pretty dynamic.
Your prompt can be either like hello, like a single word, or your prompt can be like a bunch
of like documents spanning like hundreds of pages.
And this kind of like dynamism exists inherently in the language model.
And this makes things a whole like kind of in the different world.
We have to handle this dynamism as a first-class citizen.
And, yeah, back in the day, that was not like, people didn't have a clear idea about how to handle it.
And, yeah, fortunately, we were one of the first to solve a, to see the problem.
That's very interesting.
So kind of regularizing a batch of inputs was, it sounds like one of the first problems that you had to solve.
It's actually more about scheduling and memory management.
Yeah, yeah.
as well.
Yeah.
Yeah.
So the problem we're solving before in all the serving system is about just what we call
micro-batching to leverage first CPU's fundamental like vectorization in the early days before
LMs.
And then early GPU for vision models like Resnet is all about micro-batching.
You put together four requests together that arrive around the same time.
But the change in the LM world is you always have requests that continuously filling and coming
in.
and then each request looks differently.
You just cannot really normalize them.
So that's why you have to have a notion of a step within the LM engine
to process one token across all the requests at the same time,
regardless of each request having different kinds of input lens and output lens.
Nowpoo is also non-deterministic.
The language model itself will decide when does it stop.
Instead of in the traditional sense of other machine learning servings,
it's very much like work like a clockwork.
And here it is very sarcastic.
it's always flowing, it is always continuous.
That's why scheduling is the first problem to solve.
And then memory management, that's where page attention come about,
is the second problem to solve.
So when did you get involved in the project, Simon?
Well, I got involved in around 2020, 3.
I first, Wussuk issued a call in the Skylab Slack channel to say,
hey, we need someone to work with us on this page attention paper and kernel.
Actually, surprisingly, I was on spring break.
and I was like, look, someone else can do this.
And then we just play with GBT for the entire week.
So I would just end up just playing with prompt engineering.
So I actually didn't end up joining with Lusuk.
And so this is what a vacation looks like in Janstoyka's lab,
playing with models for a week.
And he's playing with kernels.
Yeah, exactly.
So he's playing with kernels.
I was trying to build more prompt engineering and explore like different kind of
early agentic workflow.
And then over the summer, and especially this is when around August and September,
and we really get to work together.
Actually, this is where you come in.
We get to work together on our very first VLM meetup, ACQNZ,
and where I had the experience of managing open source project before,
as well as deeply interested in actually building a serving platform
and into a fully open source project.
And this is where I start to get involved,
roll my first lines of code,
and sort of build up the CS system,
and built as a performance benchmarking systems,
and then really much work with Wooslok ever since.
I had forgotten about that.
So this was the very first VLM meetup, right?
Yeah, I was in this office.
In this office.
I'm the exact floor.
I think we are previously anticipating just 10, like 10, 20,
maybe 50 people showed up,
and then the registration was like exactly over the anticipated capacity.
People are extremely interested in this technology.
I remember that very well because we run events here
for ourselves. And it's always very hard to get people to show up. We're always scrambling. And instead,
I got a call from our security team saying, too many people have been approved for this field and meetup.
We need to scale it back. This isn't safe. I'm like, oh, okay. Probably don't tell. I don't think we
ever scaled it back. So don't tell the security. It was quite crowded. The piece I ran out,
like the first 10 minutes. But this is a big deal, right? Because this is not like a consumer app,
right, that you were building. This is pulling from systems engineers, right? For the most part, who want to
learn about how to serve LLMs and contribute to it.
So it's actually a big deal to get, I think, so much interest from such a kind of narrow,
sophisticated group of people who don't like meeting other humans in real life that often
either, you know, at least speaking for myself.
So can you talk a little bit more about the community behind VLM?
Like, how big is it now?
How did it come together?
And like, how do you guys manage it as it's gotten big?
Yeah.
So in the beginning, of course, it's just a few grad students working on it.
And then so, and by over time, we started to,
having this very much an open-minded, developing the open kind of mindset.
So as of now, we're looking at 50 or more regular full-time contributor
who open up GitHub every single day to work on VOLM,
way across 2,000 contributor bars on GitHub,
one of the fastest growing top open-source project ranked by GitHub itself.
And then this is really a diverse community.
So there is folks like Usuk and I are sort of the team from UCB,
Berkeley from grad student days and as well as meta and red hat pulling their way behind this open
source project and then as well as of course people who are not just make people who are making the
model bestroll and quen team and of course like anyone who's making open way model are participating in
our community and then on the model side invidia amd google a2s intel they're all having their
own participation and be able to support the ecosystem so everyone
in Vio, using VAL has the ability to choose about different SETICANS for accelerated computing.
That's very interesting, though, which I think is a property that many successful
open source projects have, which is that people aren't all contributing for the same reason.
Some people, I'm sure, just love the technology.
But it sounds like you're saying the model providers actually have incentives to contribute
to the project because they want their models to run well.
The silicon providers want it to run well in their silicon.
The infra providers want to have first divs on running it so they can sell infra, that kind of thing.
Yeah, this is kind of a classic worth solving the M-times-and-problem so that as a model provider,
you don't need to talk to everybody.
And as a hardware provider, you can just go into this one system, and then magically,
you'll work for all the models out there in the world.
And then for applications who are using VOLM as well as infrastructure, building with VOLM,
like having a common ground where everybody can participate in and then innovate together
is way easier and cheaper, in fact, in the end to deploy.
What's your philosophy for managing a pool of contributors this large? Do you tell them what to do? Do they choose themselves?
How do you maintain high code quality? It's a constant sort of iteration months over months, year after years.
So for this, I have to go back to my previous OpenSouth project, which I was working on a project called Ray and then later any scale, where we have this kind of, where I learned this a community-driven approach in a way that have a clear requirement, have a clear rule of
have a clear sort of milestone being set.
So we kind of try to borrow that,
but also really study this really successful open source project out there.
I went all the way back to NNX and then to study Kubernetes,
study Postgres.
How are these communities operating in together?
So in VOL and we had kind of a special model that we do,
like any normal engineering organization,
set clear team scope,
but also clear objective and results and milestones
with different kind of technologies, technical features we want to push forward and build.
So this is where we have set forward our vision every quarter.
And then, but also invites the community to contribute.
So we're saying, great, we're working on these.
We also need help on these items that we don't have anyone actively working on.
If you are brand new and want to engage with us or engage with the community,
here's what you can work on.
And additionally, we keep an extremely open mind to all the GitHub pool requests that people just opened up
that we're seeing, oh, is this a good request?
Is this a good feature?
And then as well as a request for common processes.
So kind of is a blend of all the lesson learned
from previously other open source project.
And then code quality-wise, code reviews,
but also a lot of constant refactoring iterations.
Yeah, yeah.
I do a lot of refactoring, like, every six months, kind of, yeah.
And actually, one thing to add is, you know,
like we do in-person meetups, you know,
like every two months and we're kind of expanding to globally actually like sometimes in the
Europe, sometimes in some other places in Asia.
Yeah.
And yeah, like we actually from the first META in A-NZ, we learned that it's actually super,
super useful to meet, you know, those like collaborators and, you know, users in person.
And yeah, we are continuing doing that.
It's funny.
It's another one of these lessons that like, you know, Silicon Valley engineers, like we've
gotten so kind of like, you know, high up the abstraction stack that we're.
like relearning, you know, lessons from a thousand years ago, saying, oh, it turns out in-person
communication is high bandwidth and doesn't suffer from consistency problems. So, so around the time
you guys did that first meetup, we also made grant funding to the project through the academic
lab. I think it was a small amount of money, but it was actually the very first open source
grant that we made. So it's super, you know, just like fun and kind of gratifying for us to see like
the money was actually put to good use and the project crew massively. And then we even had a chance
to invest in the related company later.
However, I did hear a rumor that at the time that we made the grant funding
that you guys put a portion of the money into Nvidia stock.
Can you confirm or deny?
No, I did it.
Not him.
So someone else in the recipient list.
So you probably turned our tiny grant into 10 times as much money before it was.
Oh, sort of sort of the funding for VOLM.
A lot of these funding for BLM is that we set aside for project development and sort of
project development, testing, and everything around operating this project.
And once you know we're actually super grateful for the first grant,
it's actually kicked off a culture.
And nowadays, you can get even a tradition,
for people really opened up to sponsor open source projects in a quite significant way.
Because running VOL, our CI bill, for example, is more than 100K amounts.
That could be tiny for some folks.
And it's like overgrowing over time.
This is we're at a burn of million dollar amounts and a year, a million dollar a year.
For an academic project, it's actually very...
Yeah, because we want to make sure every single commits is well tested.
And then this is something that people are going to deploy at not thousands,
but potentially millions of GPUs across the world in different environments.
So we want to make sure it's well tested.
It is reliable.
And then this requirement, this infrastructure, or right now all comes from contribution
and sponsorship and from everybody are chipping into help on this project.
And now, of course, we also run meetups, and sometimes expenses associated with meetups
are directly leveraging the sort of the grants that you all provide it.
Yeah, I mean, it makes sense.
You know, for us and for other corporate sponsors of the LLM, it, you know, it benefits the
whole ecosystem, right?
So I think it makes a lot of sense.
Let's talk more about the technical aspects of the problem, if that's okay with you
guys.
Do you mind to start just defining exactly what?
like an inference server or an inference engine is?
Sure.
So an inference engine turns,
it takes a already-trained model.
So this can be a very small model like Q1B.
It could be a very big model on DeepSQL, Kimmy K2,
run it on a cellular computing device.
And its job is to fully utilize the computing device
to be able to generate text and images and videos,
essentially, but this all got tokenized into individual tokens.
So the goal of inference engine is to produce, the goal of inference engine is to run the model
at highly efficient speed to make sure that we can produce maximum outputs at the highest
efficiency.
And just from a high level, can you explain some of the architecture, how sort of a typical
inference engine works?
What are just the few most important components that people would be interested to learn more
about?
maybe one goes through a life of a request.
Like if I say hello, what would happen to VOLM?
Yeah.
Yeah, so basically there's a kind of traditional API server.
Definitely, you know, guess the retest and maybe,
and once the model generates output, it streambacks the tokens one by one.
Yeah, so there's like definitely a traditional API server layer.
And inside an internet, we have kind of typically something called tokenizer, right,
like to transform this like inputs to like the tokens,
basically some integers, the least of vintage.
that the language model can consume.
And inside of the engine,
what we call like engine,
and that includes a scheduler to,
you know,
which decides how to batch the recast,
to incoming recast.
And we have a memory manager to manage something called KV cache,
which is kind of the core part of the transformer for other limbs.
And we have a definitely have some kind of worker.
This is a very generic term,
but which basically actually initialize the model
and run the model and get the output
and do all the pre-processing for the input
and, you know, post-processing for the model output.
Yeah, so, yeah, that's basically,
I mean, in a sense, it's not like a crazy new architecture,
but each one basically highly optimized and specialized
for this LM inference workflow.
Do you think it's getting easier or harder over time
running inference?
Yeah, definitely, I think it is,
definitely much getting much more difficult over time. Actually, honestly, maybe one and a half
years ago, I wasn't thinking, like, in France, there's a hard problem at all, to be very honest.
But now things have changed. The trend has changed so far. So I think there are kind of three factors.
One is scale. Another is diversity. And the last one is kind of agents. So for scale, you know,
the models are definitely getting larger.
And, you know, right now we have Kimi K2 with like more than 100, more than a trillion
parameters.
But I think we believe we will see like multi-trillion parameter open source model this year.
And I think that's still clearly a trend that people will be training a larger model.
And, you know, definitely it's much more challenging to deal with such a model compared to, you know,
like the only days of at a lens where we just only deal with like small llama models.
And with larger models, presumably you need more nodes working concurrently you need
you have more memory to manage that may or may not fit in each, you know, chips available
memory. You could describe some of the challenges from scale.
Yeah, for these kind of large models, we definitely need to shard, you know,
distribute the model into multiple like GPUs, multiple nodes, right?
And then, and yeah, then there's like definitely like a problem of how to,
chart how to distribute this model, right?
There are actually many dimensions we can use to charge the model,
and they have different trade-offs.
And, yeah, trade-offs, for example,
in terms of how much communication we should pay to share the model in this way.
And also, there's a trade-up in terms of, like, load balancing.
If I share this in this dimension, then how significant is the load imbalance?
So these all need to take into account for the final performance estimation
to get the best performance.
And yeah, it could be becoming more and more
a bigger problem as the models get larger.
And what about just cluster scale?
I mean, I think, Simon,
how many nodes is VLM running on at any given time?
Right now, we're looking at,
this is true, our sort of like a very small
sample of our usage statistics
that's used for us to figure out
what feature to deprecate.
Just literally from this one signal we're looking at
400K to 500K GPUs 24-7 running VLM.
And there's quite a big scale thinking about the global deployment of GPU footprints,
and we definitely believe there's a lot more out there.
And of course, this is a wide diversity of different kinds of GPUs, GPU architecture,
as well as model architecture being deployed.
We're not seeing like a one-size-fit-all.
People are using it for just one singular use case.
I see.
And this is sort of your point.
Your second point was about diversity, sort of making it prints a harder,
problem over time. Yeah, the chip diversity, harder diversity is definitely one factor. And also
models are getting also diverse, you know. If you think about the, like, for example, like for
Media, like a year ago, I think they only released a few series of open source models. But now
they're releasing many open source models like every month in different domains, right? So on the video,
someone on the robotics, someone on the language. And yeah, this kind of like open sourcing trend is
getting expanding and that people are training many different kinds of models in many different
domains and releasing them like every month. So there's model diversity and even for just for
text models, they're all transformers in that, but their detailed architecture still are very
diverse and they're even, we see they're even diverging. Like say for like deep six 3.2 was using
sparsal attention, something called sparsal attention, but say for Q1,000.
and Kimi, they're kind of exploring, like, linear attention, which is kind of different
attention mechanism, and they have different ways to manage the memory. So, yeah, this model
architecture divergence is also getting, getting more significant. And so is it up to you,
as, you know, meaning BLLM to implement all of these, like all the two, you know, implement sparse
attention, for instance, so that it's available for the models to use? Yeah, definitely, we
We basically leverage open source community definitely.
Like we, you know, because we collaborate with these model vendors,
like we often get help from these model vendors.
They basically provide some kernels or at least like reference implementations of, you know,
of these new kind of like operations.
And yeah, we like our job is often like basically leverage this collaboration and making more mature
and also available for more diverse environments.
I remember early on in open.
open source models. There was some standardization. Like everyone was kind of using Lama. I think everyone's
using sort of like the same tokenizer and the same like input format and, you know, and like end of
stream token and stuff like that. Is that still the case or is it like is a different for each
provider now? It is, yeah, it diverged it quite a bit over the last few years, maybe last couple years.
Yeah. One thing is that many, yeah, like the model architecture itself has changed a lot, you know,
especially on the attention side.
And also even for like input output processing,
because like different labs have different kind of their own ways to form,
you know, how to form the conversation
and how to form the tool calls, for example, for their own models.
So now like this has been diverging quite a bit.
And now, yeah, this has been diverging quite a bit for the last couple of years.
I see.
Okay.
So scale of models, diversity of models and hardware deployment scenarios.
And then agents for the third thing you mentioned,
and sort of getting hard over it.
Yeah, yeah.
You know, like for agents, we need a,
definitely we need a kind of different,
I mean, beyond, just beyond the inference engine,
we also need to set up the whole new, like,
environment, a whole new, like, infrastructure
to support all the tool callings
and to support all the, yeah, like multi-agents things.
Yeah, like that part becoming a kind of a new,
like emerging challenge for inference as a whole.
Do you think this means more,
there will be more state managed in the,
imprint slayer over time. As before, the paradigm has been texting, text, out, and then just
single request response. But as we evolve into the year and the decade of agents, we're seeing
multi-turn conversation turning into hundreds and thousands of turns. And then these terms also
involves external tool use, like interacting with sandbox, performing web searches, running
Python script or any programming languages and be able to have this kind of long iterative
process where LM is involved but also external environment interaction is involved.
And this really kicked off a huge wave of co-optimizing a genetic architecture with
influence architecture. So just to give an example that when, just to give an example,
it is very important for VLM to understand whether or not the conversation.
is still happening. If the conversation is no longer happening, we can remove the KV cache.
That is the persistent state associated with each text completion streams. But in agentic
use cases, you actually don't know whether or not the agent will think it finishes or also
wait the interaction previously, and the interaction previously was just a human typing in the
text box. But now it becomes external environment interaction. It could be
one second just for a single script to finish. It could be 10 second for a search or like a complex
analysis to finish. And then it could also be minutes, hours, if there's humans in the loop.
Now, with that uncertainty, we actually don't even know when is the request going to come back.
And then the uniformly of cache access pattern and eviction pattern got kind of, the patterns got
pretty disrupted by the new paradigm. I see. I see. And so you have to be much smarter about
how you manage the cache as one.
As one example of that, yeah.
Gotcha, gotcha, gotcha, which is one of the like unsolvable problems in computer science.
Cache invalidation.
Yeah, so exactly.
So I can see how that would get harder over time.
I think I know the answer to this, but are you guys big believers in open source AI compared to
closed source?
And can you just explain like how you think about that?
We're definitely big believers in open source.
What we believe is diversity will triumph that sort of.
that sort of single of anything at all.
So that means we believe in diversities in models,
diversity in chip architecture.
Fundamentally, this is because the world is complex.
For any application, you're going to need to find
and tailor the right model architecture
to the right chip architecture for that right exact use cases.
And the best way to promote diversity and improve that is through open source.
Because open source, everybody know where everybody else is up to
and be able to make their opinionate take based off the common ground.
And finally, if you look at Israel Computer Science, operating system, cluster managers,
databases, every single system field get better when they're starting to have a common standard
and everybody that deviate a little bit, innovate on top of each other,
versus following a single line of trend that is proprietary and single source control.
I see. That's very interesting.
So you're almost saying OpenAI will tune their stack very tightly for their use case,
which is chat GPT or whatever other apps they're running.
For an enterprise or another tech company,
if I want that same level of tuning,
I like can't just use off-the-shelf close-source models
because I don't sort of have control the whole stack.
And like the different participants in the stack kind of aren't paying attention.
Yeah, of course, one part is data.
One part is the model architecture itself,
which will impact the performance.
And then just on the model architecture,
architecture itself, right? How smart do you want the model to be? Do you want the model to be
able to handle millions of contacts, token contacts, or just shorter context is totally fine, right? And
then you also need to specialize that model to your exact compute architecture. What chip are you
using? For example, for Nvidia, the model you design for a H-100 chip is very different from
a B-200 ship, and then it is very different for a GB-200 MV-L-72 system. And then compared to, for example,
the model architecture you designed for TPU, then again, that is also drastically different.
And then using it for vision model, video generation, and for reasoning mass coding,
in the end, we'll all look kind of look at the vertical stat integration.
We're like, wow, there's so much different from each other.
That makes sense.
Can you just share any stories about live BLLM deployments that you thought were particularly
interesting or important?
I have a few.
One is, I think around 2024, we learned that Amazon is running VOLM to power their Rufus assistant bot,
which was like really surprising to all of us because one as a point, like, of course, like,
we believe VLM can be deployed at scale, but seeing this as a massive scale, like kind of global
ecommerce deploying this as like front page feature.
That means when everybody, when they're opening Amazon app and clicking the,
bots suggestion or even entering a search query is going through a VOM.
And this is kind of the first sort of magical experience in a way.
One of the first experience was, wow, my purchase is going through VOLM right now.
It's kind of exciting, but also scary.
You're like PhD students at the time.
And also across not just Amazon, LinkedIn, and every major deployment of VOM,
we're surprised to find out they're always the first adopter of cutting-edge features.
So I've seen one of the example of deployment of VOM within Character AI was when we first
make the N-Grant speculation for a spectacode available as just a single PR, pull request in VLM,
not even merged.
And then while we're still iterating on that feature, and I heard some from Character AI saying,
oh, actually, we already wrote it out to you hundreds of GPUs at scale given just your first
iteration of this feature.
So it's really much everybody is staying on the cutting edge.
of VOLM and we're quite excited about that.
Yeah.
Okay, should we talk about the company then?
Infrax.
What is Infrax and why did you guys decide to start the company?
So Infraq created by the creators and maintainers of the VOM project.
Our goal is to make VOM the world's influence engine, really push the capabilities on the
open source front and then builds a universal inference layer.
means we'll have the wrong time to power any new model on new hardware for new application,
be able to tailor that to extreme efficiency and support all the AI workload going forward.
And implicit in what you just said is that you're devoting a lot of resources, I think,
to the open source project.
Could you, I guess, is that right?
And can you expand on that?
Yeah, one thing I believe is, I fundamentally believe that open source, especially how
VOLM itself is structured, is critical to.
the AI infrastructure. And what we want to do with Infraq is to support, maintain, steward,
and push forward the open source ecosystem. It is only that VLM when VLM becomes a standard and
VLM help everybody to achieve what they need to do, then our company in the sense have the
right meaning and to be able to support everybody around it. So open source is definitely number
one and in fact something's the only priority of our company right now. You're not supposed to tell
your investors, by the way, that we do believe that open source project is also kind of a secret
weapon in a sense that having this community all work together for this open source, we have
the execution beyond any single entity can have. This is the thing we heard over and over again
that people just tell us, we just cannot keep up with VLM. So that's the thing we're not. We just cannot keep up with VLM.
That's why we're using VOL.
We have our internal team.
We have our internal fork.
We have our internal inference engine.
But open source moves so fast that the only way to stay ahead is adopting.
And that's why we want to make happen.
And in fact, this is exactly why we're staying all in on open source.
That's awesome.
We mentioned Jan Stoica before, obviously one of the founders of Databricks.
He was I think both of your PhD advisors at Berkeley.
And he's going to be involved in Infraact too.
Can you talk about maybe a little bit how he's going to be involved in this company?
And even more importantly, what have you guys learned from him as, you know, his students and about startups and, you know, distribute systems and all this stuff?
Sure.
Yeah.
Yeah, you're exactly right.
Young's both of our advisors.
I have actually worked with Young since 2017 since I was an undergrad working on my first opens up project for serving and then work with him at any scale for my second opens up for serving.
You're just addicted to like Berkeley-based open source of the I serving company.
So as this company, and VOL, Yang is quite involved.
So as a company, he will be a co-founder.
And then as an open source project, he has been advising this project since its inception.
Yang knows open source project, academic project, industry research trend, E&L.
So from what we're working together on, Ian really helps us with both clearly understanding all the lessons.
learned about bringing open source through the final miles of adoption in companies, enterprises,
as well as what is actually happening on the research world.
A Sky Computing Lab over the last few years has produced amazing infrastructure and new research
ideas, and Yang continued to explore a new frontier on that front, and then we're quite excited
to hear that and also innovate on the open source together.
Yeah, and he also helps recruiting a lot.
And, you know, like all he is involved in all of our hiring process.
He basically tells us, I mean, teaches us how to tell, you know, talents, how to,
where to find talents.
These are all amazingly helpful.
So on that topic, what are some of the big problems you need to solve now and what type
of people are you hiring to help you help?
Definitely, you know, the inference at scale is kind of the one of the biggest challenge.
I think in the field, not only for us, but in the field overall.
So we are trying to hire more like a very experienced ML, ML Infra Engineers overall to make, you know, for, for example, you know, how we, what would be the best way to utilize the GB 200, GB200, GB-300, MBL 72 rack entirely for the giant open source model.
Still, I think it's a open problem.
There are definitely some endeavors in academia and industry, but I think there are some like room for improvements.
So, yeah, that's some of our focus at the moment.
Here's my pitch from a computer science point.
Pretty rare if people ask me this question.
That is, if you're working at a vertically integrated company
that have an end product for, let's say, for chatbots, for assistant,
you are working on the vertical size of the problem.
In Infrax, you will be working on an obstruction of horizontal layer.
And this is similar to operating systems.
system, databases, and different kinds of abstraction that people have built over the years.
Operating system, abstracted CPU and memory, databases and file system, abstracted storage devices
and networking.
For accelerated computing, there's a brand new physical device that inference and VLR
abstracted a large part of it for inference-specific work.
Of course, it's training, but we are a singular focus is on inference.
and this necessitates a layer, a software layer,
that abstract away GPUs and a certain computing device for models.
And this is as important from my point of view as abstraction unity build for OS for databases,
which are both fields we're really passionate about when we're Ph.G students, too.
So that's why M.O.S.C. is fundamentally a new system research and system deployment.
So you, here at Infraq will be working on this layer that is not a vertical slice, but a fundamental, but a fundamental runtime and impacting all the future generation of software that will run on a cellular computing device.
And your work will stand from both working with different models, different, and then working with different applications.
and as well as understanding the pros and kinds of different chips,
as well as their whole integrated data center systems,
to be able to figure out, oh, actually, for these, we should build the abstraction in this way.
And we'll constantly remove abstraction, break abstraction,
and build it over and over again,
just like how operating system got innovated over time,
databases got innovated over time,
with a new information we have a hint.
So you will come here to have that.
constant exercise of building an actual widely deployed production system that will be at the
frontier of influence.
And this is what you call universal inference layer.
Yeah.
It's purposely vague in a way, but what we really focus on is going from page attention,
from going from the serving system to the whole runtime you need for intelligence.
We suck, Simon. Thank you so much for being here today.
Thrilled to have you on the podcast, of course.
And we're thrilled to be, you know, working together in the company.
It feels like it's been a few years.
We've already been working together.
But yeah, great to have you here.
And congratulations on getting off to a great start.
Thank you for having us.
Yeah.
Thank you.
Thanks for listening to the A16Z podcast.
If you enjoy the episode, let us know by leaving a review at rate thispodcast.com
slash a16Z.
We've got more great conversations coming your way.
See you next.
time. As a reminder, the content here is for informational purposes only. Should not be taken as
legal business, tax, or investment advice, or be used to evaluate any investment or security,
and is not directed at any investors or potential investors in any A16Z fund. Please note that A16Z
and its affiliates may also maintain investments in the companies discussed in this podcast.
For more details, including a link to our investments, please see A16Z.com forward slash disclosures.
