The Infra Pod - Infra to talk to your AI in real-time (Chat with Russ from Livekit)
Episode Date: March 24, 2025
In this engaging episode of the Infra Pod, Tim and Ian dive deep into the world of real-time audio and video infrastructure with Russ d'Sa, CEO of LiveKit. Russ shares the story of how LiveKit evolved from an open-source project into a company and discusses the challenges it tackles, including scaling real-time communication, building AI infrastructure, and the future of human-computer interaction. They also explore LiveKit's role in transforming various industries, from customer support to robotics and AI-powered applications.
00:22 The Origin and Evolution of LiveKit
02:47 Challenges in Real-Time Audio and Video
06:18 LiveKit's Impact and Use Cases
18:52 AI and Real-Time Infrastructure
29:43 Future of Human-Computer Interaction
Transcript
Well, welcome to the Infra Pod.
It's Tim from Essence and Ian. Let's go!
Hey Tim, I'm super excited to have Russ d'Sa, CEO of LiveKit, on the podcast today.
Russ, tell us a little bit about yourself.
How did you get involved in LiveKit?
Like, what is it? Why did you get started?
What's up, you guys? Yeah, yeah.
Well, we started as an open source project, kind of turned into a company a
little bit by accident, started the company or the open source project
rather a few years ago, 2021, very different time in the world.
We're kind of all stuck at home.
Like we are now.
I mean, kind of we can go outside, but back then we couldn't and, um, everybody
needed real time audio and video infrastructure
because it was the only way you could connect with people
was over the internet using your camera and your microphone.
And so at the time, there just wasn't
any open source infrastructure for doing this
that made it easy to kind of build anything you wanted.
Effectively, like what people were doing
was they were just turning everything
into a Zoom screen share instead of, you know, leveraging the same technology that Zoom has
underneath and embedding that in their application.
So started working on a stack for this called LiveKit and kind of blew up once we launched
it in open source and then had real companies that started to ping us and say, hey, I love
this infrastructure, but I don't really want to deploy and scale it myself.
Can you deploy and scale it for me?
I mean, we can pay you money.
And so we went and raised around and started to grow the team a little bit.
I think it was just three people when we started on the open source project
and we started to grow and built a whole cloud infrastructure
kind of network all around the world for this,
of live kit servers everywhere.
And then, yeah, now we serve a lot of traffic.
I'd say like we serve probably as many concurrent users
as Fortnite on a Tuesday, not Fortnite on a Saturday.
That's like, no joke.
They have a lot of people playing Fortnite on weekends.
But yeah, it's scaled up pretty well.
And I think the interesting thing that happened
in the company's life was maybe at the end of 2022
when ChatGPT, the website came out,
I built a demo where instead of texting with it,
you could talk to it.
I put it out there on, it was called Twitter at the time.
And didn't really get a lot of attention,
but OpenAI ended up finding that a few months later
and we started to work with them on building voice mode
for ChatGPT.
And yeah, now everyone's playing around
with talking to their computers
and it's become like an entire industry.
And my co-founder likes to say that I manifested it
with this demo that I built early on.
But yeah, it's been a wild ride ever since.
And now we're kind of very firmly focused on AI infrastructure.
Awesome.
That's incredible.
I'm really interested.
Actually, for context, while I was at Salesforce in 2012, we built this audio-video
solution based on WebRTC.
So I'm actually pretty familiar with this stuff.
Oh, nice.
Yeah.
There's a video out there, it's called Salesforce SOS.
We used it to help customers on mobile phones do their taxes.
It was a wild ride.
But very early in the days of WebRTC.
I'm curious, what was it about LiveKit and what you had built?
What were the challenges that the average developer was having using WebRTC?
What did you solve with the open source
that the average developer couldn't pick up and do? Like what's the problem that was first initially solved
by LiveKit?
It's a great question.
You know, when we were working on LiveKit,
it was really just
a side project that I was working on
during the pandemic.
And, you know, I've thought about this in the past,
like why did LiveKit end up doing well
when we launched open source?
And I think that there's two key reasons why.
The first one is that at the time
that we started working on it,
everything out there was really purpose-built
for video conferencing.
It was all low code, no code, video conferencing,
kind of drop in, zoom onto your website,
or build your own kind of zoom type of platforms
or vendors that provided that.
There wasn't something that was more general purpose,
like a lower level sort of substrate
that a developer could use in many different ways.
And so I think one reason that what we did resonated
with the developer community was we built it to be very kind
of low level, multipurpose.
The second thing that I think was surprising,
well, maybe not surprising, is that at that time,
everybody was doing low-code
video conferencing, and where do people do most
of their video conferencing?
They do it in a web browser.
They're having a business meeting, right?
And so all of the kind of commercial providers at the time,
they didn't have SDKs on mobile.
They didn't have like, they didn't support platforms
like React Native and Flutter
and all of these kind of new platforms
that people were building applications on.
And so when we launched open source right out of the gate,
we had a web SDK, a React SDK, React Native SDK,
a Flutter SDK, an Android SDK, an iOS SDK.
And people were pretty shocked that this open source project with just three people
working on it had all of these SDKs on all these platforms right out of the gates when
commercial providers didn't. And so I think it was those two things. It was a combination
of the fact that you could use LiveKit to build any application,
and that you could build that application across any platform. And that's what I think really resonated
and kind of got this like ball moving for the project
and then eventually for the company.
So I'm very curious, because in this era
of building applications, or let's call them AI applications,
at this point it's actually really hard to even fathom
what kind of applications
people are even building these days.
Because you look at the actual companies and startups that are getting funded,
they're building agents, they're doing all kinds of stuff.
And you have the big model providers doing all kinds of stuff.
I'm just curious, for LiveKit,
because from when you started until now,
AI development has moved so fast.
What are the kind of applications people are building with LiveKit,
and what do you feel are the best,
almost like bread and butter, types of apps
you think LiveKit is best suited for?
And has that changed over time,
or has it been pretty much the same?
It has evolved over time.
To give you an example here,
like when we started, it was during the pandemic,
AI wasn't a hot thing, and the only
thing that you use WebRTC for was video conferencing, mainly. What happened a bit later was once we
launched LiveKit Cloud, so this horizontally scaling mesh network system of LiveKit servers
all around the world, all of a sudden we could now handle large scale. So this was something else that
even the commercial providers
at the time couldn't handle, maybe except for Agora,
but outside of them, everyone was capped at like 50
or a hundred people in a session.
We could handle a million people in a session.
And then suddenly all these live streaming companies
started to use us instead of using a CDN
because a CDN isn't actually truly real time.
And so that was a new use case that people started to use us for.
And then of course we allowed you to run WebRTC on the backend
so you could build a server program that could consume audio and video streams
instead of a human and then that kind of unlocked the multimodal AI use case.
Along the way we also had a lot of robotics companies using us.
So there were people like Skydio that were building a drone tele-op system
where you could kind of take in a camera feed from a drone,
and then it's running in a police precinct, has this command center,
and they see all this footage from all these different drones.
And then they can tap into one, and then they use data channels over LiveKit
to issue command and control, to steer the drone around and to change where it's looking and its position.
And so I would say that, like, we already had a pretty diverse
set of use cases when we started.
So video conferencing, live streaming after we launched cloud
and then robotics and spatial computing, like, you know, Gather Town.
You're moving around a world and running into people, like NPCs in video games.
People were using us for that.
And then kind of as the AI wave started,
people started to use us for being able to allow
the AI model to see, hear, speak, et cetera.
And so that's kind of been the evolution
of the use cases across us.
And then focusing in on AI specifically,
I think was another part of your question.
So, you know, we work with OpenAI on ChatGPT.
We power CharacterAI's voice as well, and a bunch of others.
I think those use cases are more in this bucket of, like, emergent, like, assistant,
virtual assistant, or agent-that-can-help-you kind of use cases.
Let's just call it, kind of generically, an assistant
type of use case. I would call those emergent. They're not really mainstream and everywhere.
And you know, as you mentioned, Ian, like 10, 12 years ago, maybe even a little
bit more, with Alexa and Siri, this assistant use case has
been around but not very good. And it's getting better now. But I think for voice and video specifically around AI, the here and now use cases that
are already at scale, I think are in two places.
One on the audio side or in the voice side is kind of customer support, telephony, any
kind of use case where you're using a phone
because the phone is already a system where the default input is audio, right?
Like when you're sitting behind a keyboard on your desktop, the default interface isn't
really audio, it's your keyboard, right?
It's text.
But like calling something, some IVR system or calling another human being in a contact
center or calling like a restaurant to make an order.
That's voice native. And so I think AI is very quickly kind of coming into that space and disrupting that space.
On the video side, the space that I see LiveKit already being used quite a bit is in kind of surveillance and observation.
So it's not on the video generation side, but on the video computer vision side.
To use an example, LiveKit powers 25% of 911 in the United States.
And the way that that works is effectively someone can call 911 and if they have a big
emergency, they tap the FaceTime button on their iPhone
and they're streaming video through LiveKit
to the dispatch center.
What ends up happening every week is the dispatch agent
actually coaches a person on how to administer CPR
over video and they save a person's life from a heart attack.
Every week, this happens to at least one person.
But what's also happening now is that 911 dispatch
is putting an agent into that actual call paired with the human dispatch agent who's
watching what's happening over video, who's consuming the audio, and then doing things
like helping triage or dispatch out to fire or emergency services or police to bring them
into the session so that they can go and then observe and figure out
what's happening as well.
So that's like an example of computer vision.
There's other folks like Spot and Verkada
who are also putting agents that are connected to surveillance
cameras and security cameras and doing things
around computer vision through LiveKit as well.
So I think those are the two kind of support telephony on the voice side and then the surveillance
robotics use case on the vision side where we're kind of seeing a lot of early traction
as it pertains to AI.
Very cool.
So I think it's really interesting that you talk about like the evolution of how you started
with the application builders almost like getting started to build some apps to the
point where OpenAI and Character are using you.
Like that's a huge difference of scale actually.
Yeah.
You know?
So I think one thing is, you talked about the support of all kinds of SDKs and
stuff, and that's almost like the entry point of how you actually even leverage you.
But there's also the infrastructure required to build this.
And so when we talk about like,
what is the infra required to power OpenAI?
Like what's the hardest part of getting this right?
What is the infra challenges here to build this?
Yeah, so I would say that there's
a primary infrastructure challenge that might come as
a surprise to the vast majority of developers out there that are working on infrastructure,
distributed systems, even at the application layer.
And that is the internet wasn't actually designed for real-time audio and video streaming.
It just wasn't built for it. It's hypertext transfer protocol, HTTP.
It's not hyper audio, it's not hyper video, it's hypertext.
And it's built on HTTP, which is built on TCP,
and TCP was not designed for streaming real-time media.
UDP was designed for this,
and WebRTC
is like an abstraction layer on top of UDP.
The issue with WebRTC is that WebRTC
is a peer-to-peer protocol.
So it wasn't really built for scale.
Peers send their media, their audio
or their video, between one another
over the public internet.
And so there's a latency that is incurred there.
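As a purely illustrative sketch of why real-time media favors UDP over TCP, here is a small Python toy, not LiveKit code: each frame carries a timestamp, and anything that arrives too late is simply dropped instead of the whole stream stalling behind retransmissions the way an ordered TCP connection would. All names and numbers are assumptions.

```python
# Toy UDP media loop: timestamp each frame, drop late ones instead of waiting.
# Everything here is an illustrative assumption, not LiveKit internals.
import socket
import struct
import time

FRAME_INTERVAL = 0.02   # 20 ms audio frames, a typical packetization interval
MAX_LATENESS = 0.10     # frames older than 100 ms are useless for live playback

def send_frames(dest=("127.0.0.1", 5004), count=100):
    """Send dummy timestamped 'audio' frames as UDP datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in range(count):
        payload = struct.pack("!Id", seq, time.time()) + b"\x00" * 160
        sock.sendto(payload, dest)
        time.sleep(FRAME_INTERVAL)

def receive_frames(bind=("127.0.0.1", 5004)):
    """Play frames as they arrive; late or lost frames are skipped, never awaited."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(bind)
    while True:
        data, _ = sock.recvfrom(2048)
        seq, sent_at = struct.unpack("!Id", data[:12])
        if time.time() - sent_at > MAX_LATENESS:
            continue  # too stale to play; TCP would have blocked the stream on it
        print(f"playing frame {seq}")
```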
Now, let's talk about kind of the AI use case, right?
You have a peer that is not a human, but you have a peer that is like an agent, right?
Sitting in a data center somewhere attached to a GPU instance.
And that agent needs to be able to get your audio and video
so that it can run inference on it
and then give you a response.
What ends up happening is running these models,
it's pretty heavyweight.
I mean, I know this project Stargate,
it's gonna take a while,
but running these models,
you need like a pretty powerful data center.
And so you just can't spread that compute everywhere,
at least not today.
And so you have compute in a few different places,
but then you have users that are connecting from all around the world, right?
Every corner of the globe.
And so if you're just using kind of vanilla WebRTC,
you're sending your...
The user is sending their audio and video over the public internet,
which is like kind of like the road system of the world, right?
There's ditches and bridges and broken,
I don't know if there's broken streets.
I don't know what that means.
There's an earthquake somewhere sometimes,
and there's a big crack in the road.
It's the road system.
And so when your packets are navigating that road system,
it can take a while.
It can slow down, and then the user experience
gets kind of compromised as a result.
There's lag or latency, or the model just freezes,
doesn't respond to you, or the audio arrives,
and it's not kind of high quality.
So the way it understands what you're saying is inaccurate.
There's all kinds of issues that happen
if you're just using kind of public internet for routing.
And so what you really want to do
is you want to have a network. Think of it as like you have these hubs, which are these data centers
that have heavy compute. And then you want to have like a bunch of tendrils that are
coming out to all the corners of the world. And you're terminating the user's connection
as close as possible to them over the public internet. And then once you terminate the
connection at the edge, you're routing the audio and video
over the private fiber backbone
that kind of connects all the data centers
around the world together to that GPU machine.
And then you run the inference
and then you send the response back
over that private internet backbone
and out at the edge closest to the user
so that the packets spend as little time
as possible on the public internet. That's
really kind of the game for getting the latency down as low as possible and for building kind of
like this very human-like experience of interacting with an AI model. And so that's really what LiveKit
does is LiveKit has a network of servers all around the world. And then we have this kind of
software-defined routing infrastructure that is measuring connectivity
between your ISP and data center, between your user, wherever they are, and the compute
and between data centers and other data centers.
And then, you know, you can think of it as like we're Google Maps
for bytes flowing over a network: we're figuring out the fastest path to get bytes
from the user
to the AI model and then from the AI model back to wherever that user is.
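To make that "Google Maps for bytes" idea concrete, here is a rough sketch of the kind of computation involved: shortest-path selection over a graph of measured link latencies, from a user to a nearby edge and across the backbone to a GPU data center. The node names and millisecond figures are invented for illustration, and LiveKit's real routing surely weighs far more than a single latency number.

```python
# A hedged sketch of latency-aware path selection over an edge/backbone graph.
# Node names and latencies are made up; this is not LiveKit's routing code.
import heapq

# measured one-way latencies in milliseconds between points of presence
LINKS = {
    "user":             {"edge-sin": 8, "edge-hkg": 34},
    "edge-sin":         {"backbone-sin-sfo": 92},
    "edge-hkg":         {"backbone-hkg-sfo": 98},
    "backbone-sin-sfo": {"dc-us-west": 4},
    "backbone-hkg-sfo": {"dc-us-west": 6},
    "dc-us-west":       {},
}

def fastest_path(src: str, dst: str):
    """Plain Dijkstra over the latency graph; returns (total_ms, hops)."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, ms in LINKS[node].items():
            heapq.heappush(queue, (cost + ms, nxt, path + [nxt]))
    return float("inf"), []

print(fastest_path("user", "dc-us-west"))
# -> (104, ['user', 'edge-sin', 'backbone-sin-sfo', 'dc-us-west'])
```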
To give you an example, the original voice mode, the first day that it went out, there
was like a user that connected from Kazakhstan.
And I was like, okay, you know, like OpenAI doesn't have compute close to that user.
And that's how they can kind of benefit from this network infrastructure
where we kind of cut down the latency significantly,
getting that information from Kazakhstan to ChatGPT.
And so yeah, that's one big scaling challenge
is building a network like this.
We've been working on it for about two and a half years.
It's just that a lot of things go wrong with networking.
I mean, Zoom's been working on this stuff for like 10 years.
And so that's one big challenge.
And then, of course, the scale of these applications
is another big challenge.
So that's one big infrastructure difference,
the one I talked about at the start, that the internet
wasn't designed for audio and video,
and we built this infrastructure on WebRTC
and this network infrastructure for it.
But then the other kind of difference
from traditional web applications and scaling those
is that this connection that you have between the AI model
and the user is stateful.
It's not stateless.
So I might be talking to ChatGPT for two minutes.
It might be for 20 minutes.
It might be for two hours.
And I have this agent that is kind of dedicated to me and stateful.
It's listening to me all the time for as long as that session runs.
Right?
Like when I call a customer support agent, a human, they're not, you know, talking to
five people at the same time, unless they put me on hold and then go to the next person.
If they don't put me on hold, they're talking to just me.
And that's the same thing with an AI model.
It's different from a web application where in a web application, every request and response
is roughly uniform workload.
And you can kind of like round robin load balance across a bunch of servers.
You can't do the same thing with kind of a stateful application.
You have to, it's a different load balancing mechanism where you have a pool of these agents, and
you take one out of the pool that's not busy and you connect it to the user and you make sure you
can health check it and handle reconnections and all of that stuff and basically manage the life
cycle of it and then you put it back into the pool when it's all done and then it's freed up to have
the next conversation. And so scaling that system, it's just a completely different paradigm than
load balancing across a web application. So that's another challenge that we've had to
solve working with some of these large companies.
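As a rough sketch of that pool-based, stateful load-balancing pattern, here is a minimal asyncio version. The Agent class, health check, and timings are hypothetical stand-ins, not LiveKit's agents API: a worker is checked out of the pool, dedicated to one user's session for its whole duration, then health-checked and returned.

```python
# Minimal sketch of stateful agent pooling, assuming a made-up Agent class.
import asyncio

class Agent:
    def __init__(self, agent_id: int):
        self.agent_id = agent_id

    async def healthy(self) -> bool:
        return True  # stand-in for a real health check / reconnection probe

    async def run_session(self, user: str):
        print(f"agent {self.agent_id} handling {user}'s session")
        await asyncio.sleep(1)  # stands in for minutes of stateful conversation

class AgentPool:
    def __init__(self, size: int):
        self._idle: asyncio.Queue[Agent] = asyncio.Queue()
        for i in range(size):
            self._idle.put_nowait(Agent(i))

    async def handle_user(self, user: str):
        agent = await self._idle.get()          # waits if every agent is busy
        try:
            await agent.run_session(user)       # agent is dedicated to this one user
        finally:
            if await agent.healthy():           # only reuse agents that pass checks
                self._idle.put_nowait(agent)
            else:
                self._idle.put_nowait(Agent(agent.agent_id))  # replace a bad worker

async def main():
    pool = AgentPool(size=2)
    await asyncio.gather(*(pool.handle_user(u) for u in ["alice", "bob", "carol"]))

asyncio.run(main())
```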
So in building this network and this infrastructure at the basic level, are you racking and stacking
servers in different countries? Are you signing peering contracts with fiber providers? Like,
where's the bottom of the live kit stack in terms of the level that you're operating
in order to build this network?
Are you a Cloudflare?
Help us understand sort of the depth of infrastructure you're building and how much of that vertical
stack you're managing to create these experiences for people.
Yeah.
So I think the answer is, are we a Cloudflare for AI?
I would say not yet, just because we write software that runs on top of commoditized
hardware in data centers.
So we leverage cloud providers.
But I would say a big difference between us and everyone else that is in this space is
two things.
One, I think over time, we will expand to build our own network at the hardware level.
It's the only way to really have kind of full control and reliability and observability over
every issue that can go wrong with networking.
I mean, the reason why Zoom is so good is because Zoom controls everything kind of from
soup to nuts. For a startup, that's just not feasible from a CapEx perspective.
It's a bit of
a chicken and egg. You need the traffic, the utilization to rationalize going and building your own data
centers. And so there is a pay to play actually here, where you have to have a certain number of
data centers and a certain number of points of presence around the world just to give a good
user experience where people will trust you and use you. But then over time, yeah, you can switch in
your own data centers. And that's how we set up LiveKit Cloud from the very beginning.
So the first decision that we made, and this is kind of getting into where are we in the
stack and like, what kind of stuff did we have to build at what level of depth?
The first thing that we did was we decided that we were not going to build on AWS.
AWS one is very expensive, but two, they're expensive because they have a really, really
good network and a very wide network. The tricky bit with building on a really good network is that
you are actually insulated from a lot of problems that happen by AWS's kind of management of their
service. They hide a bunch of the issues underneath. And if you ever hope to kind of roll out your own
data centers one day, you have to know what those issues are so that you can
mitigate them in software.
Because once you slot in your own data center, you're going to
have a bunch of issues, and you might need to take that data
center out of rotation without taking downtime, or your
application developers on top of you taking downtime.
So what we said was we're going to build a multi-cloud system,
so we're not going to run on one data center provider.
We're going to run across a blend of them.
So Oracle and Vultr and GCP and Azure and Linode, all of them together.
We're going to run one overlay network.
Different providers have better points of presence or better quality in different regions.
So we're also going to measure these networks or measure the kind of peering agreements and connectivity between each one in
different areas and make real-time routing decisions whether to take a
certain provider out of rotation, whether to spin up more capacity on another
provider. So we have this kind of software system that is doing this in
real time and making these decisions and what that allows us to do is, let's say that we
have a lot of utilization in Singapore, just to use an example. We can go to Singapore, you know,
we can go and partner with a company like an Equinix or someone else and effectively slot in
our own data center in Singapore and start to run traffic through it and get that to a point of reliability where, if it goes down or if there's an issue, the software will just automatically route around it.
But we can kind of over time start to build our own hardware network piecemeal, not having to swap the entire engine out all at once, you know, mid-flight. I think that's like a tricky thing to do. And so in a nutshell, what we're doing is
we're effectively pushing as much as we possibly can into the software layer.
But then over time we will start to kind of region by region build out our own
hardware layer underneath. So short answer,
I already gave you the long answer,
but short answer to your question is we run on
multiple cloud providers now and over time we will start to kind of roll out our own data centers where it makes sense and where our utilization makes sense to do so.
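A toy version of that per-region, per-provider rotation decision might look like the following; the providers, thresholds, and measurements are illustrative assumptions rather than LiveKit's actual policy. Providers whose recent numbers fall outside the limits are effectively taken out of rotation for that region.

```python
# Sketch of choosing an in-rotation provider per region from measured quality data.
# All providers, thresholds, and numbers here are invented for illustration.
from dataclasses import dataclass

@dataclass
class ProviderStats:
    provider: str
    region: str
    packet_loss_pct: float   # recent measured loss on this provider in this region
    p95_latency_ms: float    # recent 95th-percentile latency to end users

MAX_LOSS = 1.0
MAX_P95_MS = 120.0

def pick_provider(region: str, stats: list[ProviderStats]) -> str | None:
    """Return the healthiest in-rotation provider for a region, or None."""
    candidates = [
        s for s in stats
        if s.region == region
        and s.packet_loss_pct <= MAX_LOSS
        and s.p95_latency_ms <= MAX_P95_MS
    ]
    if not candidates:
        return None  # an overlay would fall back to a neighboring region instead
    best = min(candidates, key=lambda s: (s.packet_loss_pct, s.p95_latency_ms))
    return best.provider

measurements = [
    ProviderStats("provider-a", "singapore", 0.2, 85.0),
    ProviderStats("provider-b", "singapore", 2.5, 70.0),   # lossy: rotated out
    ProviderStats("provider-c", "singapore", 0.1, 95.0),
]
print(pick_provider("singapore", measurements))  # -> "provider-c"
```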
That was a great answer. I mean, I have had actual conversations with some of the mutual people we
know, some of them on your cap table, about this sort of fundamental infrastructure challenge.
I know a lot of companies right now
that are doing this sort of thing,
well, the thing we all laughed at like two years ago,
like cloud repatriation,
but they're actually racking and stacking their own servers
because they're now getting to the scale
where the margins matter.
And for you, you're in the situation
where it's actually getting the points of presence
and the peering and egress fees low enough, and the bandwidth and the quality, it's so important.
Also, you have this other completely separate modality of traffic.
It's real time, it's streams of data, you can't round-robin load balance it in the same way we
think about horizontal scaling. I'm really, really curious. I have what might be a spicy
question to ask you. I said at the beginning, in 2012, 2013 or something, I did a lot of WebRTC stuff. At the
time, there was basically like one or two WebRTC vendors in the world, and there were a bunch of
telecoms building live streaming with WebRTC on top of their telecom network. Another obvious
vendor in this space that has like some telephony type stuff would be Twilio.
What is it about LiveKit and what you're doing, what you're building, that enables you to be so
much more successful with your audio video approach? Like this network you've built for
basically moving bits and bytes for live streams across the network, versus those providers, like
is there some secret sauce? I mean, you may tell us or not, like, I'm just very curious to
understand. You're clearly winning here. Those other vendors
exist. What's the delta between what you guys offer and what they offer?
I'd say there's two parts to answering this question. The
first thing that I would say about LiveKit in general, okay,
so one thing that I think has led to our success in the space and growth
is that we were not kind of WebRTC or SIP, like the telephony protocol. We were not WebRTC or SIP
engineers.
There's like a bunch of folks in that space
who have been around the space for a long time.
They're all very smart.
I know quite a few of them, wonderful people,
but we did not come at it from that angle.
WebRTC is a protocol that already kind of handles
a lot of the media routing.
SIP has been around for a long time.
They figured out a lot of these fundamental problems
around codecs and compression,
and how do I get the bytes from one end to the other,
and all of that.
That stuff has been around for quite a long time,
as you mentioned.
The part that hadn't really been figured out
when we kind of got to this space and took a look at it
was how do you deploy and scale this?
The LiveKit team, I mean, we definitely have audio and video, real-time streaming experts
on the team, but the other set of folks that we have on the team are extremely good at
distributed systems and scale, right?
Have done this at Meta, Amazon, other places.
And so we kind of took a distributed systems viewpoint
to this problem.
And that's what I think no one else had done at the time
because real-time audio and video was a niche for so long.
It was this niche category and you were using it for like,
up to 50 person meetings and like,
you didn't really need to support massive, massive scale because there just
wasn't user demand for massive, massive scale.
And now there is.
And so I would say that one, the team demographics are aligned with kind of
where the broader macro is going and the needs that come with the user
demand for these applications.
The second thing, and it's tied to what I just said, the second thing is,
you know, maybe you have a spicy question.
This is maybe my spicy take on it without being disparaging at all.
Like I have a lot of respect for Twilio and the incumbents in the space.
But I would say that timing makes such a huge difference.
And I just think that they started at the wrong time.
Not that that was in their control or anything. Maybe they saw the future before everyone else saw
the future, but they saw it 10 years before that future actually came to pass. Twilio built
its business on PSTN and SMS. And yes, PSTN and SMS are everywhere, but they're like cash.
They're kind of dying protocols.
They're gonna be around for another probably 50, 70,
a hundred years, I don't know how long,
but you know, cash is gonna be around for a long time too.
But you know, trajectory wise, they're not growing,
they're shrinking.
Not saying anything about Twilio,
I just mean the use of kind of PSTN and SMS.
And so, you know, newer protocols are out there
and people are connecting increasingly over the internet
versus over these older networks.
You've even seen this kind of in the way Twilio's moved too.
Right?
They saw a lot of growth and they've kind of pushed
into the customer engagement, customer support
kind of vertical, primarily because they have to, right?
They went public, they're on this treadmill
of growing revenue, and you've got to get into a market
that has a lot of money and you know,
you've got to push into a vertical.
There's that pressure to push into a vertical.
And I think for us, we're kind of going horizontal
because the market is also expanding so quickly in AI,
where we're now building like an AI infrastructure company.
I kind of joke, I haven't said this in any interview before, but maybe about six or eight months ago, I was in
a board meeting with, you know, some friends, investors that we share as
friends. And I said, like, LiveKit is building AIWS. And now that's actually
kind of become a real thing internally, like I'll get texted and they're like,
I'm excited for AIWS. But we're really kind of spreading beyond, you know, if you look at what Twilio did, Twilio
kind of builds infrastructure that makes it easy to tap into the telephony network, Cloud
Flare builds infrastructure that makes it easy to tap into like their edge network.
We're kind of moving beyond just transport. We're moving into like storage and compute
and over time, you're going to see LiveKit
kind of have multiple offerings across these three primitives.
But all of those offerings are really focused on this next wave of applications that are
going to be built that have kind of AI or LLMs at the core of them.
And so I think, yeah, it's just a timing issue.
That's the main problem.
Cool.
Well, I think we've built all the way up here to get into our spicy future section.
Spicy future.
Given you just called yourself AIWS, the new AI Jassy.
Maybe you can tell us, what's your spicy hot take of the future?
And I imagine this is gonna be something
to do with AI infrastructure here.
So give what your take, sir.
So my take, I don't know if it's a spicy take.
I don't know how spicy it is.
Maybe a lot of people agree with it now.
You know, I'll tell you,
so when we were raising our A round,
this is about a year ago,
I went and like literally gave this pitch
that I'm about to share to,
I don't know, like 20, 25 VC firms. And everyone told me like, no way, this is like five to 10
years out. I'm not investing, timing's off. And I don't know if it's even going to pan out this way.
What I said, my take was that the keyboard and the mouse are going away. That the predominant interface to a computer is going to be a camera and a microphone.
The computer needs to be able to see you and it needs to be able to hear you and it needs
to be able to speak back to you.
If you think about like the way computers have evolved over time, they've been pretty
dumb, right?
They went from like punch cards.
You literally got to feed this thing like a punch card, like a scantron sort of.
They went from being pretty dumb where you had to adapt
your behaviors to the computer.
You have to use this QWERTY kind of layout of keys on a keyboard and this mouse to give
the computer information so that it could do work for you.
And then now what's happening is that the computer is getting smarter and smarter.
It's starting to behave more and more like you do, right?
It's starting to think more like you do.
And when a computer can do that, when it becomes more like a human, increasingly indistinguishable
from one, how are you going to like actually interact with that computer, that synthetic
human being, that AGI?
You're going to interact
with it in the most natural way that a human interacts with other human beings. And that's
using your eyes, ears and mouth. And the equivalent for a computer is cameras, microphones and
speakers. This is going to especially be true. The other part of that pitch that I gave to
these firms when we were raising our A was that there's gonna be humanoid robots walking around everywhere
and you're not gonna walk up to it
and start typing on its chest or its back or something.
Maybe to override it or something.
If it's, you know, you try to,
you press a button and then you override it,
it's in sci-fi movies, they do stuff like that.
But you're gonna walk up to these things
and you're gonna interact with it in a very human-like way.
And I'd say like a year ago most people didn't believe that. Then the GPT-4o demo happened,
and then a lot more people started to believe that that was true. And I
think fewer and fewer folks would bet against it. So I don't know how spicy it is now. It was pretty spicy a year ago, though.
I mean, I still think it's spicy in the sense that there's still like an open
fundamental question. Like the mode, the UI. Humans are so good at leveraging a UI to communicate
information. Yeah. And is it actually faster to use a mouse and keyboard versus, like, to use
your, you know, to speak, I think is the fundamental question I have,
to convey information.
I actually think for pro users like me,
who have not to brag,
it's not a brag, well, like 150 words per minute or more,
because you've been using a keyboard,
because you've been a nerd
since the inception of your life, right?
Yeah.
That's the fundamental question.
But I have actually a more interesting question to ask you,
which is it seems like your fundamental bet is that model size continues to grow
and precludes the ability for models to move on to the device.
Maybe I guess I should specify what that maybe applies to.
Yeah, I mean, because your whole like just the way I understood your statement is like,
the broad statement here is like,
we're gonna have a lot more video,
we have a lot more audio that needs to be processed,
the process is happening in big data centers
that are not close to you,
we built the network to move the audio video,
how do I invest?
Like, let me buy, this makes sense to me.
And then the last question I had is,
okay, this means we have to have some sort
of formal thought process around the future
of like where models will actually run for inference time.
Is it going to be localized or is it going to be centralized?
And I mean, so you must have some view on that
in terms of your thought process.
Definitely.
I kind of also want to answer your question
about humans being used to UI, but.
I mean, I'd love both.
Yeah. My 20-second answer, and we'll also
get an SPV set up for you for the investment. But the 20-second answer to the
humans-are-good-at-UI question is: is it gonna save me time? I don't know if
you saw the Operator demo yesterday, but you kind of look at it and you're like,
this is really cool. At least for me, I'm like, I can see where the future is going,
it's really cool, but today it's faster for me to just do it.
I would look at, this is kind of spicy too,
I would look at what rich people do.
Like rich people have assistants,
so they just like tell them what they want.
And then the assistant just goes and does everything.
You can kind of think of like where AI is going,
is that everybody is going to have like an executive assistant
that just does everything for them
and knows all their preferences and takes
care of a lot of stuff for them.
So they don't have to, that just doesn't exist.
Computers are too dumb to be able to provide service like that with any kind of
precision or reliability.
And so we've had to become really good at kind of like navigating UI and doing
stuff ourselves because computers haven't been able to do it, but you know,
let's go 10 years from now.
The computers are going to be super good and they're going to be able to do
something that like a very rich person hires a human being to do for them,
to save them time. I'm not saying that, you know,
people aren't going to use TikTok anymore, right?
Like you're still going to have your apps,
you're going to scroll through stuff
and you're still going to lean back.
But when it comes to like doing a lot of work
and mundane tasks, I'm going to have a super smart assistant
that is digital that can just do this stuff for me.
So that's my take on that.
It was longer than 20 seconds though.
The part that you asked about kind of like edge models
and like, you know, is our bet that everything's going to be kind of like this massive Rehoboam.
I don't remember the name of that computer in Westworld, but, uh, season
three or four, probably nobody saw those seasons except me.
So.
Models will definitely push out to the edge.
There are going to be models running on the device.
Just to take a humanoid robot as an example, you step off the sidewalk,
like your robot is stepping off the sidewalk into oncoming traffic. You don't have time
to go to the cloud to be like, oh, there's a car coming. Like I got to step back. Like
you just don't have time to do that. You need to have a model that is locally processing,
reacting to things. Human beings have a model in their head that is doing this all the time.
But human beings also do cloud inference.
We take out our phone 40 plus times a day
and we go and we do cloud inference.
We look up information,
we talk to customer support to fix my router.
We do all kinds of things.
Like look up kind of what is the dish
that I can order from DoorDash, you know, nearby for lunch.
We're always doing kind of cloud-based inference, and we're also
doing local inference.
And I think it's going to shake out to be something similar in the future.
There's going to be models that are running at the edge, and there's going to be models
that are running in the cloud.
There's certain kinds of dynamic data that your local model just isn't going to know.
We've talked about this internally at LiveKit before, and one thing I've said is, if we're building the human brain with these LLMs or with AGI, where
is the kind of natural analog to a brain that
has all the information in the world at any one time?
There isn't one.
So how are you going to build a digital one?
Can you even fit all of that information
into a model that runs locally?
My model and my brain runs locally,
and it doesn't have all the information
in the world. And so it's going to be a hybrid is my guess. And then the other part about that
hybrid approach is now we're kind of talking about what is the UX of that hybrid approach.
Two parts where I think LiveKit is going to be relevant. The first one is the model isn't always
going to be able to answer your question. And so it's going to have to go to the cloud
to answer your question. Are you going to wait for the model locally to say, I don't know the answer to that, and then go to the cloud? Or are you just going to parallel
process these things where you're sending the data to the cloud, you're having a local model
run inference, and whoever comes back first with the best answer or a suitable answer is going to
win? That's one world where you can use LiveKit
to kind of run this parallel process
where we're managing the part where you're going to the cloud
and then the local model is doing its thing on the device.
That's one area where I think LiveKit is still relevant
for this use case.
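A small sketch of that parallel local-versus-cloud race, using stand-in inference functions rather than any real on-device or cloud API: both start at once, and the first suitable answer wins while the slower path gets cancelled.

```python
# Race a (fake) local model against a (fake) cloud model; first good answer wins.
# Both inference functions are illustrative stand-ins, not real APIs.
import asyncio

async def local_inference(prompt: str) -> str | None:
    await asyncio.sleep(0.05)                # small on-device model: fast
    if "today's weather" in prompt:
        return None                           # it doesn't know dynamic data
    return f"[local] {prompt}"

async def cloud_inference(prompt: str) -> str:
    await asyncio.sleep(0.40)                 # network round trip + big model
    return f"[cloud] {prompt}"

async def answer(prompt: str) -> str:
    tasks = {
        asyncio.create_task(local_inference(prompt)),
        asyncio.create_task(cloud_inference(prompt)),
    }
    while tasks:
        done, tasks = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            result = task.result()
            if result is not None:            # first suitable answer wins
                for t in tasks:
                    t.cancel()                # stop the slower path
                return result
    return "Sorry, I couldn't answer that."

print(asyncio.run(answer("what's today's weather in Austin?")))  # cloud wins
print(asyncio.run(answer("set a timer for ten minutes")))        # local wins
```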
The second area is, and I think especially relevant
for robotics, once you start to embody these models within actual hardware,
is how do you make sure that you have visibility
to what that model is seeing and hearing for security purposes,
for auditability?
Like if your humanoid robot drops and shatters
a dish in your house, and then the person calls Figure or Tesla Bot customer support and says,
hey, like this thing did this action and it was bad.
How are you going to go back and like review what it did
and verify that those claims are true and use that data,
the recording of that data, for training purposes to improve the model in the future?
How do you actually do those things?
That's another area where LiveKit has an application.
It's not for kind of like the back and forth interaction,
but for the kind of recording and simulation
and evaluation and training or improvement of that model,
that kind of workflow is where you can use LiveKit as well.
Cause you always want to be kind of streaming that data to the cloud, even if it's just
for archival purposes, not for kind of online compute.
So that's another area where I can see us being valuable in a hybrid world.
I think we have so much more we want to ask you, but we also don't want to take so much more time from you.
We could easily go into a second episode.
There is so much more fun stuff we can talk about.
Well, last thing. For all our listeners of the Infra Pod, how would they go find you or LiveKit?
What are the best ways to go play with this stuff here?
Yeah, so the best way to get started with LiveKit is,
you know, we're an open source project.
And so github.com forward slash LiveKit.
That's where all our repos are.
On Twitter, we're at LiveKit.
And then for me personally,
if you're building with LiveKit, you have questions,
you need any help,
or just wanna kind of shoot the breeze on ideas,
I love to jam on stuff like that.
So you can hit me up at x.com forward slash DSA.
There's benefits to being an early employee at Twitter.
I get a very short handle.
Super amazing.
Thanks Russ.
It is so fun.
And thanks for being on the Infra Pod.
Thanks so much for having me, you guys.
It was fun.
Thank you.