The Infra Pod - Infra to talk to your AI in real-time (Chat with Russ from LiveKit)

Episode Date: March 24, 2025

In this engaging episode of the Infra Pod, Tim and Ian dive deep into the world of real-time audio and video infrastructure with Russ d'Sa, CEO of LiveKit. Russ shares the story of how LiveKit evolved from an open-source project into a company and discusses the challenges it tackles, including scaling real-time communication, building AI infrastructure, and the future of human-computer interaction. They also explore LiveKit's role in transforming various industries, from customer support to robotics and AI-powered applications.

00:22 The Origin and Evolution of LiveKit
02:47 Challenges in Real-Time Audio and Video
06:18 LiveKit's Impact and Use Cases
18:52 AI and Real-Time Infrastructure
29:43 Future of Human-Computer Interaction

Transcript
Starting point is 00:00:00 Well, welcome to the Infra Pod. It's Tim from Essence and Ian. Let's go. Hey Tim, I'm super excited to have Russ d'Sa, CEO of LiveKit on the podcast today. Russ, tell us a little bit about yourself. How did you get involved in LiveKit? Like, what is it? Why did you get started? What's up, you guys? Yeah, yeah. Well, we started as an open source project, kind of turned into a company a
Starting point is 00:00:27 little bit by accident, started the company or the open source project rather a few years ago, 2021, very different time in the world. We're kind of all stuck at home. Like we are now. I mean, kind of we can go outside, but back then we couldn't and, um, everybody needed real time audio and video infrastructure because the only way you could connect with people was over the internet, using your camera and your microphone.
Starting point is 00:00:51 And so at the time, there just wasn't any open source infrastructure for doing this that made it easy to kind of build anything you wanted. Effectively, like what people were doing was they were just turning everything into a Zoom screen share instead of, you know, leveraging the same technology that Zoom has underneath and embedding that in their application. So started working on a stack for this called LiveKit and kind of blew up once we launched
Starting point is 00:01:19 it in open source and then had real companies that started to ping us and say, hey, I love this infrastructure, but I don't really want to deploy and scale it myself. Can you deploy and scale it for me? I mean, we can pay you money. And so we went and raised a round and started to grow the team a little bit. I think it was just three people when we started on the open source project and we started to grow and built a whole cloud infrastructure kind of network all around the world for this,
Starting point is 00:01:45 of LiveKit servers everywhere. And then, yeah, now we serve a lot of traffic. I'd say like we serve probably as many concurrent users as Fortnite on a Tuesday, not Fortnite on a Saturday. That's like, no joke. They have a lot of people playing Fortnite on weekends. But yeah, it's scaled up pretty well. And I think the interesting thing that happened
Starting point is 00:02:03 in the company's life was maybe at the end of 2022 when ChatGPT, the website came out, I built a demo where instead of texting with it, you could talk to it. I put it out there on, it was called Twitter at the time. And didn't really get a lot of attention, but OpenAI ended up finding that a few months later and we started to work with them on building voice mode
Starting point is 00:02:27 for ChatGPT. And yeah, now everyone's playing around with talking to their computers and it's become like an entire industry. And my co-founder likes to say that I manifested it with this demo that I built early on. But yeah, it's been a wild ride ever since. And now we're kind of very firmly focused on AI infrastructure.
Starting point is 00:02:47 Awesome. That's incredible. I'm really interested. Actually, for context, while I was at Salesforce in 2012, we built this audio-video solution based on WebRTC. So I'm actually pretty familiar with this stuff. Oh, nice. Yeah.
Starting point is 00:03:00 There's a video out there of it, Salesforce SOS. We used it to help customers on mobile phones do their taxes. It was a wild ride. But very early in the days of WebRTC. I'm curious, what was it about LiveKit and what you had built? What were the challenges that the average developer was having using WebRTC? What did you solve with the open source that the average developer couldn't pick up and do? Like what's the problem that was first initially solved
Starting point is 00:03:26 by LiveKit? It's a great question. You know, when we were working on LiveKit, it was really, like, a side project that I was working on during the pandemic. And, you know, I've thought about this in the past, like why did LiveKit end up doing well
Starting point is 00:03:38 when we launched open source? And I think that there's two key reasons why. The first one is that at the time that we started working on it, everything out there was really purpose-built for video conferencing. It was all low code, no code, video conferencing, kind of drop in, zoom onto your website,
Starting point is 00:03:56 or build your own kind of zoom type of platforms or vendors that provided that. There wasn't something that was more general purpose, like a lower level sort of substrate that a developer could use in many different ways. And so I think one reason that what we did resonated with the developer community was we built it to be very kind of low level, multipurpose.
Starting point is 00:04:18 The second thing that I think was surprising, well, maybe not surprising, is that at that time, everybody was doing low-code video conferencing, and where do people do most of their video conferencing? They do it in a web browser. They're having a business meeting, right? And so all of the kind of commercial providers at the time,
Starting point is 00:04:41 they didn't have SDKs on mobile. They didn't have like, they didn't support platforms like React Native and Flutter and all of these kind of new platforms that people were building applications on. And so when we launched open source right out of the gate, we had a web SDK, a React SDK, React Native SDK, a Flutter SDK, an Android SDK, an iOS SDK.
Starting point is 00:05:00 And people were pretty shocked that this open source project with just three people working on it had all of these SDKs on all these platforms right out of the gates when commercial providers didn't. And so I think it was those two things. It was a combination of the fact that you could use LiveKit anywhere to build any application, and that you could build that application across any platform. And that's what I think really resonated and kind of got this like ball moving for the project and then eventually for the company. So I'm very curious because I think this era
Starting point is 00:05:34 of building applications, or even call them AI applications, at this point, it's actually really hard to even fathom what is the kind of application people are even building these days. Because you look at the actual companies and startups that are getting funded, they're building agents, they're doing all kinds of stuff. And you have like big model providers doing all kinds of stuff. I'm just curious like for LiveKit,
Starting point is 00:05:56 because from when you started until now, AI development has moved so fast. What are the kinds of applications people are building with LiveKit, and what do you feel are the best, almost bread-and-butter type of apps you think LiveKit is best suited for? And has that changed over time, right? Or has it been pretty much the same?
Starting point is 00:06:18 It has evolved over time. To give you an example here, like when we started, it was during the pandemic, AI wasn't a hot thing, and the only thing that you use WebRTC for was video conferencing, mainly. What happened a bit later was once we launched LiveKit Cloud, so this horizontally scaling mesh network system of LiveKit servers all around the world, all of a sudden we could now handle large scale. So this was something else that even the commercial providers
Starting point is 00:06:45 at the time couldn't handle, maybe except for Agora, but outside of them, everyone was capped at like 50 or a hundred people in a session. We could handle a million people in a session. And then suddenly all these live streaming companies started to use us instead of using a CDN because a CDN isn't actually truly real time. And so that was a new use case that people started to use us for.
Starting point is 00:07:07 And then of course we allowed you to run WebRTC on the backend so you could build a server program that could consume audio and video streams instead of a human and then that kind of unlocked the multimodal AI use case. Along the way we also had a lot of robotics companies using us. So there were people like Skydio that were building a drone tele-op system where you could kind of take in a camera feed from a drone, and then it's running in a police precinct, has this command center, and they see all this footage from all these different drones.
Starting point is 00:07:38 And then they can tap into one, and then they use data channels over LiveKit to issue command and control, so to steer the drone around and to change where it's looking and position. And so I would say that, like, we already had a pretty diverse set of use cases when we started. So video conferencing, live streaming after we launched cloud and then robotics and spatial computing, like, you know, gather town. You're moving around a world and running into people like NPCs and video games. People were using us for that.
Starting point is 00:08:06 And then kind of as the AI wave started, people started to use us for being able to allow the AI model to see, hear, speak, et cetera. And so that's kind of been the evolution of the use cases across us. And then focusing in on AI specifically, I think was another part of your question. So, you know, we work with OpenAI on ChatGPT.
Starting point is 00:08:28 We power CharacterAI's voice as well, and a bunch of others. I think those use cases are more in this bucket of, like, emergent, like, assistant, virtual assistant, or agent that can help you kind of use cases. Let's just call it, generically, an assistant type of use case. I would call those emergent. They're not really mainstream and everywhere. And you know, as you mentioned, Ian, like 10, 12 years ago, maybe even a little bit more, with Alexa and Siri and stuff, this assistant use case has been around but not very good. And it's getting better now. But I think for voice and video specifically around AI, the here and now use cases that
Starting point is 00:09:13 are already at scale, I think are in two places. One on the audio side or in the voice side is kind of customer support, telephony, any kind of use case where you're using a phone because the phone is already a system where the default input is audio, right? Like when you're sitting behind a keyboard on your desktop, the default interface isn't really audio, it's your keyboard, right? It's text. But like calling something, some IVR system or calling another human being in a contact
Starting point is 00:09:43 center or calling like a restaurant to make an order. That's voice native. And so I think AI is very quickly kind of coming into that space and disrupting that space. On the video side, the space that I see LiveKit already being used quite a bit is in kind of surveillance and observation. So it's not on the video generation side, but on the video computer vision side. To use an example, LiveKit powers 25% of 911 in the United States. And the way that that works is effectively someone can call 911 and if they have a big emergency, they tap the FaceTime button on their iPhone and they're streaming video through LiveKit
Starting point is 00:10:28 to the dispatch center. What ends up happening every week is the dispatch agent actually coaches a person on how to administer CPR over video and they save a person's life from a heart attack. Every week this happens to at least one person. But what's also happening now is that 911 dispatch is putting an agent into that actual call paired with the human dispatch agent who's watching what's happening over video, who's consuming the audio, and then doing things
Starting point is 00:10:56 like helping triage or dispatch out to fire or emergency services or police to bring them into the session that they can go and then observe and figure out what's happening as well. So that's like an example of computer vision. There's other folks like Spot and Verkada who are also putting agents that are connected to surveillance cameras and security cameras and doing things around computer vision through LiveKit as well.
Starting point is 00:11:22 So I think those are the two kind of support telephony on the voice side and then the surveillance robotics use case on the vision side where we're kind of seeing a lot of early traction as it pertains to AI. Very cool. So I think it's really interesting that you talk about like the evolution of how you started with the application builders almost like getting started to build some apps to the point where OpenAI and Character are using you. Like that's a huge difference of scale actually.
Starting point is 00:11:55 Yeah. You know? So I think one is actually you talk about like the support of all kinds of SDKs and stuff that's almost like the entry point of how do you actually even leverage you. But there's also the infrastructure required to build. And so when we talk about like, what is the infra required to power open AI? Like what's the hardest part of getting this right?
Starting point is 00:12:16 What is the infra challenges here to build this? Yeah, so I would say that there's a primary infrastructure challenge that might come as a surprise to the vast majority of developers out there that are working on infrastructure, distributed systems, even at the application layer. And that is the internet wasn't actually designed for real-time audio and video streaming. It just wasn't built for it. It's hypertext transfer protocol, HTTP. It's not hyper audio, it's not hyper video, it's hypertext.
Starting point is 00:12:52 And it's built on HTTP, which is built on TCP, and TCP was not designed for streaming real-time media. UDP was designed for this, and WebRTC is like an abstraction layer on top of UDP. The issue with WebRTC is that WebRTC is a peer-to-peer protocol. So it wasn't really built for scale.
Starting point is 00:13:16 Peers send their media, their audio or their voice or their video, between one another over the public internet. And so there's a latency that is incurred there. Now, let's talk about kind of the AI use case, right? You have a peer that is not a human, but you have a peer that is like an agent, right? Sitting in a data center somewhere attached to a GPU instance. And that agent needs to be able to get your audio and video
Starting point is 00:13:44 so that it can run inference on it and then give you a response. What ends up happening is running these models, it's pretty heavy weight. I mean, I know this project Stargate, it's gonna take a while, but running these models, you need like a pretty powerful data center.
Starting point is 00:14:00 And so you just can't spread that compute everywhere, at least not today. And so you have compute in a few different places, but then you have users that are connecting from all around the world, right? Every corner of the globe. And so if you're just using kind of vanilla WebRTC, you're sending your... The user is sending their audio and video over the public internet,
Starting point is 00:14:21 which is like kind of like the road system of the world, right? There's ditches and bridges and broken, I don't know if there's broken streets. I don't know what that means. There's an earthquake somewhere sometimes, and there's a big crack in the road. It's the road system. And so when your packets are navigating that road system,
Starting point is 00:14:39 it can take a while. It can slow down, and then the user experience gets kind of compromised as a result. There's lag or latency, or the model just freezes, doesn't respond to you, or the audio arrives, and it's not kind of high quality. So the way it understands what you're saying is inaccurate. There's all kinds of issues that happen
Starting point is 00:14:59 if you're just using kind of public internet for routing. And so what you really want to do is you want to have a network. Think of it as like you have these hubs, which are these data centers that have heavy compute. And then you want to have like a bunch of tendrils that are coming out to all the corners of the world. And you're terminating the user's connection as close as possible to them over the public internet. And then once you terminate the connection at the edge, you're routing the audio and video over the private fiber backbone
Starting point is 00:15:28 that kind of connects all the data centers around the world together to that GPU machine. And then you run the inference and then you send the response back over that private internet backbone and out at the edge closest to the user, so that the packets spend as little time as possible on the public internet. That's
Starting point is 00:15:46 really kind of the game for getting the latency down as low as possible and for building kind of like this very human-like experience of interacting with an AI model. And so that's really what LiveKit does: LiveKit has a network of servers all around the world. And then we have this kind of software-defined routing infrastructure that is measuring connectivity between your ISP and data center, between your user, wherever they are, and the compute, and between data centers and other data centers. You can think of it as like we're Google Maps for bytes flowing over a network: we're figuring out the fastest path to get bytes
Starting point is 00:16:24 from the user to the AI model and then from the AI model back to wherever that user is. To give you an example, the original voice mode, the first day that it went out, there was like a user that connected from Kazakhstan. And I was like, okay, you know, like OpenAI doesn't have compute close to that user. And that's how they can kind of benefit from this network infrastructure where we kind of cut down the latency significantly, getting that information from Kazakhstan to ChatGPT.
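(Editorial aside: to make the "Google Maps for bytes" idea above concrete, here is a minimal, illustrative Python sketch of picking the lowest-latency path through a mesh of measured links with Dijkstra's algorithm. It is not LiveKit's actual code; the node names and millisecond figures are invented.)

```python
import heapq

# Hypothetical, illustrative link latencies in milliseconds between a user,
# edge points of presence, and the data center where GPU inference runs.
# A real system would refresh these from continuous network measurements.
LINKS = {
    "user-almaty":    {"edge-frankfurt": 85, "edge-singapore": 120},
    "edge-frankfurt": {"dc-us-east": 95, "dc-eu-west": 12},
    "edge-singapore": {"dc-us-east": 190, "dc-eu-west": 160},
    "dc-eu-west":     {"dc-us-east": 70},
    "dc-us-east":     {},  # GPU inference lives here in this toy example
}

def fastest_path(src: str, dst: str) -> tuple[float, list[str]]:
    """Classic Dijkstra over measured link latencies: lowest total delay wins."""
    queue = [(0.0, src, [src])]
    seen: set[str] = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, ms in LINKS.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + ms, nxt, path + [nxt]))
    return float("inf"), []

if __name__ == "__main__":
    ms, path = fastest_path("user-almaty", "dc-us-east")
    print(f"{ms:.0f} ms via {' -> '.join(path)}")
```

In practice the link costs would be updated continuously from real probes between edge points of presence and data centers, which is the "software-defined routing" described above: terminate the user at the nearest edge, then carry the media over the fastest measured backbone path.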
Starting point is 00:16:54 And so yeah, that's one big scaling challenge is building a network like this. We've been working on it for about two and a half years. It's just, a lot of things go wrong with networking. This is stuff Zoom's been working on for like 10 years. And so that's one big challenge. And then, of course, the scale of these applications is another big challenge.
Starting point is 00:17:13 And so that's another big infrastructure difference I talked about at the start around the internet wasn't designed for audio and video. So we built this infrastructure on WebRTC and this network infrastructure. But then the other kind of difference from traditional web applications and scaling those is that this connection that you have between the AI model
Starting point is 00:17:32 and the user is stateful. It's not stateless. So I might be talking to ChatGPT for two minutes. It might be for 20 minutes. It might be for two hours. And I have this agent that is kind of dedicated to me and stateful. It's listening to me all the time for as long as that session runs. Right?
Starting point is 00:17:51 Like when I call a customer support agent, a human, they're not, you know, talking to five people at the same time, unless they put me on hold and then go to the next person. If they don't put me on hold, they're talking to just me. And that's the same thing with an AI model. It's different from a web application where in a web application, every request and response is roughly uniform workload. And you can kind of like round robin load balance across a bunch of servers. You can't do the same thing with kind of a stateful application.
Starting point is 00:18:19 You have to, it's a different load balancing mechanism, where you have a pool of these agents and you take one out of the pool that's not busy and you connect it to the user, and you make sure you can health check it and handle reconnections and all of that stuff, and basically manage the life cycle of it, and then you put it back into the pool when it's all done and then it's freed up to have the next conversation. And so scaling that system, it's just a completely different paradigm than load balancing across a web application. So that's another challenge that we've had to solve working with some of these large companies.
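(Editorial aside: a minimal sketch of the pool-style dispatch described above, assuming a toy in-process pool rather than LiveKit's real scheduler. An agent is checked out for the entire stateful session, health-checked, and only returned to the pool when the conversation ends; class and function names here are made up for illustration.)

```python
import asyncio
import itertools

class Agent:
    _ids = itertools.count(1)

    def __init__(self) -> None:
        self.id = next(self._ids)

    async def healthy(self) -> bool:
        # Placeholder health check; a real one would ping the worker process.
        return True

class AgentPool:
    def __init__(self, size: int) -> None:
        self._idle: asyncio.Queue[Agent] = asyncio.Queue()
        for _ in range(size):
            self._idle.put_nowait(Agent())

    async def checkout(self) -> Agent:
        # Block until a non-busy agent exists, skipping unhealthy ones.
        while True:
            agent = await self._idle.get()
            if await agent.healthy():
                return agent

    def release(self, agent: Agent) -> None:
        self._idle.put_nowait(agent)

async def handle_session(pool: AgentPool, user: str, seconds: float) -> None:
    agent = await pool.checkout()        # dedicated for the whole session
    try:
        print(f"{user} connected to agent {agent.id}")
        await asyncio.sleep(seconds)     # stand-in for a 2 minute or 2 hour call
    finally:
        pool.release(agent)              # only now is it free for someone else
        print(f"{user} done; agent {agent.id} back in pool")

async def main() -> None:
    pool = AgentPool(size=2)
    await asyncio.gather(
        handle_session(pool, "alice", 0.2),
        handle_session(pool, "bob", 0.1),
        handle_session(pool, "carol", 0.1),  # waits until an agent frees up
    )

asyncio.run(main())
```

The contrast with a stateless web tier is visible in the shape of the code: there is no round-robin over uniform requests, just long-lived checkouts, health checks, and lifecycle management around each conversation.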
Starting point is 00:18:52 So in building this network and this infrastructure at the basic level, are you racking and stacking servers in different countries? Are you signing peering contracts with fiber providers? Like, where's the bottom of the LiveKit stack in terms of the level that you're operating in order to build this network? Are you a Cloudflare? Help us understand sort of the depth of infrastructure you're building and how much of that vertical stack you're managing to create these experiences for people. Yeah.
Starting point is 00:19:21 So I think the answer is, are we a Cloudflare for AI? I would say not yet, just because we write software that runs on top of commoditized hardware in data centers. So we leverage cloud providers. But I would say a big difference between us and everyone else that is in this space is two things. One, I think over time, we will expand to build our own network at the hardware level. That's the only way to really have kind of full control and reliability and observability over
Starting point is 00:19:55 every issue that can go wrong with networking. I mean, the reason why Zoom is so good is because Zoom controls everything kind of from soup to nuts. For a startup, that's just not feasible from a CapEx perspective. It's a bit of a chicken and egg: you need the traffic, the utilization, to rationalize going and building your own data centers. And so there is a pay-to-play actually here, where you have to have a certain number of data centers and a certain number of points of presence around the world just to give a good user experience where people will trust you and use you. But then over time, yeah, you can switch to
Starting point is 00:20:23 your own data centers. And that's how we set up LiveKit Cloud from the very beginning. So the first decision that we made, and this is kind of getting into where are we in the stack and like, what kind of stuff did we have to build at what level of depth? The first thing that we did was we decided that we were not going to build on AWS. AWS one is very expensive, but two, they're expensive because they have a really, really good network and a very wide network. The tricky bit with building on a really good network is that you are actually insulated from a lot of problems that happen by AWS's kind of management of their service. They hide a bunch of the issues underneath. And if you ever hope to kind of roll out your own
Starting point is 00:21:03 data centers one day, you have to know what those issues are so that you can mitigate them in software. Because once you slot in your own data center, you're going to have a bunch of issues, and you might need to take that data center out of rotation without taking downtime, or your application developers on top of you taking downtime. So what we said was we're going to build a multi-cloud system, so we're not going to run on one data center provider.
Starting point is 00:21:26 We're going to run across a blend of them. So Oracle and Volta and GCP and Azure and Linode, all of them together. We're going to run one overlay network. Different providers have better points of presence or better quality in different regions. So we're also going to measure these networks or measure the kind of peering agreements and connectivity between each one in different areas and make real-time routing decisions whether to take a certain provider out of rotation, whether to spin up more capacity on another provider. So we have this kind of software system that is doing this in
Starting point is 00:22:00 real time and making these decisions, and what that allows us to do is, let's say that we have a lot of utilization in Singapore, just to use an example. We can go to Singapore, you know, we can go and partner with a company like an Equinix or someone else and effectively slot in our own data center in Singapore and start to run traffic through it, and get that to a point of reliability; if it goes down or if there's an issue, the software will just automatically route around it. But we can kind of over time start to build our own kind of hardware network piecemeal, not having to swap the entire engine out all at once, you know, mid flight. I think that's like a tricky thing to do. And so in a nutshell, what we're doing is we're effectively pushing as much as we possibly can into the software layer. But then over time we will start to kind of region by region build out our own hardware layer underneath. So short answer,
Starting point is 00:23:00 I already gave you the long answer, but short answer to your question is we run on multiple cloud providers now and over time we will start to kind of roll out our own data centers where it makes sense and where our utilization makes sense to do so. That was a great answer. I mean I have had actual conversations with some of the mutual people we know, something like on your cap table, about this sort of fundamental challenge infrastructure. I have a lot of personal companies that I know right now that are doing this sort of like, well, we all laughed at like two years ago,
Starting point is 00:23:32 like cloud repatriation, but they're actually racking a stack in their own servers because they're not going to the scale where like the margins matter. And for you, you're in the situation where it's actually getting the points of presence in the peering and egress fees low enough and the bandwidth and the quality, it's so important. Also, that you have this other completely separate modality of traffic.
Starting point is 00:23:51 It's real time, it's streams of data, you can't round-robin load balance in the same way we think about horizontal scaling. I'm really, really curious. I have what might be a spicy question to ask you. I said at the beginning, in 2012 or 2013 or something, I did a lot of WebRTC stuff. At the time, there was basically like one or two WebRTC vendors in the world, and there were a bunch of telecoms building live streaming with WebRTC on top of their telecom network. Another obvious vendor in this space that has like some telephony type stuff would be Twilio. What is it about LiveKit and what you're doing, what you're building, that enables you to be so much more successful with your audio video approach? Like this network you've built for
Starting point is 00:24:36 basically getting bits and bytes for live streams across the network, than those providers? Like, is there some secret sauce? I mean, you may tell us or not, like, I'm just very curious to understand. You're clearly winning here. Those other vendors exist. What's the delta between what you guys offer? I'd say there's two parts to answering this question. The first thing that I would say about LiveKit in general, okay, so one thing that I think has led to our success in the space and growth is that we were not kind of WebRTC or SIP, like the telephony protocol. We were not WebRTC or SIP
Starting point is 00:25:23 engineers. There's like a bunch of folks in that space who have been around the space for a long time. They're all very smart. I know quite a few of them, wonderful people, but we did not come at it from that angle. WebRTC is a protocol that already kind of handles a lot of the media routing.
Starting point is 00:25:43 SIP has been around for a long time. They figured out a lot of these fundamental problems around codecs and compression, and how do I get the bytes from one end to the other, and all of that. That stuff has been around for quite a long time, as you mentioned. The part that hadn't really been figured out
Starting point is 00:26:02 when we kind of got to this space and took a look at it was how do you deploy and scale this? The LiveKit team, I mean, we definitely have audio and video, real-time streaming experts on the team, but the other set of folks that we have on the team are extremely good at distributed systems and scale, right? Have done this at Meta, Amazon, other places. And so we kind of took a distributed systems viewpoint to this problem.
Starting point is 00:26:30 And that's what I think no one else had done at the time because real-time audio and video was a niche for so long. It was this niche category and you were using it for like, up to 50 person meetings and like, you didn't really need to support massive, massive scale because there just wasn't user demand for massive, massive scale. And now there is. And so I would say that one, the team demographics are aligned with kind of
Starting point is 00:26:57 where the broader macro is going and the needs that come with the user demand for these applications. The second thing, and it's tied to what I just said, the second thing is, you know, maybe you have a spicy question. This is maybe my spicy take on it without being disparaging at all. Like I have a lot of respect for Twilio and the incumbents in the space. But I would say that timing makes such a huge difference. And I just think that they started at the wrong time.
Starting point is 00:27:23 Not that that was in their control or anything. Maybe they saw the future before everyone else saw the future, but they saw it 10 years before that future actually came to pass. Twilio built its business on PSTN and SMS. And yes, PSTN and SMS are everywhere, but they're like cash. They're kind of dying protocols. They're gonna be around for another probably 50, 70, a hundred years, I don't know how long, but you know, cash is gonna be around for a long time too. But you know, trajectory wise, they're not growing,
Starting point is 00:27:56 they're shrinking. Not saying anything about Twilio, I just mean the use of kind of PSTN and SMS. And so, you know, newer protocols are out there and people are connecting increasingly over the internet versus over these older networks. You've even seen this kind of in the way Twilio's moved too. Right?
Starting point is 00:28:16 They saw a lot of growth and they've kind of pushed into the customer engagement, customer support kind of vertical, primarily because they have to, right? They went public, they're on this treadmill of growing revenue and I got to get into a market that has a lot of money and you know, you got to push into a vertical. There's that pressure to push into a vertical.
Starting point is 00:28:33 And I think for us, we're kind of going horizontal because the market is also expanding so quickly in AI, where we're now building like an AI infrastructure company. I kind of joke, I haven't said this on any interview before, but maybe about six months ago or eight months ago, I was in a board meeting with, you know, some friends, investors that we share as friends. And I said, like, LiveKit is building AIWS. And now that's actually kind of become a real thing internally, like I'll get texted and they're like, I'm excited for AIWS. But we're really kind of spreading beyond, you know, if you look at what Twilio did, Twilio
Starting point is 00:29:08 kind of builds infrastructure that makes it easy to tap into the telephony network, Cloudflare builds infrastructure that makes it easy to tap into like their edge network. We're kind of moving beyond just transport. We're moving into like storage and compute and over time, you're going to see LiveKit kind of have multiple offerings across these three primitives. But all of those offerings are really focused on this next wave of applications that are going to be built that have kind of AI or LLMs at the core of them. And so I think, yeah, it's just a timing issue.
Starting point is 00:29:42 That's the main problem. Cool. Well, I think we've built all the way up here to get into our spicy future section. Spicy future. Given you just called yourself AIWS, as the new AI Jassy, maybe you can tell us, what's your spicy hot take of the future? And I imagine this is gonna be something to do with AI infrastructure here.
Starting point is 00:30:09 So give us your take, sir. So my take, I don't know if it's a spicy take. I don't know how spicy it is. Maybe a lot of people agree with it now. You know, I'll tell you, so when we were raising our A round, this is about a year ago, I went and like literally gave this pitch
Starting point is 00:30:24 that I'm about to share to, I don't know, like 20, 25 VC firms. And everyone told me like, no way, this is like five to 10 years out. I'm not investing, timing's off. And I don't know if it's even going to pan out this way. What I said, my take was that the keyboard and the mouse are going away. That the predominant interface to a computer is going to be a camera and a microphone. The computer needs to be able to see you and it needs to be able to hear you and it needs to be able to speak back to you. If you think about like the way computers have evolved over time, they've been pretty dumb, right?
Starting point is 00:31:02 They went from like punch cards. You literally got to feed this thing like a punch card, like a scantron sort of. They went from being pretty dumb where you had to adapt your behaviors to the computer. You have to use this QWERTY kind of layout of keys on a keyboard and this mouse to give the computer information so that it could do work for you. And then now what's happening is that the computer is getting smarter and smarter. It's starting to behave more and more like you do, right?
Starting point is 00:31:29 It's starting to think more like you do. And when a computer can do that, when it becomes more like a human, increasingly indistinguishable from one, how are you going to like actually interact with that computer, that synthetic human being, that AGI? You're going to interact with it in the most natural way that synthetic human being, that AGI, you're going to interact with it in the most natural way that a human interacts with other human beings. And that's using your eyes, ears and mouth. And the equivalent for a computer is cameras, microphones and speakers. This is going to especially be true. The other part of that pitch that I give to
Starting point is 00:32:01 these firms when we were raising our A was that there's gonna be humanoid robots walking around everywhere, and you're not gonna walk up to one and start typing on its chest or its back or something. Maybe to override it or something. If it's, you know, you try to, you press a button and then you override it, in sci-fi movies they do stuff like that. But you're gonna walk up to these things
Starting point is 00:32:21 and you're gonna interact with it in a very human-like way. And I'd say like a year ago most people didn't believe that. Then the GPT-4o demo happened, and then a lot more people started to believe that that was true, and I think fewer and fewer folks would bet against that. So I don't know how spicy it is now. It was pretty spicy a year ago though. I mean, I still think it's spicy in the sense that there's still like an open fundamental question. Like the mode, the UI. Humans are so good at leveraging a UI to communicate information. Yeah. And is it actually faster to use a mouse and keyboard versus like to use
Starting point is 00:33:01 your voice, you know, to speak, I think is the fundamental question I have, to convey information. I actually think for pro users like me, who type, not to brag, it's not a brag, well, like 150 words per minute or more, because you've been using a keyboard, because you've been a nerd since the inception of your life, right?
Starting point is 00:33:20 Yeah. That's the fundamental question. But I have actually a more interesting question to ask you, which is it seems like your fundamental bet is that model size continues to grow and precludes the ability for models to move on to the device. Maybe I guess I should specify what that maybe applies to.
Starting point is 00:33:48 we're gonna have a lot more video, we have a lot more audio that needs to be processed, the process is happening in big data centers that are not close to you, we built the network to move the audio video, how do I invest? Like, let me buy, this makes sense to me. And then the last question I had is,
Starting point is 00:34:01 okay, this means we have to have some sort of formal thought process around the future of like where models will actually run for inference time. Is it going to be localized or is it going to be centralized? And I mean, so you must have some view on that in terms of your thought process. Definitely. I kind of also want to answer your question
Starting point is 00:34:19 about humans being used to UI, but. I mean, I'd love both. Yeah. My 20-second, and we'll also get an SPV set up for you for the investment, but the 20-second answer to the humans are good at UI, is it gonna save me time question: I don't know if you saw the Operator demo yesterday, but you kind of look at it and you're like, this is really cool. At least for me, I'm like, I can see where the future is going. It's really cool, but today it's faster for me to just do it.
Starting point is 00:34:45 I would look at, this is kind of spicy too, I would look at what rich people do. Like rich people have assistants, so they just like tell them what they want. And then the assistant just goes and does everything. You can kind of think of like where AI is going, is that everybody is going to have like an executive assistant that just does everything for them
Starting point is 00:35:02 and knows all their preferences and takes care of a lot of stuff for them. So they don't have to, that just doesn't exist. Computers are too dumb to be able to provide service like that with any kind of precision or reliability. And so we've had to become really good at kind of like navigating UI and doing stuff ourselves because computers haven't been able to do it, but you know, let's go 10 years from now.
Starting point is 00:35:26 The computers are going to be super good and they're going to be able to do something that like a very rich person hires a human being to do for them, to save them time. It's not, I'm not saying that they're not going to like, you know, the people aren't going to like use TikTok anymore. Right. Like you're still going to have your apps, you're going to scroll through stuff and you're still going to lean back. But when it comes to like doing a lot of work
Starting point is 00:35:49 and mundane tasks, I'm going to have a super smart assistant that is digital that can just do this stuff for me. So that's my take on that. It was longer than 20 seconds though. The part that you asked about kind of like edge models and like, you know, is our bet that everything's going to be kind of like this massive Rehoboam, I don't remember the name of that computer in Westworld, but, uh, season three or four, probably nobody saw those seasons except me.
Starting point is 00:36:14 So. Models will definitely push out to the edge. There are going to be models running on the device. Just to take a humanoid robot as an example, you step off the sidewalk, like your robot is stepping off the sidewalk into oncoming traffic. You don't have time to go to the cloud to be like, oh, there's a car coming. Like I got to step back. Like you just don't have time to do that. You need to have a model that is locally processing, reacting to things. Human beings have a model in their head that is doing this all the time.
Starting point is 00:36:42 But human beings also do cloud inference. We take out our phone 40 plus times a day and we go and we do cloud inference. We look up information, we talk to customer support to fix my router. We do all kinds of things. Like look up kind of what is the dish that I can order from DoorDash, you know, nearby for lunch.
Starting point is 00:37:01 We're always doing kind of cloud-based inference, and we're also doing local inference. And I think it's going to shake out to be something similar in the future. There's going to be models that are running at the edge, and there's going to be models that are running in the cloud. There's certain kinds of dynamic data that your local model just isn't going to know. We've talked about this internally at LiveKit before, and one thing I've said is, if we're building the human brain with these LLMs or with AGI, where is the kind of natural analog to a brain that
Starting point is 00:37:32 has all the information in the world at any one time? There isn't one. So how are you going to build a digital one? Can you even fit all of that information into a model that runs locally? My model and my brain runs locally, and it doesn't have all the information in the world. And so it's going to be a hybrid is my guess. And then the other part about that
Starting point is 00:37:50 hybrid approach is now we're kind of talking about what is the UX of that hybrid approach. Two parts where I think LiveKit is going to be relevant. The first one is the model isn't always going to be able to answer your question. And so it's going to have to go to the cloud to answer your question. Are you going to wait for the model locally to say, I don't know the answer to that, and then go to the cloud? Or are you just going to parallel process these things where you're sending the data to the cloud, you're having a local model run inference, and whoever comes back first with the best answer or a suitable answer is going to win? That's one world where you can use LiveKit to kind of run this parallel process
Starting point is 00:38:27 where we're managing the part where you're going to the cloud and then the local model is doing its thing on the device. That's one area where I think LiveKit is still relevant for this use case. The second area is, and I think especially relevant for robotics, once you start to embody these models within actual hardware, is how do you make sure that you have visibility to what that model is seeing and hearing for security purposes,
Starting point is 00:38:56 for auditability? Like if your humanoid robot drops and shatters a dish in your house, and then the person calls Figure or Tesla Bot customer support and says, hey, like this thing did this action and it was bad. How are you going to go back and like review what it did and verify that those claims are true, and use that data, the recording of that data, for training purposes to improve the model in the future?
Starting point is 00:39:25 How do you actually do those things? That's another area where LiveKit has an application. It's not for kind of like the back and forth interaction, but for the kind of recording and simulation and evaluation and training or improvement of that model, that kind of workflow is where you can use LiveKit as well. Cause you always want to be kind of streaming that data to the cloud, even if it's just for archival purposes, not for kind of online compute.
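(Editorial aside: a minimal sketch of the parallel local-versus-cloud idea Russ described a moment ago. The stub functions and timings here are hypothetical, not a real LiveKit or model API; the point is simply racing both inferences and taking the first usable answer.)

```python
import asyncio
from typing import Optional

async def local_inference(prompt: str) -> Optional[str]:
    await asyncio.sleep(0.05)                 # fast, on-device
    if "weather" in prompt:                   # dynamic data it can't know
        return None                           # signal "I don't know"
    return f"[local] answer to: {prompt}"

async def cloud_inference(prompt: str) -> str:
    await asyncio.sleep(0.30)                 # network round trip + big model
    return f"[cloud] answer to: {prompt}"

async def answer(prompt: str) -> str:
    local = asyncio.create_task(local_inference(prompt))
    cloud = asyncio.create_task(cloud_inference(prompt))
    # Take the local result if it arrives first AND is usable; otherwise
    # fall through to the cloud answer that was already in flight.
    done, _ = await asyncio.wait({local, cloud}, return_when=asyncio.FIRST_COMPLETED)
    if local in done and (result := local.result()) is not None:
        cloud.cancel()
        return result
    return await cloud

if __name__ == "__main__":
    print(asyncio.run(answer("set a timer for ten minutes")))
    print(asyncio.run(answer("what's the weather in Kazakhstan right now")))
```

Because the cloud request is started in parallel rather than after the local model gives up, the user never pays the "wait for the local model to say I don't know, then go to the cloud" penalty, which is the UX argument made above for a hybrid of edge and cloud inference.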
Starting point is 00:39:52 So that's another area where I can see us being valuable in a hybrid world. I think we have so much more we want to ask you, but we also don't want to take so much more time from you. We could easily go on for a second episode. There's so much more fun stuff we can talk about. Well, last thing. For all our listeners of the Infra Pod, how would they go find you or LiveKit? What are the best ways to go play with this stuff here? Yeah, so the best way to get started with LiveKit is, you know, we're an open source project.
Starting point is 00:40:28 And so github.com forward slash LiveKit. That's where all our repos are. On Twitter, we're at LiveKit. And then for me personally, if you're building with LiveKit, you have questions, you need any help, or just wanna kind of shoot the breeze on ideas, I love to jam on on stuff like that.
Starting point is 00:40:46 So you can hit me up at x.com forward slash DSA. There's benefits to being an early employee at Twitter. I get a very short handle. Super amazing. Thanks Russ. It is so fun. And thanks for being on the Infra Pod. Thanks so much for having me, you guys.
Starting point is 00:41:02 It was fun. Thank you.
