Cheeky Pint - The world of voice AI, with Mati Staniszewski of ElevenLabs

Starting point is 00:00:00 Mati Staniszewski co-founded ElevenLabs in 2022, and has since scaled it to the $11 billion leader in AI audio. He's credited with capturing the humanness of speech through realistic emotional inflection, and they're now expanding into everything from agentic workflows to music. Thanks for doing this. Thanks for having me. Maybe a place to start is, describe to me how, like, I know how an LLM works at a high level. Describe to me how an audio model works. Like if we were Karpathy-style looking to build a toy one from scratch,

Starting point is 00:00:35 how does it work? In the early days, you'd try to replicate it exactly how you would replicate it with the human body. So you'd try to build a machine, an analog machine, that would reproduce a vocal tract, effectively. Then that progressed into trying to create effectively like digital signals for speech. Bell Labs was one of the first to try to create a structured set of signals that would represent the speech.

Starting point is 00:00:58 And that is a first precursor to what we would do today. Then you would try to stitch in phonemes, effectively the different sounds of how we speak as humans, and then try to concatenate them together. It's another important part in that equation where, based on the most probable next word, you would effectively take the phonemes from your library of phonemes and bring them together. And then down to modern history, where now we effectively use neural nets, similar to other domains.
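[Editor's sketch] The concatenative approach described above can be illustrated with a toy example. This is a minimal sketch, not a description of any real system: the phoneme symbols, the sine tones standing in for recorded units, the `PHONEME_LIBRARY` name, and the cross-fade length are all assumptions made for illustration.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this toy)


def tone(freq_hz: float, n_samples: int = 1600) -> list[float]:
    """Return a sine tone standing in for one recorded phoneme unit."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n_samples)]


# Hypothetical unit library: phoneme symbol -> stored waveform.
PHONEME_LIBRARY = {
    "HH": tone(200.0),
    "AH": tone(300.0),
    "L": tone(400.0),
    "OW": tone(500.0),
}


def synthesize(phonemes: list[str], fade: int = 160) -> list[float]:
    """Concatenate stored units, cross-fading at each joint to soften clicks."""
    out = list(PHONEME_LIBRARY[phonemes[0]])
    for symbol in phonemes[1:]:
        unit = PHONEME_LIBRARY[symbol]
        # Blend the tail of what we have with the head of the next unit.
        for i in range(fade):
            w = i / fade
            out[-fade + i] = out[-fade + i] * (1.0 - w) + unit[i] * w
        # Append the rest of the unit after the overlapped region.
        out.extend(unit[fade:])
    return out


audio = synthesize(["HH", "AH", "L", "OW"])
```

Each joint overlaps by `fade` samples, so four 1600-sample units yield 5920 samples. The limits of this style are visible even in the toy: the units are fixed recordings, so intonation and emotion cannot adapt to context.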

Starting point is 00:01:29 So you predict the next sound based on, of course, the context of the previous sounds, if it's a streaming speech. If it's, let's say, a context of audio, you will use a combination of predicting the phonemes, but you also use the contextual text element of that work. And here, credit to my co-founder, Piotr, who effectively came up with that new idea of how you can now create voice models, which are both reliable, high quality,

Starting point is 00:01:56 quick, where you would bring a lot of the ideas from transformer models, from diffusion models, into the speech space. Yes. So that prediction of the next token on the phoneme space wasn't something that was possible. You might be, you spoke briefly about this, of like how you kind of operate on the text, on the waveform space. There's also mel-spectrogram space. So like usually you do text, mel-spectrogram, waveform. So what is spectrogram space? It's like a visual representation of how the speech sounds across pitch, across energy,

Starting point is 00:02:22 and then you transform that into a waveform. Got it. So like when WaveNet came along and Tacotron models, they would effectively use text to mel-spectrogram, so that visual representation, and then how you decode and encode that into the waveform to bring it across. And Piotr figured out how to abstract some of those steps and decode and encode them a lot, a lot better. So that predicting of the next phoneme was one of the big pieces. And the second big piece was, how do you bring that context into the equation?
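[Editor's sketch] To make the "mel-spectrogram" idea above concrete, here is a toy version in plain Python: slice a waveform into frames, take the magnitude spectrum of each frame, and pool the frequency bins into bands spaced evenly on the mel scale. This is a simplified illustration under stated assumptions, not how WaveNet, Tacotron, or ElevenLabs compute it: the mel formula is the common 2595·log10(1 + f/700) variant, the pooling is rectangular rather than the usual triangular filters, there is no windowing, and the frame sizes are arbitrary. The decoding step back to a waveform (a vocoder) is not shown.

```python
import cmath
import math


def hz_to_mel(f_hz: float) -> float:
    """Map frequency in Hz to the mel scale (one common formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def frame_spectrum(frame: list[float]) -> list[float]:
    """Magnitudes of a naive DFT for one frame (bins 0..N/2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def mel_frame(frame: list[float], sample_rate: int, n_mels: int) -> list[float]:
    """Pool DFT magnitudes into n_mels bands of equal width on the mel scale."""
    n = len(frame)
    top_mel = hz_to_mel(sample_rate / 2)
    bands = [0.0] * n_mels
    for k, mag in enumerate(frame_spectrum(frame)):
        mel = hz_to_mel(k * sample_rate / n)
        band = min(int(mel / top_mel * n_mels), n_mels - 1)
        bands[band] += mag
    return bands


def mel_spectrogram(signal, sample_rate, frame_len=64, hop=32, n_mels=8):
    """Return a list of frames, each a list of n_mels band energies."""
    return [
        mel_frame(signal[start:start + frame_len], sample_rate, n_mels)
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


SAMPLE_RATE = 8000
# A pure 1 kHz tone: all of its energy should land in a single mel band.
signal = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(256)]
spec = mel_spectrogram(signal, SAMPLE_RATE)
```

The result is a small time-by-frequency grid, which is the "visual representation" referred to in the conversation: time runs along the frames, pitch along the bands, and the stored values are energy.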

Starting point is 00:02:49 So what I mean by context is, the voice actor was reading a textual copy. You would know that, okay, this is a dialogue sequence. I need to produce a dialogue. If it's a happy sentence, I might need to pronounce it as a happy sentence. But kind of what happens before and after comes into the equation, and you need to bring that across.

Starting point is 00:03:07 And then there's a last big piece. So voice model has the sound of how you intonate the given fragment. But the second big part is the voice itself of the characteristics of accents, of style, of prosody across that voice. So when you actually try to vocalize something, when you create that voice model, you turn text into audio,

Starting point is 00:03:28 you need the text, you also need the voice reference of how you want it to be spoken. So here is kind of the second big innovation. So apart from context, it's how you decode and encode those features. So when Bell Labs came up with their initial representation of speech, the big piece there was you would have effectively hard-coded parameters for that speech.

Starting point is 00:03:48 With ElevenLabs models. Hard-coded parameters for enthusiastic speaker, British accents. Exactly, exactly, that kind of stuff, like the set of pitch elements that you can select, set of energy spectrograms you can select from. And in our approach, effectively, you would give the model open-ended ability

Starting point is 00:04:06 to select what those parameters should be. So it's not going to be British, Polish, Spanish, English speaker, but the model will deduce them itself. The same for other sets of parameters that are not hard-coded, whether it's the enthusiasm, whether it's the sadness, et cetera. You're saying kind of Britishness is an emergent property

Starting point is 00:04:24 in your voice models. Exactly. Yeah, and those two big parts. Encoding and decoding of how you create the voice. Super hard problem before and figured out too. How you then construct that in the sense is how you get the context across so you can predict the next phonemes. So how you bring them together in a reliable and stable way

Starting point is 00:04:43 while doing it quick. And these were kind of the two first big innovations in the voice models that continue to today. But okay, so if LLMs reason about text and, you know, parts of words, tokens, as the way they think about the world. What is the equivalent of a token in the voice model? You mentioned phonemes a bunch.

Starting point is 00:05:01 Like, what is that representation? So we do, we store the voice embedding effectively for the speaker. So you need that reference when you produce and create the speech. Yes. Of course, in the input to the voice model, you still get the text, and you bring the speaker encoding. And then when you produce speech, you do operate on the waveform or on that, or effectively on the phoneme level of that speech.

Starting point is 00:05:28 And then when we kind of go to the opposite, so of course, what is a phoneme, fill in my understanding? It's like a syllable deconstructed even to smaller elements. And these are effectively like the human sounds you can produce. Got it. So these would be like the most close to that representation. But of course in our models, now it's going to be a combination of not only operating on phoneme level, you also operate on the text level, you operate kind of,

Starting point is 00:05:53 in both in sync because when you are predicting the context, you need to understand how that sentence will get constructed, and especially if it's more of a streaming real-time use case and like a voice agent setting, you need both parts to work across. So it's similar to how you would operate on the token level on the text side, we operate on the token level on the audio side. It feels like a big part of the magic of Eleven was your voices

Starting point is 00:06:17 were much more human-sounding. How did you accomplish that? So I'll kind of give you a quick synopsis of how we think about the models on the text to speech side today. In any model, you need architecture, you need compute, you need data. So architecture innovations were one thing. The data part was the second big thing. With audio, you will have a lot of audio data available, but frequently you will not have it annotated in the right way. You won't have which speaker is speaking when.

Starting point is 00:06:47 Some of the "what" is annotated, but the "how" isn't. So, like, as we are speaking now, what's the emotions that we use, what are the actions that we use? So we would invest a lot internally on effectively creating our own data labelers, our own team to be able to create those data sets that will be better. And that's a combination of, of course, like semi-automatic techniques, and then manual techniques.

Starting point is 00:07:12 And actually a lot of the models that we did afterwards actually spun out from a lot of that research, too. So the speech-to-text model, initially, was a model we did for ourselves because the models on the market just weren't good enough to annotate that data. And then another brilliant researcher on our team was able to construct it

Starting point is 00:07:27 so we could spin it out as a model that we brought to the customers. So you've just been doing useful stuff in voice, and that has emerged with a whole bunch of products that you might not have expected because you find you're building useful stuff. Exactly, exactly. And that's kind of a combination of data, of being able to do it automatically, creating a team that's coached on voice,

Starting point is 00:07:46 on how to describe it, because most of the labelers out there just aren't as well versed on understanding the audio and voice helped us a lot to bring that back. And then, of course, deploying those models in production, seeing how customers interact with them, having them annotate all of the data helped us to refine those models over time. A very interesting thing on the side.

Starting point is 00:08:06 So we spoke about the speech representation. The first guy who created the speech representation is a guy called Kempelen, von Kempelen. So he created this analog machine that would represent effectively a human vocal tract and try to produce that sound. He'd spent decades on that, and it kind of started producing vowels. But that's the same person that created a chess machine, the first viral, let's say, chess machine, that would kind of simulate playing chess.

Starting point is 00:08:35 This is the Mechanical Turk? It was called the Turk. Yeah, yeah. But exactly, the kind of crazy thing behind it was it was operated by a human. And it was all a fluke. And that's where the Mechanical Turk name comes from, which actually we use a lot in kind of data labeling production to make that work there. Yeah, yeah.

Starting point is 00:08:52 And so we kind of jumped right in. But if you describe the Eleven business today, people think of you as the speech company. How should they actually think of your business? To the extent you can describe the big areas, text to speech, speech to text, voice agents, just like break down the business for us. Cool. So in a nutshell, I describe ElevenLabs as a research and product deployment company. We build foundational audio and voice models,

Starting point is 00:09:18 and then build a platform for businesses to transform how they communicate with their customers, with their employees. And that will apply through AI agents from customer support, sales, hiring, training, all the way through to marketing and storytelling for our creative tools. And in that set, we've created all types of foundational audio models,

Starting point is 00:09:42 so text-to-speech models for producing, speech-to-text models that work over 100 languages and happily beat others on benchmarks, all the way through to conversational models of how you loop them together, to music, to other domains of audio. And then, of course, beyond the models, when you actually bring them to production, that's where the second level of the platform comes in, where that meets the businesses on the specific use case. So on the agent-specific example, it would be how you now connect those models to the knowledge

Starting point is 00:10:12 base, to telephony, to the integrations that you need to perform the actions, how you evaluate and monitor the agent so that it behaves in the right way, how you build the right safeguards. On the creative side, on the marketing side, it's how do you create a good ad so you can create a good video voiceover

Starting point is 00:10:29 for one of the campaigns, how you create an article that's narrated with a specific voice that represents the brand in a good way. So that's where we combine the models and understanding of the customers we work with into one policy platform. Every platform company has this question about how far they go into applications. So how do you think about where you go horizontal and power the whole ecosystem versus where you develop applications?

Starting point is 00:10:55 Because you can imagine there being a whole ecosystem of closed captioning tools that grow up that, again, are built on the ElevenLabs tech. It's not necessarily a space that you would have to go after yourself. I think the big difference, to your kind of question, today we see ourselves as a platform where if you're building a horizontal use case in your business, it's a great place to come. If you have a lot of domain specificity, that's where I see a lot of kind of application companies forming over time, where that's specifically not the spaces we will go into.

Starting point is 00:11:28 And I think it also is interesting when the tech is moving as quickly as it is here. It's one thing, you know, with SaaS where you get these like vertical-specific providers, but I would imagine one of the biggest risks for you guys in being intermediated is if there's, you know, like in this example, a closed captioning service that is on a two-versions-old version of ElevenLabs and hasn't upgraded. That's a problem because you want people to be using

Starting point is 00:11:54 the latest and greatest model that you've developed and you'll be kind of deploying new capabilities every week. And I presume that's part of your thinking is that just when it's moving that quickly, you need to go direct in a lot of cases. That's right. In closed captioning, like, you know, here already now we know that our service is going to be able to tackle like 99.9% of the cases that customers have. And then there's like the added benefit of, we work with healthcare customers where we will create custom models for those customers where it will get that transcription perfectly.

Starting point is 00:12:24 The context is the tricky thing in closed captions where like we talk a lot about a lot of technical stuff on this. Yeah, for sure. And that's where you need like effectively like a dictionary of where they do the tag beforehand, which as we work with the businesses, we know we need to embed in that creation process. We're talking about kind of products here.

Starting point is 00:12:45 And one thing I know is that LLMs are amazing, and you have the usage stats of ChatGPT and Gemini and all the popular LLMs where they're working, people use them a ton. It feels like there's a big product overhang when it comes to voice, where the leading edge voice models are incredibly capable. And yet I was driving home the other day, and I needed to read a PDF when I was driving.

Starting point is 00:13:09 And so I said, okay, I'll just have my phone read the PDF to me. And you can kind of try and hack it with the iOS screen reader, but it doesn't really work with the scrolling. And then in theory you can upload it to Gemini, but you're trying to get it to not summarize it, and it actually just hung when I tried to press the, like, "read this to me" button. And so there was no way I could get my phone to read me something, which seemed like a fairly basic, you know, feature. And, you know, all cars advertise voice control. And yet it sucks. And separately, you know, if you want to, you know, input something to the navigation.

Starting point is 00:13:39 Just no car has a good version of that yet. maybe Tesla does, and an ant. And so, why does it seem like with LLMs and Claude Code and everything, we are using all the capabilities of the intelligence, whereas with voice, we're like living 10 years ago somehow? Well, I'm thinking whether I agree with the premise that we are 10 years behind. In the lived experience of people today, like, they're using Siri's transcription, which has gotten better, and it's still way behind the leading edge.

Starting point is 00:14:08 Yeah, like, there's definitely a piece of, like, I think the technology in many of those cases is ready, there's a deployment gap to what you are saying. It's like automotive or some of the big companies are not adopting that quickly now for bringing that into production. But there are plenty of different problems that you need to fix along the way. I mean, the quality of voice models for them to actually sound good, like this is only like a last-three-years thing.

Starting point is 00:14:34 Yeah, there's three years. It's a three years thing. All right software updates now. So that's three years for the first voice model that can narrate. Yes. Text, they think. Two years ago, you can start seeing the real-time version of that.

Starting point is 00:14:46 And not really, like, it's, I think the real break was like a year ago, where you could start seeing that in production. And then I think over 2025, the big piece that hasn't been possible is how you connect now the real-time voice interaction with something which I think you're referring to, like, it has context of what you want to do, what is the material that you want to read, how does it connect to set of your preferences from the past and gets that across? I think that's like only recently became possible. and where we've seen, like, kind of the big adoption

Starting point is 00:15:13 across the enterprises leading on the technical side. I think this year, it should be on the automotive side too, or some of the applications. Okay, so you think we'll start seeing kind of great voice models in cars this year? This year for their own cloud use cases, like on car, in car, so without connectivity, not yet. There's deployment, of course, gap of, like, how you bring that into the gaps. But I think, like, the next two years, three years,

Starting point is 00:15:41 How about the PDF reading use case? That should work. Yeah. But how should I have done it? So back in the day, I'll preface this with a story to cue ElevenReader. But we had this problem. We had so many audiobook authors come into ElevenLabs. So 2023, released first software.

Starting point is 00:16:00 We had a lot of creators and then a lot of audiobook authors or book authors that couldn't afford professional narration and wanted to create an audiobook. However, none of the companies accepted AI audiobooks. You can't sell AI audiobooks on Audible or something. Exactly. Audible would, like, block AI content. So we had no choice. Like we need to create an avenue for them to bring them. Because there was no distribution for AI audiobooks.

Starting point is 00:16:24 Exactly. So we created ElevenReader. And that kind of came with functionality where you can upload your PDF. You can upload your text and have it read out loud with a number of incredible voices. So whether it's Sir Michael Caine all the way through to an estate, and working together with Richard Feynman's, where you can have that. And so you were working with the Sir Michael Caines of the world? Exactly.

Starting point is 00:16:44 And then you can actually read it out of that. And that kind of works extremely well. So that works. Now, how can you do it? Actually, I do want everything read to me by Michael Caine. It's a great voice. Shouldn't you guys have a consumer app where I can just do the common voice things? Like, I want to be able to have an Eleven app on my phone,

Starting point is 00:17:01 and then if I upload a PDF to it, it can do the common things that I would like, such as have it read it to me. Yeah, that's exactly ElevenReader. So that works. Okay. The phone makers allow third-party keyboards. Do you think they, do they allow third-party transcription engines, will they, do you think?

Starting point is 00:17:17 The phone makers you said, right? Like Apple and Google. Yeah, they are... So the OS makers. Yeah, not all of them. Android, with Android you can work through it. It's like, you know, variations like Nothing Tech and others.

Starting point is 00:17:29 But yeah, I feel like if you had a popular Eleven app that allowed for transcription, people would use it a bunch, and maybe eventually Apple would say, oh, we should allow third-party transcription engines if that's what people want. I mean, it seems like they might be going in that direction, right? And recently they announced that they'll open up the LLM ecosystem. Hopefully they will do the same with the voice ecosystem, which is kind of similar. But again, I think it's rational to do when it's moving so quickly. Yeah.

Starting point is 00:17:51 The voice assistant paradigm is one of the oldest, you know, UI paradigms in computing. Like the "open the pod bay doors, HAL" from 1969. Yeah. I will claim it's not working yet. So Siri doesn't have the intelligence. And then on Gemini and ChatGPT and those apps, I mean, I want to use the voice mode, but I don't know about you. It just doesn't work.

Starting point is 00:18:19 And so, like, sometimes I'll be using my phone and I'll use the iOS keyboard transcription to type in the field and then, like, say a bunch of stuff and then send it off. But this suggests to me that consumers really want voice mode that works. and yet it's just not working yet for the major LLM apps or for anyone. Why isn't it working yet? It is pretty hard to do, because you want two things. You want to be able to say things that you want, but you want sometimes for it to execute it, sometimes to wait for you to finish and add something in the sentence.

Starting point is 00:18:52 Sometimes you want it to be interactive, so it asks you questions back to clarify and get some of the additional detail. And all of that is actually pretty hard. Like that's where kind of the magical, like ideal version of a voice agent for us comes through, where you need the speech-to-text element, you need the transcription side, you need then the kind of turn-taking mechanism.

Starting point is 00:19:10 So like, when do you finish a sentence, when is it likely based on silence, based on the context, and then sometimes you want it to speak back and clarify, or at least give you the text back to clarify, and then maybe execute a set of instructions. So that problem is still very hard research. So I agree with the claim that, like,

Starting point is 00:19:28 this orchestration side has not like passed a true conversational agent Turing test, where it behaves as you would expect from another person, where you can say, that's the simple way of saying what I'm saying, is that we have passed the Turing test with text LLMs a long time ago, but we're actually nowhere near that on voice LLMs. So it's kind of interesting how that's a final frontier.
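[Editor's sketch] The turn-taking problem described above ("when do you finish a sentence... based on silence, based on the context") can be sketched as a small decision rule. This is a deliberately naive illustration, not ElevenLabs' method: the `TurnState` fields, the thresholds, and the list of trailing words are all assumptions, and a production system would likely use a learned model rather than a word list.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    transcript: str   # partial transcript of the user's speech so far
    silence_ms: int   # trailing silence reported by a (hypothetical) voice activity detector


# Words that suggest the speaker is mid-thought (illustrative, not exhaustive).
TRAILING_WORDS = ("and", "but", "so", "because", "um", "uh")


def looks_complete(text: str) -> bool:
    """Crude context check: does the utterance end on a dangling word?"""
    words = text.strip().lower().rstrip(".?!").split()
    return bool(words) and words[-1] not in TRAILING_WORDS


def should_respond(state: TurnState, short_pause_ms: int = 400, long_pause_ms: int = 1500) -> bool:
    """Combine silence and context to decide whether the agent takes the turn."""
    if not state.transcript.strip():
        return False  # nothing said yet, keep listening
    if state.silence_ms >= long_pause_ms:
        return True   # a long pause ends the turn regardless of wording
    return state.silence_ms >= short_pause_ms and looks_complete(state.transcript)
```

For example, "Book me a table for two and" followed by 500 ms of silence keeps the agent waiting, while the same pause after "Book me a table for two." triggers a response.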

Starting point is 00:19:47 Yeah, I feel like it's going to work in like specific domains, like in a customer support call, passes the voice Turing test, works well. And like, let's take another like spectrum of that, an interactive gaming experience, like a truly interactive one as you would have with another human, in that game, it's so hard and further out there. We haven't passed it yet there.

Starting point is 00:20:09 Yes, yes. Yeah, but I think that's a combination of, like, even like a simpler version of within that, sometimes you might give a response immediately back. Sometimes you need a tool call to get additional information from the database or how you orchestrate that. So, like, that's probably the most common thing we see as we work with some of the companies out there

Starting point is 00:20:27 is you want those systems to orchestrate extremely well where it's a conversational use case, pretty simple. You can route the agent to speak with, but if you need to authenticate, if you need to pull additional information from the database, what do you do, how do you handle that graciously? That's right. And to that extent, I'll agree.

Starting point is 00:20:45 That's just getting, getting there. And we'll hopefully see that. Our goal is to pass the voice Turing test in all those cases or the Turing test for all conversational agents outside of voice too, and I hope we'll all be there in the next year or so. For subscription businesses, a lot of revenue is lost

Starting point is 00:21:04 in those last few seconds before the checkout. Someone has to get up or find their wallet, or they mistyped their card number, or they hit an error, and they just give up and you lose the sale. For a company like ElevenLabs, adding hundreds of thousands of subscribers, even a tiny bit of friction like that,

Starting point is 00:21:19 it would really add up. But that's why ElevenLabs uses Link from Stripe. Customers save their details once, and then they can check out in seconds across more than a million businesses with saved credentials. So, if you want a faster checkout for your customers, you should turn on Link from Stripe.

Starting point is 00:21:34 Are you guys working on personalized voice transcription, where it feels like part of the way we're making it hard for ourselves is when I speak to Siri, I have a bit of an accent, and so it sometimes has a hard time understanding me, but my accent doesn't change. And so it could just get good at listening to John. But my understanding is it's not. It's just like running the global voice recognition model. And I'm guessing it's the same for ElevenLabs, for you, running the global voice recognition model, but again, you have an accent. And so if someone's understanding, like, if you walked up to someone in a coffee shop and said two words, they might have a hard time understanding it because they're not putting it through their

Starting point is 00:22:15 Mati Polish-accent filter. And so where's this going with, like, actually interpreting the person that you know to exist on the other side? Yeah, I have a very tricky one to detect. So my voice is frequently used in the test. Ah, you're part of the test suite. Yeah. For text to speech, for like, yeah, yeah, it's pretty tricky. But again, trying to parse your voice in a global model is just making life hard. It's like, have a Mati-specific model. Yeah. So on the speech-to-text, the transcription side, exactly.

Starting point is 00:22:44 Like the big part now that we are bringing in is you have two parts. One, effectively like a person- or a voice-specific detection, which is true for the accent side, but it's also true for a crowded room. So that's where we have an incredible research team that's able to continually both keep the accuracy high, but also add things like speaker detection, of course, noise reduction. But the second part is also keyword detection. So there's specific words that you would want to say in those settings

Starting point is 00:23:17 that you want to effectively monitor for. So we spoke about, you know, like let's say I'm going to the coffee shop and order things. The set of actions, like the coffee shop would expect me to do. There's information theory. It's like they can just listen out for the coffee words. Exactly. And then try to like match it to the closest proximity.

Starting point is 00:23:33 So like both things will help in a setup where you have my voice perfect, you can decode and encode it on that. If you don't have my voice or even if you want to double amplify it, we already support effectively a keyword detection, which is useful for like real-time setting and async setting. So back to like Cheeky Pint transcription, you could effectively pre-generate that from the previous podcast and look for a set of words that you would use traditionally in that.
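[Editor's sketch] The keyword idea above (listen for the words a setting makes likely, then "match it to the closest proximity") can be illustrated as a post-processing step on transcribed words. This is only a sketch under assumptions: it uses string similarity from Python's standard `difflib` module as a stand-in for whatever acoustic or model-level biasing a real speech-to-text system applies, and the `COFFEE_TERMS` list and the 0.75 cutoff are made up for the example.

```python
import difflib


def bias_to_keywords(words: list[str], keywords: list[str], cutoff: float = 0.75) -> list[str]:
    """Replace each word with the closest expected keyword, if one is similar enough."""
    by_lower = {k.lower(): k for k in keywords}
    result = []
    for word in words:
        # get_close_matches returns the best candidates at or above the cutoff.
        match = difflib.get_close_matches(word.lower(), list(by_lower), n=1, cutoff=cutoff)
        result.append(by_lower[match[0]] if match else word)
    return result


# Hypothetical vocabulary a coffee shop would expect to hear.
COFFEE_TERMS = ["espresso", "cappuccino", "latte", "macchiato"]

raw = "one expresso and a capuccino please".split()
fixed = bias_to_keywords(raw, COFFEE_TERMS)
```

Here `fixed` becomes "one espresso and a cappuccino please": the two near-misses snap to the expected vocabulary, while ordinary words fall below the cutoff and pass through unchanged. For a podcast, the keyword list could be built from previous episodes, as suggested in the conversation.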

Starting point is 00:24:02 And so how hard, okay, so you do the keyword detection already, but how hard are the, I want to get superhuman transcription performance by feeding it an hour of Mati audio before it listens to Mati, and then it should be able to do a much better job transcribing? Is that just a really hard research problem? No, solvable. We think we can, we can roll it out in one of the next versions, which is like hopefully in the next month. Oh, so you think this year you're doing person-specific transcription? Person-specific transcription. Like, we can already diarize speakers extremely well. So, like, if we are speaking, it can of course distinguish who is speaking when. Yes. Which is like, on the transcription side, apart from accuracy, diarization is one of the harder problems. And we do that extremely well. And now it's going to be like effectively what you're saying, like fine-tuning based on the speaker that I want to listen to.

Starting point is 00:24:51 Yes. Which we know will be important. I mean, like in a healthcare setup, such an important part. You're in an operating room, you're a doctor, you want to say a command, then you want to really be able to listen to that one person-specific piece. You have a hardware device at home, let's say it's a remote that helps you control the TV. Here too, you will want that to listen to you versus, let's say, the family roaming around. Or maybe you want it to listen to everyone.

Starting point is 00:25:15 So, like, you could decide it, but in many cases, you want to be able to specify that. Okay, that's really exciting. It's great because there's, like, still so many unsolved research problems. Yeah, just like, there's just breakthrough after breakthrough coming in the domain of voice models. How about on the flip side, when it comes to speech generation, you know the Zoom "touch up my appearance" feature?

Starting point is 00:25:37 Yeah, yeah. I've always thought about that in the context of voice, where should you offer a de-accenting filter for voices? Or like, even there's one podcast that I like to listen to, but the voice is a little mumbly, and I always thought they should put it through a demumbling filter. Just to, like, make the enunciation a little better.

Starting point is 00:25:57 But all these things, again, like Photoshopping an image. There's no reason that the, like, have you thought about voice to voice, basically, rather than voice to text or text to voice? Yeah, so there are kind of two big parts. One, on the speech generation side, similar, so much innovation still there, there's like a wider piece, and that's like the, we released a v3 model that kind of, we're solving that for the first time, is like, can you control speech?

Starting point is 00:26:23 So you can have the text to speech, you generate something that sounds emotionally great. previously until end of last year, effectively you would rely on model to decide what's the best performance. You could regenerate it, but ultimately model decides the best performance. So that's where the controllability came in, where we can finally give it cues of, say it in a slower way,

Starting point is 00:26:42 or change how you deliver the dramatic pause, or kind of any cues that you give. And to be able to do that, you need the architectural changes and the data that we kind of created over time, where you annotated what was said and how it was said, so you can actually train the model to do that. So today, finally, you can have both speech generation or an entire voice agent experience with what we call expressive mode, where the agent knows the emotions on the other side.

Starting point is 00:27:07 So if the person is stressed, it can react and be reassuring. And that's generating an LLM response on the reassuring side and a response in that set of emotions too. And that break was super hard to do. And that, of course, stretches to a lot of what you said. It could be some version of speech enhancement, either real time or in post, a setup to change how that's delivered. And that's a relatively recent innovation. And it's like we know it can still be so much better.

Starting point is 00:27:35 Like the edge cases of how you want to describe it is pretty large. So that's one. And then the second part of the question, which is a huge question, that's speech-to-speech models. So as you said, our approach, as you think about voice agents, conversational agents, is effectively a cascaded approach. You use transcription and speech-to-text, LLM, text-to-speech, and orchestrate all of that together. And then you have a speech-to-speech, which kind of goes directly from speech and there's

Starting point is 00:28:01 speech response on the other side. And when you say speech to speech, is that the idea that it doesn't go through text as an encoding in the intermediate stage? Oh, interesting. Okay. For performance reasons, for accuracy reasons? You usually go for latency. Okay. It's faster to run a model that does not have to transcribe and then generate.

Starting point is 00:28:19 Exactly. It's quicker, but on the flip side, you lose reliability. Yes. You lose, like, all visibility into the parts of the pipeline. And emotionality, we think you can deliver on both sides extremely well, and maybe you can make it more controllable too. So today we are optimizing heavily on a cascaded approach. I'm sorry, a cascaded approach is...

Starting point is 00:28:37 Is the speech or text? Going through the text layer. And as we work with, like, all of the businesses and enterprises, they will need that visibility into what happens. They will want to execute certain tasks on top of that. They want good visibility into each of the steps and great accuracy of all the models. But beyond that, they can abstract away what's the LLM layer, what's the intelligence layer, and the integrations are easier in that system.
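The cascaded approach described here (speech-to-text, then an LLM, then text-to-speech, with visibility into each step and a swappable intelligence layer) can be sketched roughly as follows. This is a toy illustration, not ElevenLabs' implementation: `run_cascade`, `StageLog`, and the three stand-in stage functions are hypothetical names, and the stages only shuffle bytes and strings so the example runs without any real model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StageLog:
    stage: str
    output: str


def run_cascade(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> tuple[bytes, list[StageLog]]:
    """Run speech-to-text -> LLM -> text-to-speech, logging each hop."""
    log: list[StageLog] = []
    user_text = transcribe(audio)
    log.append(StageLog("speech_to_text", user_text))
    reply_text = respond(user_text)          # the LLM layer is just a callable, so it can be swapped
    log.append(StageLog("llm", reply_text))
    reply_audio = synthesize(reply_text)
    log.append(StageLog("text_to_speech", f"{len(reply_audio)} bytes"))
    return reply_audio, log


# Stand-in stages (assumptions for the sketch; no real models involved).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply, trace = run_cascade(b"what is my balance", fake_transcribe, fake_llm, fake_tts)
for entry in trace:
    print(entry.stage, "->", entry.output)
```

Because every hop passes through text, each intermediate result can be logged or audited, which is the visibility a direct speech-to-speech model gives up in exchange for lower latency.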

Starting point is 00:29:02 So that's like where we are betting a lot of the research work of how you can make that great, and we think we can make that great. And speech-to-speech, as you think about maybe more of like a companion version of the applications, that's where that will flourish because maybe the hallucinations aren't as important, but the latency is a little bit more, and maybe hallucinations are even a feature. And maybe in the future, future, just to finish that part, you will have like some version of combination of the models. That for like low complexity, easy models, you will have speech to speech.

Starting point is 00:29:29 And for, like, higher complexity, you will have the cascaded. Okay. So I was going to ask about this. Like, there is research on how the invention of writing changed humans' brains and just, like, changed the neural pathways in ways beyond the actual written language. Do you observe that speech-to-speech models think differently than cascaded models? Like, it sounds like they're dumber.

Starting point is 00:29:59 They are definitely dumber. You need smaller model, you cannot. But that's interesting, right? That, like, forcing models to reason about text, I mean, I know they just have much more in there as well, but they're smarter. Yeah, but it's like, you know, like, if you are going speech-to-speech, usually, you will use smaller models, so it's still quick.
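The possible future mix mentioned above, speech-to-speech for low-complexity interactions and the cascaded pipeline for higher complexity, amounts to a routing decision. A minimal sketch follows; the two boolean signals and the function name are assumptions chosen for illustration, since the conversation does not say how complexity would actually be judged.

```python
def choose_pipeline(needs_tool_calls: bool, needs_audit_trail: bool) -> str:
    """Pick a pipeline from two hypothetical complexity signals."""
    if needs_tool_calls or needs_audit_trail:
        # Higher complexity: keep the text layer for reliability and visibility.
        return "cascaded"
    # Low complexity: favor the lower-latency direct path.
    return "speech_to_speech"


print(choose_pipeline(needs_tool_calls=False, needs_audit_trail=False))
print(choose_pipeline(needs_tool_calls=True, needs_audit_trail=False))
```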

Starting point is 00:30:15 Yeah, yeah, yeah. I see, so it's also just a model-size thing. Yeah, yeah. Okay, but are there interesting differences beyond, like, correlates like size? What I can say, it's like slightly different to your question. The people interacting through voice

Starting point is 00:30:29 and the performance we see for, like, how they interact with the business changes just by nature of interacting with voice. A good example: you can contact ElevenLabs and register your interest, you go through the form, and at the end of that, we supplemented that so instead of going through the form process,

Starting point is 00:30:47 you can speak with our agent and leave more details. And what happened are two things. One, people were actually much more keen to leave the forms through speaking with the agent, so we would go through the form a lot easier. But second, they would be a lot more open-ended in terms of what the use cases are. So they would start giving us information

Starting point is 00:31:05 about the wider set of use cases, the complexity of the use case. So the writing out was tedious and tricky. This is like an open-ended adventure game. You could ask follow-up questions, you can clarify. But people were just more at ease and could trust the system while doing that, that it's working.

Starting point is 00:31:22 And that kind of helped us. And then three, which maybe is more of a technological barrier, it also works across all languages. So now we have leads from, like, all parts of the world coming in and leaving their details. So we did that use case. And now we have a few different companies building their AI SDR versions of that too to help

Starting point is 00:31:38 them capture the leads coming in from banks all the way to actually one of the automotive companies that leaves that where people are just more keen to speak through voice. So I want to ask about this kind of a second order effect. You have, you know, you talked in the past about how growing up in Poland, I guess the dubbing of TV shows, they were cheap and so they would only have one voice actor for a TV show. So no matter all the parts, male and female, they're like, I love you. I love you too. You know, there's like one voice actor doing all of them. And now, you know, thanks to better voice models, you'll be able to just have like really good voices, AI generated, for all the dubbing. Because again, it's not like it's taking jobs from great dubbing that was happening previously. It was, like, awful dubbing, you know, happening in Poland previously. So that's like one example of the second order effects. What are the other second order effects you're seeing

Starting point is 00:32:30 of ubiquitous, good text to speech and speech to text? It seems like across a broad array of languages, because whatever existed in English, this just didn't exist in Polish or Irish or, you know, pick your language. One, like, breaking down the language barrier. You know, the inspiration came from the movie side, but it also applies in any communication setup. Like, in the future, could I travel to another country and speak Polish

Starting point is 00:32:57 or speak English, and that language is then being understood in the local native language? Like from The Hitchhiker's Guide to the Galaxy, this version of the Babel fish, yeah. Exactly. That you can actually understand the world. And voice, of course, will be an interaction layer, but similarly, all of us will have our own kind of extension and voice agents that can help on our behalf. And there are very clear and great examples of that, of people that lost their voice and

Starting point is 00:33:20 can get it back for the first time. We see that everywhere, whether that's people that lost it due to ALS or throat cancer that can get it back. Just recently, there was an example of a patient that had a Neuralink. We worked with them to bring the voice back, so that person could speak with their own voice with the family around. We worked with the lady that lost her voice before she got married. And then finally, the technology became possible. We were able to recreate that voice.

Starting point is 00:33:51 and for the first time she could replicate the marriage ceremony and speak the vows together, which was such a heartfelt moment, probably like the most important from all the work that we do. When you guys talk about voice agents, is a voice agent just the idea that you have some long-running or persistent agent that is going out and interacting with the world through voice? And so customer service would be one example of it.

Starting point is 00:34:21 you know, in the other direction, your claw going and making you a restaurant reservation and actually calling up the restaurant. Is that kind of how I should think about voice agents? That's right. It's exactly, whether it's like the reactive side of being able to interact with the customer or the proactive to call it back. We recently had a very interesting one. Topical because it was a Guinness-related one where there was a developer developing a Gindex effectively. Oh, I saw that. They were calling all the pubs in Ireland, checking the price of a pint. Yeah, you could like ask that or report information. That was built.

Starting point is 00:34:54 The Gindex is built with ElevenLabs technology. It was built with ElevenLabs too. So, like, people could actually do both sides. Could proactively reach out, reactively reach out, all was captured through voice. And then kind of 3,000 different entities could report their prices and get that across. Have you, by the way, hooked up your OpenClaw to ElevenLabs? Is the OpenClaw and ElevenLabs combo something that a lot of people at ElevenLabs are doing?

Starting point is 00:35:21 So, as you know, OpenClaw will, like, kind of look for the most popular tools frequently when it tries to hook up. So ElevenLabs is one of the recommended ones. It's the top option for voice. Can you tell me a bit about the business of voice models, where I think people have an intuition around big LLMs, where there are these very expensive training runs. And yes, they kind of depreciate quickly, but there's so much usage that all of the models trained to date have paid off their training runs,

Starting point is 00:35:50 and then some, and then there's this kind of ever larger capex going into, I mean, a lot of it is inference these days, but also training. And so you have some intuitions from the LLM world. I'm curious just how I should think about voice here. One, how expensive is training the voice models? Is the expense in the researchers? Is the expense in the training runs?

Starting point is 00:36:14 And I mean, the economics is presumably kind of simple. It's just per usage, but yeah, do talk us through the business. Yeah, definitely cheaper than the LLM and image and video models, significantly smaller models. Yeah. Okay, so the models are smaller. Smaller, smaller. What's a parameter count for a leading edge voice model?

Starting point is 00:36:32 A few billion to low tens of billions of parameters. Yeah. So... And for context, I think the... I mean, kind of like, you know, CPUs moved away eventually from gigahertz as, like, the metric as they moved to more cores. I think we've mostly moved away from just raw parameter count, but I think the leading edge of LLMs are in the hundreds of billions of parameters.

Starting point is 00:36:52 I think the leading ones, yes, but of course, you know, you have the variations that you will use at lower scale. So capex is still pretty high. We've, of course, raised recently half a billion at an 11 billion valuation. Makes sense. Makes sense. To continue being able to build the best models in the world. Researchers, you know, of course, you want the best people in the world.

Starting point is 00:37:15 I think we have those people working in audio, and my co-founder, who is leading that work. So that's definitely a big piece, like, not financially, but even, like, how you keep them ambitious about deployment. Continuing to build leading models helps you attract more talent. And then on how we serve it, it's of course inference. It's correlated with how the models are used, and for us, we've seen incredible growth across the work. Mostly this is charged as follows: if it's input text or text to speech, it's usually per text token; if it's voice agent or transcription, then it's per minute. And we see that kind of being the bigger part, but broadly it's on a per-token basis. And of course as we

Starting point is 00:38:02 work with businesses, it's like an annual agreement. The bigger the spend, the bigger the commit, the bigger the discount to get it across. The way we usually do it is, when we have a new model, we try to give it at cost to a lot of the customers, so they can experience the best. It's still usually, like, not as reliable. The newest thing is often the most expensive, whereas you make the newest thing the most economically attractive one?

Starting point is 00:38:25 We try to make it attractive so the customers are, like, you know, like, it's more expensive for us than any previous generation. We don't like, the quality is higher, so we try to keep the prices still competitive to that. I see, you subsidize it, but it's inherently more expensive. Exactly. Exactly, exactly.
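The pricing structure described above (text-to-speech charged per text token, voice agents and transcription charged per minute, and a bigger annual commit earning a bigger discount) can be sketched as below. Every rate, tier threshold, and function name is a made-up placeholder for illustration; no actual ElevenLabs prices are given in the conversation.

```python
DOLLAR = 100  # work in integer cents to avoid floating-point rounding


def usage_cost_cents(text_tokens: int, agent_minutes: int,
                     cents_per_1k_tokens: int, cents_per_minute: int) -> int:
    """Per-token charge for text-to-speech plus per-minute charge for agents."""
    token_cost = text_tokens * cents_per_1k_tokens // 1000
    minute_cost = agent_minutes * cents_per_minute
    return token_cost + minute_cost


def discount_percent(annual_commit_cents: int) -> int:
    """Hypothetical tiers: a bigger annual commit earns a bigger discount."""
    tiers = [(100_000 * DOLLAR, 20), (10_000 * DOLLAR, 10)]  # (minimum commit, discount %)
    for minimum, pct in tiers:
        if annual_commit_cents >= minimum:
            return pct
    return 0


gross = usage_cost_cents(2_000_000, 500, cents_per_1k_tokens=30, cents_per_minute=10)
pct = discount_percent(50_000 * DOLLAR)
net = gross * (100 - pct) // 100
print(gross, pct, net)
```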

Starting point is 00:38:40 And over time, we might do some tricks to optimize it, but, like, we want the customers to, like, experience, because of research, the big thing that we've seen is the reliability of the model in the early days might not be there. And then, two, people don't even know what's possible with that model. So you kind of want the widest set of distribution so people can show the world what's possible. So you can have it, of course, as the distribution mechanism, learn yourself what to improve,

Starting point is 00:39:07 what to change, and then get it out there. Are the voice models just getting bigger and bigger? Like, will we have voice models in the hundreds of billions of parameters, or have we found, like it seems like for certain types of model architecture, there's like an upper limit on, like, the natural size. Have we found that upper limit for voice models? It feels like for specific use cases, like, say, audiobook narration, you probably found that size.

Starting point is 00:39:31 You probably don't need to stretch it too much bigger to make the quality much higher. But for certain use cases, that will probably grow. The thing that, you know, like, I hesitated on the question is: in a cascaded approach, you probably will not see dramatic size changes. You inherently want the models to be quick and reliable. You want to orchestrate them in a smart way. In a fused approach, probably that will get into, like, tens or hundreds of billions of parameters, because you kind of combine, of course, the LLM side and the voice side.

Starting point is 00:40:03 So that will get bigger. But on the just voice, I think it will keep being small. Okay. But there are certain domains where, yeah, we'll see bigger models. That's interesting. Yeah, yeah. It is amazing how it does seem fun from a research point of view, how there are still these various unsolved aspects

Starting point is 00:40:19 and how you guys are just making technical breakthroughs and then releasing them down the product pipeline. That's like a really fun stage of a company's lifecycle. For sure. It's like fun because it feels like we can do innovations on both sides. There's like so much on research side, so much on product side. And then like the kind of, you know, ultimately the biggest path is how we deploy it to the customers,

Starting point is 00:40:40 where Lytentat, SMB, will have very different dynamic than the enterprise. It's not a vendor-SaaS relationship where you just give the product out there for the biggest companies out there, but you are more of a partner in their AI transformation path. So you want the resources to work alongside them to work on, frequently, very new use cases

Starting point is 00:41:00 that were impossible to help create and bring those voice agents to production. So that's like a big, big shift. But the biggest thing, focus is how we bring the conversational agents out there to the businesses around the world. So when you say bring conversational agents is the biggest priority.

Starting point is 00:41:18 Is this for customer service type use cases? Like what are the most popular use cases for conversational agents? Yeah, like we want to be a partner for like full interactions between businesses and their customers or their audience. I'm saying their audience because that will apply in support. Support is the easiest one because that's where it's most

Starting point is 00:41:36 ready. But, like, and that's maybe the biggest difference in how we see ourselves compared to some of the other companies in the space, this can also apply to sales. You can have the proactive side of reaching back. You can have AI SDR versions of that. And then you can have all the way to the marketing use cases, where we are your partner for working on, even outside of the conversational agent space, how you create a great marketing campaign. Yes. And so how does this break down between, you know, we had Des Traynor from Intercom on here, and they have Fin, their agent, and it's a thing in the website that you can go talk to.

Starting point is 00:42:11 And he described a very similar phenomenon to what you described, which is you start maybe thinking, oh, this will help me answer customer support queries. But it becomes like a generic UI for the website, where it's a box you can type in to go do things and understand things. And so why wouldn't you read the docs and design your integration that way, you know, whatever? And so will I have, like, one for text and then one for voice? Will you guys do text too? Just how does that work? Because it seems like this is also succeeding at the text level with Fin and Sierra and all these things. The places where we know we will be able to provide the biggest value are where businesses ultimately today will have either a big portion or most of their interactions coming through voice.

Starting point is 00:42:59 So if that kind of intersection is there, that's where we can provide higher value. And of course, like, if you need a text chatbot there, that's like, if you fix the voice agent, you'll have fixed the text piece inherently as well. But the place where we do optimize today is going to be, like, how do you select the right voice for the right customer interaction, how you pull that

Starting point is 00:43:20 in the pretty complex case of what you mentioned earlier, of, like, how you orchestrate that to pause or look for something deeper into the docs, how it can be an extension of the entirety of the business, so not only in support, but across the entirety of a user journey. But the bottom line is, like, we want to be able to provide value across the entirety of the interactions. Voice is usually a big part of those interactions. And yes, we need to solve the integrations, we need to solve the knowledge, we need to solve

Starting point is 00:43:48 text as part of that. But like we wouldn't, for example, go into what I think will happen in a lot of those cases, like very deeply into reasoning version of those use cases, where you maybe need to like the multi-touch. Yeah, yeah. And a lot of complex actions. A lot of like financial analysis. That would be not something we optimize for. Can we talk about your revenue ramp?

Starting point is 00:44:09 You're just one of the fastest growing startups, period, of the past few years. What's your most recently announced revenue figure? Most recently announced was end of 2025. Whatever number you want to give us. So most recently announced was 350 at the end of 2025. But the best proof of the technology working. So recently we announced our work with Deutsche Telekom and T-Mobile, with Revolut, with Klarna, with Meta, with IBM,

Starting point is 00:44:32 a wide set of use cases. And this quarter was kind of one of the best for enterprise growth, where we had the first quarter hit $100 million in an additional ARR growth, which is crazy. In net new ARR. In net new ARR. Okay, so if you're saying this quarter

Starting point is 00:44:49 was $100 million in the end of the year, I'm no mathematician, but it's up in the $450 million range. And that's versus this time last year, that's a several-fold increase. Just what's working? Like from the outside, I would assume that there's really strong cohort growth within accounts, and then you seem to have self-serve and enterprise businesses that both contribute a lot.

Starting point is 00:45:16 I don't know how big self-serve is, but as a user, I like to be able to fiddle with ElevenLabs and not have to go talk to sales. But maybe you can just talk about what worked to reach 450 million plus of ARR so quickly. Yeah, so exactly. So over 50% is now sales-led to enterprise. Yeah. And you know, like, I think largely the technology that powers a lot of their agentic interactions just became reliable at the same time as high quality over the last year, year and a half. So, like, frequently, and you know this extremely well, you will start the account and then, of course, it continues expanding.

Starting point is 00:45:53 And we see there's definitely a land-and-expand motion in what ElevenLabs brings. And what does that expand look like? Is it like new departments? Is it just the usage starts taking off? When a customer expands. Both, but usually the first part, too, it's like, we try to make it very easy for our customers.

Starting point is 00:46:11 Maybe that's kind of against ourselves, where we give the technology at pretty attractive economics, because we so much believe in the technology providing value. So you can actually try it and test it. And then within that one department. And you think you'll make it up in usage, basically. Exactly. That the usage, the kind of commit, continues increasing

Starting point is 00:46:29 because you know it's providing value. And then it's so much easier to make that a choice. And then, of course, cross-department pollination is there too. And it's like, you know, our work with Deutsche Telekom started on the marketing side. So we did Magenta work and podcast generation. And then it kind of expanded to customer support.

Starting point is 00:46:46 And then it expanded to us working on an agent across the entirety of the network so people can call in and have the agent. So you could see those step changes, step changes across. But we are now 400, 470 people as a company. So we keep on growing. But some of the things that stay consistent is small teams. So we have less than 10 people teams for each of the product or research initiatives or even as you think about sharding some of our go-to-market strategy. Those will be smaller teams, understanding the industry in depth,

Starting point is 00:47:20 understanding the market in depth and going independently and going quickly. So that definitely contributed largely to that. Two, especially on the biggest enterprises, what we found works is, and it's like we have the full spectrum, self-serve, PLG motion, that helps drive distribution, drive, kind of awareness of 11 labs.

Starting point is 00:47:40 And on the completely other end of the spectrum, we have the high-touch, forward-deployed engineering working side by side with the customers to customize the entirety of their work together. Why did you guys do self-serve? Because I presume you have a lot of competitors, where they have tech and it's behind a contact sales form

Starting point is 00:47:58 and you have to go talk to an SDR and then talk to an AE, blah, blah, blah, blah. And you guys just offer the tech available on this. And I'm a huge believer in this. I mean, a huge part of Stripe's growth has been driven by the fact that we just made Stripe available to anyone and built a lot of product around that adoption pattern.

Starting point is 00:48:14 But so many companies seem to skip it. So I'm curious how you guys came to... So many reasons. So many reasons. I think, you know, the quick ones that come to mind: one is feedback. You have immediate understanding of how good your technology is. Two, which is an extension of that. We stand behind our tech.

Starting point is 00:48:31 We believe it is the best in the world for models, for voice agents, for deployment. So we want people to experience that. And I think you do that the same in Stripe, where the best version of the technology is available to everyone. We're just so attractive to actually try it out. We always try to make everything we built for the highest end-use cases, bring it back to the ecosystem free. Frequently, the newest of the use cases, you know, for enterprise, you will need reliability, you need compliance, you need the scale, which we deliver. So frequently, as you develop new technology, it might not be ready for a lot of those

Starting point is 00:49:06 parameters, but it's definitely ready for developers and SMBs. And we love what they are doing because they are showing us the future and effectively helping us find a trajectory of where ElevenLabs should go. I'm totally convinced on it. I'm just always amazed that more companies don't pursue it, where it feels like they're really shooting themselves in the foot by not. Like, did you guys self-serve on Stripe? Or did you...

Starting point is 00:49:27 We self-serve on Stripe? Yeah, for example, you know, ElevenLabs is a huge company. And yet, you started on Stripe on a self-serve basis. You kind of, like, initially, and it's like, you know, we were two of us at the beginning. You try to see what's working in the industry, but you try to think from first principles. So you want to try it out. You want to understand how it works. So the more friction elements before you're trying it out, the less you trust whether

Starting point is 00:49:48 it's available, whether there will be additional payment that's hidden behind some of those steps, so you don't want to go through them. So it's so much. Speaking of Stripe, do you have any Stripe feedback for us? Anything you want us to fix? My most common feedback until recently was, like, why don't you give us a pay-as-you-go, usage-based billing type version? But one of our finance leads, Machek, I know was speaking with your team,

Starting point is 00:50:11 and that was day before. Yeah, yeah. He was great. He was, like, thinking about it for a long time. He's great. He said, like, you guys should buy Metronome. You should buy Metronome. And then the next day, the Metronome acquisition

Starting point is 00:50:22 was announced. So now you have it. So that is my most common feedback, and we'll be launching. Oh, that's a good announcement for this podcast. We'll be launching usage-based billing to everyone. Oh, sorry, I'm shocked you. Oh, as in previously. Pay as you go. Pay as you go. Okay. Previously you had it on an enterprise basis, but everything on the self-serve basis was, like, plans. So we had the subscriptions, yeah. Subscription plans, you can go over them. Yeah. But now we are launching a full pay-as-you-go experience. So you can just try out voice engine, which is effectively this whole orchestration loop, all the way through to any of the models directly.
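The two self-serve billing shapes contrasted here, a subscription plan that you can go over versus full pay-as-you-go, differ only in how a bill is computed from usage. A minimal sketch follows, with hypothetical fees, quotas, and unit prices (none of these numbers come from the conversation), and with "units" standing in for whatever is metered.

```python
def plan_bill_cents(units_used: int, plan_fee_cents: int,
                    included_units: int, overage_cents_per_unit: int) -> int:
    """Flat subscription fee, plus a per-unit charge above the included quota."""
    overage_units = max(0, units_used - included_units)
    return plan_fee_cents + overage_units * overage_cents_per_unit


def payg_bill_cents(units_used: int, cents_per_unit: int) -> int:
    """Pure pay-as-you-go: no fixed fee, every unit is billed."""
    return units_used * cents_per_unit


for units in (10, 100, 300):
    plan = plan_bill_cents(units, plan_fee_cents=2000, included_units=100,
                           overage_cents_per_unit=25)
    payg = payg_bill_cents(units, cents_per_unit=30)
    print(units, plan, payg)
```

With these placeholder numbers, a light user pays far less under pay-as-you-go, while a heavy user does better on the plan; that trade-off is why offering both can make sense.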

Starting point is 00:50:56 Going back to self-serve, I think a new thing in AI is that all self-serve products should have pay-as-you-go as an option. Maybe you want to have, like, a subscription with some unlimited tiers, but I don't know if you had the experience of, like, you're using Claude and you're typing away your queries and eventually you hit some rate limit and it's like, sorry, you've hit your usage limit, and you want to be able to do the thing that you can do in Claude Code, which is just pay per API. It's like, I'll pay for it. And it's kind of very funny as a consumer to not have the option to pay more to use the product more. And so, yeah, I think every AI product will need,

Starting point is 00:51:29 you know, they probably want to have some all-you-can-eat, or most-of-what-you-can-eat, subscription with limits. And then the ability to pay for overage. So it sounds like that's what you're doing. Yeah, exactly. That's what we're doing. The only thing I want to ask you about is, I feel like all CEOs of larger companies today are trying to figure out how do all these AI advancements change the nature of the organization, and how do you redesign your organization a bit around all this new intelligence. And so that could be about what the scaling factor is of, like, the number of people you need to do the work, but it also should be, like, do you need more senior people because they're better able to direct the AIs

Starting point is 00:52:12 and the AIs or maybe you can do the work of what previously would have been junior people do you need more junior people because they're going to be more AI native in how they work do you want smaller teams, do you want bigger teams? How do you actually go to the process engineering of, your finance team should be using Claude extensively? But like, finance teams do not historically, you know, have a lot of home-built software. And so there's all these questions that are floating around. And you have very rapidly built a much more AI-native company. And so I'm curious what lessons we should all be learning from 11 Labs as a large business recently built. And so without the baggage of decades of, you know, how we've always done it.

Starting point is 00:52:52 Yeah. Yeah, we started in 2022, which was, like, a year when the two topics of the day were crypto and metaverse. So just before the, of course, AI flow started. Exactly, exactly. But we had the privilege of, like, kind of scaling through the world when it was all happening. For us, what works, and we really believe in that being a big part of the future: the first is small teams, keeping the teams small and super flat. So both me and my co-founder will have over 15 direct reports each that we'll work with. And most of those people will have that same scale

Starting point is 00:53:27 of direct reports. Okay, so your span of control is way larger than in the traditional company. Normal would be eight. You have double that. And obviously, that's an exponential. Exactly. And of course, you know, there are some teams

Starting point is 00:53:37 which in the short term might not do that. But ultimately, that's where we think is going to be headed. It's like roughly 10 team size within each of those work items. And startups, no offense. But like startups often have pretty wacko management ideas. Like there was a funny tweet, Lord Grant me the confidence of a, you know, early stage startup founder blogging about their management theories. But like, you think this is not a startup effect.
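The "that's an exponential" remark about span of control can be checked with a little arithmetic: if every manager has the same number of direct reports, the number of people below one leader grows as a power of the span with each added layer. The sketch below uses the two spans mentioned in the conversation (8 and 15); the perfectly uniform tree and the function name are simplifying assumptions.

```python
def people_below(span: int, layers: int) -> int:
    """People under one leader when every manager has `span` direct reports."""
    return sum(span ** depth for depth in range(1, layers + 1))


for layers in (2, 3):
    print(layers, people_below(8, layers), people_below(15, layers))
```

Under these assumptions, two layers cover 72 people at a span of 8 versus 240 at a span of 15, and three layers cover 584 versus 3,615, so roughly doubling the span multiplies reach by more than three at two layers and more than six at three.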

Starting point is 00:53:58 This is an AI effect where basically... No, it's definitely a little bit of startup effect too. It's a... I think it out, it's like... Hindsight, hindsight benefit. I'm canceling our stripe changes. Yeah. No, no, it's like...

Starting point is 00:54:11 I need to pre-end it. I'm like, you know, it's a... The hindsight of this may be working. We'll see in next five to 10 years. So much flatter org. Much flatter org. So it works for us. It might not work for all the companies.

Starting point is 00:54:24 And there are some parts where, like, go to market. We still are trying to figure out what's the best way. But smaller teams, flatter org. And I think there are two paradigms, but like generally people being more technical. Or if not technical, even in non-technical teams, having a technical resource. So, you know, we will have a person in ops or in talent that will, we have effectively a tech lead for that team. Yes. That helps them automate a lot of that.

Starting point is 00:54:47 that work and helps up level the rest of the team too. Yes. So there are kind of two parts that are helping. Okay, so talking through this in talent or something like that, is it that you are building your own software for other companies might have bought software like a workday or a greenhouse or something? Is it that they are using the existing software you have better? Is the process that would be spreadsheets in a traditional company are built with software? How do you kind of use the software in these sorts of organizations? Yeah, like sometimes, but we still use a lot of like the traditional vendors.

Starting point is 00:55:22 Like, one pattern is, of course, LLM-ifying everything, like making the data explorable for you to be able to interact with it. Of, like, who's in the pipeline, what worked, who does the best references, like, all of that works. So you can double down on that. But two, it's frequently things that you manually do, where there's a gap between where the agents are today versus what you could do if you have the technical skill set. And a good example is, like, how do you scrape all the right profiles to be able to reach out to the right candidates? So you, like, analyze, you know, I'm not sure how much I should say, but, like, try to detect specific things that we know worked. So you'll bring that across to the people. On the go-to-market side, like, there's just so many things you can do with additional amplifiers.

Starting point is 00:56:12 It goes from understanding what case studies are relevant and creating a good pre-read for you before you go to the meeting, through creating the AI SDR experience that we spoke about, to creating an entire deck experience. So you have a pre-populated deck with the right numbers that is customized to that customer, which you want still the person to go through and develop, but ultimately is in there. So there's plenty of those additional things that you know will amplify the work of the people around, potentially replace some of those easier tasks that are done. And then there's like, you know, we wanted for people to explore the culture at ElevenLabs. We created a voice agent that people can speak with and see what's the culture, but also get prepped for the interviews. I think across many of those teams, like additional benefit of what they can do.

Starting point is 00:57:00 Interesting piece. So, of course, in Ukraine, with the ongoing war, they need to rethink a lot of how their development, their systems, their support workforce for the citizens across the country. And people are in the war zone. They don't have the same access to the information. They cannot rely on the same phone lines. They cannot rely on the same physical services around the country. So they've developed effectively a central...

Starting point is 00:57:22 Are your employees in Ukraine? We had a few, but they reached out because they were developing their central app called Diia. They developed it over the years, but now they were doubling down on how this can be a way of supporting the citizens. And of course, there's an easy part of how you create a first agentic government where you have help with the benefits and what's happening on the front line or education. So that's delivered to everyone or healthcare so you can book your checkup or appointment. So like how you create all of that. And of course, we traveled to Kyiv.

Starting point is 00:57:56 We worked with them on bringing that and making that available for voice so everybody can access it. But the thing we've learned while being there was that model of what we speak about where you have technical resources in each of the teams, they actually have the same in every one of the ministries. So every ministry had technical resources working on creating that agentic version of their work. And then it was like a central digital transformation team that would like assemble this all together to deliver that for the central citizen support,

Starting point is 00:58:24 which I thought was brilliant. That's very tech-forward by Ukraine. So tech-forward. Like the most advanced set of work we've seen. So we got a little bit validated like, okay, maybe technical resources in each of the teams is a good idea. And that works heavily for us.

Starting point is 00:58:38 And, you know, you mentioned some of the other parts, like, do you hire the senior or younger? The main thing we try to filter for, of course, the culture piece is so important. You can scale people, but scaling culture is much harder. So, like, you want to optimize for that being right. And in our case, it's first principles, taking ownership, striving for excellence, but staying humble. And the main thing that's kind of in that ownership part that I think works well for the AI world is agency. Like if you have that agency to explore, regardless of where you are in the experience

Starting point is 00:59:08 cycle, it's going to be a tremendous amplifier to your work. My biggest takeaway from all this has been that around agency, where I feel like high-agency people are the winners of the advances in AI and within organizations, low-agency people will lose out. Yeah, completely agree. Probably the thing that Piotr and I are most proud of as we scale ElevenLabs, the people that are at ElevenLabs, it's been like just the culture and seeing the expansion of the culture, where culture builds the company now rather than any single person or any single product

Starting point is 00:59:45 builds the company. That was probably the biggest validation and happiness. And there is kind of the other angle of that where I think people are like striving to be incredible in their craft and their work, but at the same time have fun in a lot of their work and that kind of combination of agency and just enjoying what you do is probably the best thing we've been able to do to date at ElevenLabs. Well, it sounds like a really fun stage, like we were saying. Interesting research breakthroughs, really fast-growing business.

Starting point is 01:00:14 So I'm sure you're enjoying it. Mati, thank you. John, thank you so much.
