No Priors: Artificial Intelligence | Technology | Startups - The Future of Voice AI: Agents, Dubbing, and Real-Time Translation with ElevenLabs Co-Founder Mati Staniszewski
Episode Date: December 11, 2025

Imagine learning chess from a grandmaster, or negotiating tactics from an expert FBI hostage negotiator. ElevenLabs' voice AI technology is making that unlock possible. Sarah Guo sits down with Mati Staniszewski, co-founder of ElevenLabs, to explore how the three-year-old company is transforming how humans interact with technology through voice. Mati talks about the technical challenges of building foundational audio models, the strategic thinking behind conducting research and deploying products in tandem, and why voice is the ultimate interface for everything from computers to robots to immersive media. They also discuss how the coming revolution of AI personal tutors will shift agentic AI from reactive to proactive support, break down language barriers globally, and even provide the framework for agentic government services.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @elevenlabsio | @matiii

Chapters:
00:00 – Mati Staniszewski Introduction
00:46 – ElevenLabs: Growth and Scale
02:46 – Voice Technology and Applications
06:52 – Research and Product Development
12:36 – Voice Quality and Customer Preferences
17:54 – Agent Platform and Use Cases
23:21 – Choosing the Right Technology Partner
26:43 – The Role of Foundation Models
29:58 – Open Source Models and Future Trends
32:37 – Research and Development Focus
36:53 – Future of AI Companions and Education
41:37 – Conclusion
Transcript
Hi, listeners. Welcome back to No Priors.
Today I'm here with Mati Staniszewski, the co-founder and CEO of ElevenLabs, which was founded to change
the way we interact with each other and with computers through voice. Over three short years,
they've skyrocketed to more than $300 million in run rate.
Mati and I talk about the future of voice in education, customer experience, and other
applications, as well as how to build a multi-segment business from self-serve to enterprise
and a combined research and product company. Welcome, Mati. Sarah, thanks for having me.
And thank you for doing this at 7 in the morning. Our pleasure. It's great we got to finally
do this together. I think a lot of our listeners will have used or played with ElevenLabs at
some point, but for everybody else, can you just reintroduce the company?
Definitely. At ElevenLabs, we are solving how humans and technology interact, and how you can create seamlessly with that technology. What it means in practice is we build foundational audio models, so models in the space to help you create speech that sounds human, understand speech in a much better way, or orchestrate all those components to make it interactive, and then build products on top of those foundational models. You have our creative product, which is a platform for narrations for
audiobooks, voiceovers for ads or movies, or dubs of those movies into other languages,
and our agents platform product, which is effectively an offering to help you elevate
customer experience and build agents for personal AI, education, and new kinds of immersive media.
But it all underlies that mission of solving how we can
interact with technology on our terms in a better way. You started the company in 2022.
That's right. And you've had amazing rocket ship growth since then. I'm sure it's
felt up and down in different ways. I want to ask you about that. Can you give a sense of what the
scale of the company is today? So we've grown to 350 people globally. We started from Europe.
We started as a remote company and are still remote-first, but have hubs around the world,
with London being the biggest, New York second, then Warsaw, San Francisco, and now Tokyo
and one in Brazil. We are at $300 million in ARR, which is roughly 50-50 between self-serve,
so a lot of subscriptions and creators using our creative platform, and then approaching 50% on the
enterprise side using our agents platform. And that's on the classic sales-led side.
And we serve more than 5 million monthly actives on the creative side of the work.
And then on the enterprise side, we have a few thousand customers, from Fortune 500s to some of the
fastest-growing AI startups. I think this is such a, you're an amazing founder, but I also think
that's such an interesting company because it is very unintuitive to, I think, many people
and investors in particular. I don't know if you faced this at the beginning, but I remember
back in 2022, there was a class of companies that allow creation in some way. When we look at
your first business, beyond the research itself, I would put ElevenLabs and Midjourney and Suno
and HeyGen in this category. And I think there's this overall sense of, who really wants
to do this? What was your initial read of how many people want to make voices, or what made
you believe that was going to be much broader? Because if I look at dubbing, for example,
it's not a huge market. I think the first piece was, as you mentioned, that
it's very tricky to do both the product and the research. I'm in a lucky position
that my co-founder and I have known each other for 15 years. I think he's the smartest person I know and has been
able to create a lot of that research work, to create that foundation to then
elevate that experience. But both of us are from Poland originally. And the original
belief came from Poland. It's a very peculiar thing, but if you watch a foreign movie in the Polish
language, all the voices, whether it's a male voice or a female voice,
are narrated by one single voice. So you have a flat delivery for everything in a movie.
A terrible experience. It is a terrible experience. And still, as soon as you
learn English, you switch out and you don't want to watch content in this way.
And it's crazy that it still happens today for the majority of content.
Combining that, and I worked at Palantir, my co-founder worked at Google, we knew that that would
change in the future and that all the information would be available globally.
And then as we started digging further, we realized... In every language, in a high-quality way.
Exactly.
And the big thing was, instead of having it just translated,
could you have the original voice, original emotions, original intonation carried across?
So imagine having this podcast, but say people could switch it over to Spanish and they
still hear Sarah, they still hear Mati, the same voice, the same delivery, which is kind of
exactly what we did with Lex back when he interviewed Narendra Modi, and you could kind
of immerse yourself in that story a lot better.
So that was the original insight, and we then started
digging further, which is that just so much of the technology we interact with will
change. Whether this is how you create, it's still relatively tricky to bring voice alive.
You need to go through the expensive process of hiring a voice talent, having a studio space,
having expensive tooling to then actually adjust it. The tooling isn't intuitive enough to be able
to do this. So all that creation process will and should change to make it easier for
new people with keenness to bring that alive. Then a lot of
the technology wasn't there for you to be able to recreate a specific voice or
create it in that high-quality way. And then, of course, as we dived in further and shifted
away from the static piece, the whole interactive piece is still crazy in the way it functions,
where most of us have seen this technological evolution over the last decades, but you still
spend most of your time on the keyboard, you look at the screen, and that interface feels
broken. It should be that you can communicate with devices through speech, the most
natural interface there is, one that kind of started when humanity started, and we
realized we wanted to solve that. And I think now, fast forward from 2022, many people
carry that belief too, that voice is the interface of the future. As you think about the
devices around us, whether it's smartphones, computers, or robots, speech will be
one of the key interfaces, but in 2022 it wasn't the consensus. And as you think about the market for the creative
side or for that interactive side, it was very clear it would be a huge one.
So even when you think about just the research part of your business, and then you have products
for at least two different markets, and then you have this larger mission. A lot has changed in
the last five or ten years, but it used to be a very strongly held traditional belief that
one must do one thing well in a startup and there's no other path. You're treating this like an
interaction company, a platform company. How did you think about sequencing the research and the
product effort? Does that make sense? Or thinking about new markets? And maybe wrapped up in
that question too is just, where are we in quality on voice as well? Because I would
sort of claim that if the models are not good enough for certain use cases at all, it kind of
doesn't make sense to do product. And I think that's right. When we
started, what we originally did was try to actually use existing models that were in the market
and optimize them. Our first use case was actually a combination of narration and
dubbing on that creative side. And we realized pretty quickly that the models that existed
just produced such robotic, not-good speech that people didn't want to listen to it. And that's
where my co-founder's genius came in,
where he was able to assemble the team
and do a lot of the research himself
to actually create that work anew.
But to your question,
I think the way we are organized internally,
and how we thought about sequencing a lot of that,
was looking at the first problem
and then creating effectively a lab around that problem,
which is a combination of researchers,
engineers, and operators going after that problem.
And the first problem was the problem of voice:
how can we recreate the human voice?
And like you say, it needs that research expertise to be able to do that
well. So we started with effectively a voice lab, which had that mission of, can we narrate
the work in a better way? It was a combination of roughly five people doing that work,
and we sequenced the research first and then built a simple layer on top of that work to
allow people to use it, and then kind of expanded from there with a holistic suite for
creating a full audiobook and then creating a full movie narration.
And then we moved to the next problem, which was the realization that, okay, we have solved the voice, the first problem, great for making content sound human. For that to be useful for us to interact with technology,
you need to solve how you bring knowledge on demand into it.
So we then effectively started the second team, which was a second lab, an agent lab, effectively: a team that would combine researchers, engineers, and operators once more, and would try to fix, okay, we have text-to-speech.
How do we now combine this with LLMs and speech-to-text and orchestrate all those
components together while integrating with other systems to make it easier? And then,
similarly, you kind of expand from looking just at the voice layer into how those
systems work together. And here too, you need research expertise to do that in a low-latency
way, an efficient way, an accurate way. But at the same time, there's that product layer that starts
forming, because it's not only the orchestration that matters. It's also the integration:
how you link up to the legacy systems,
how you build functions around it,
or how you deploy that in production
and test, monitor, and evaluate over time.
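The cascaded loop Mati describes, speech-to-text into an LLM into text-to-speech, can be pictured in a few lines. This is a minimal sketch with invented placeholder functions, not ElevenLabs' actual API; every name below is hypothetical.

```python
# Minimal sketch of one cascaded voice-agent turn: STT -> LLM -> TTS.
# All functions are hypothetical placeholders, not ElevenLabs APIs.

def transcribe(audio: bytes) -> str:
    # Speech-to-text step: a real system would run an STT model here.
    return "where is my package?"

def complete(history: list[dict]) -> str:
    # LLM step: a real agent would also call tools and legacy systems here.
    return f"Let me look that up for you. You asked: {history[-1]['content']}"

def synthesize(text: str) -> bytes:
    # Text-to-speech step: a real system would return synthesized audio.
    return text.encode("utf-8")

def agent_turn(audio_in: bytes, history: list[dict]) -> bytes:
    """One conversational turn, orchestrated as a cascade."""
    history.append({"role": "user", "content": transcribe(audio_in)})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)

history: list[dict] = []
audio_out = agent_turn(b"<caller audio>", history)
```

The product layer Mati mentions lives around this loop: the integrations, function calling, deployment, and the testing and monitoring of each step.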
Do you feel like you were creating new use cases?
When you built the tools, did people know
that they wanted to do this already?
Because one argument that I remember hearing
was, ah, you know,
enterprises don't know what to do with voice,
how many people really want to do it?
And then you're serving essentially,
perhaps, the creator-publisher side of your business.
Yeah.
It's definitely a combination
of initiatives that we believe will happen in the world and then a response to a lot of
that. As I think back, of course, the internal voice lab and then the agents lab
kick-started so many of the other labs in response to the problems. We started a
music lab because people wanted to create music with ElevenLabs. It was a fully licensed model;
people wanted to create speech, but they wanted to add music in a simple way, and we wanted to
deliver that. And then, of course, that kind of came together through, how do we combine
music, audio, sounds? We are now integrating partner models from image and video into that
suite. How could you combine all of that in one? And all of that was in response to the market
saying, hey, we would love this. And then you will have completely different use cases, even in that
space, let's say, dubbing. Dubbing is a use case where we didn't feel there was a big push for
it, but we knew that in the ideal world of the future, you would be able
to have that content delivered naturally across languages while still carrying the original delivery.
And I still think this market will be immense, because it's not going to be only the static
delivery in movies: if you travel around the world and want to communicate in real time,
like the full Babel fish idea from The Hitchhiker's Guide to the Galaxy, this will happen.
It will be the biggest thing, the whole breaking down of language barriers, which are the barriers
to communication, to creation. All of that will break.
And that will be the foundational real-time dubbing concept. I'm
super excited about that part.
And similarly, on the agents side, there are some obvious things that, of course,
customers that we work with, our partners, will want to integrate, which is, we want
integrations with XYZ systems.
But then there are other parts that might not be as easy to predict. As you
interact with technology,
of course you want to understand what's happening,
but you also want to understand how things are being said, and bring that in by default,
which is something we try to prioritize on our side.
So then when people actually interact with the technology, they realize, oh, this expressive thing is actually so much more enjoyable and beneficial and helpful.
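The real-time dubbing idea Mati described, together with this point about capturing how things are said, can be read as a three-stage pipeline: transcribe while keeping who spoke and how, translate, then re-synthesize in the original speaker's voice. Here is a minimal sketch under that reading; every function and field is a hypothetical placeholder, not a real ElevenLabs call.

```python
# Sketch of dubbing as: transcribe (keeping who spoke and how it was said)
# -> translate -> re-synthesize in the original voice. All names invented.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # e.g. "Sarah" or "Mati"
    text: str      # what was said
    emotion: str   # how it was said, e.g. "excited"

def transcribe_segments(audio: bytes) -> list[Segment]:
    # Placeholder STT with speaker and emotion labels.
    return [Segment("Mati", "Voice is the interface of the future.", "confident")]

def translate(seg: Segment, lang: str) -> Segment:
    # Placeholder translation; a real system would preserve tone and intent.
    return Segment(seg.speaker, f"[{lang}] {seg.text}", seg.emotion)

def synthesize_as(seg: Segment) -> bytes:
    # A real system would use a clone of the original speaker's voice
    # and carry the emotion metadata into the delivery.
    return f"{seg.speaker}|{seg.emotion}|{seg.text}".encode("utf-8")

def dub(audio: bytes, lang: str) -> list[bytes]:
    return [synthesize_as(translate(s, lang)) for s in transcribe_segments(audio)]

dubbed = dub(b"<podcast audio>", "es")
```

Real-time translation would run this same loop over short audio chunks rather than whole recordings, which is where the latency work Mati discusses later comes in.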
So I want to ask a question about this, which relates to quality.
You know, I work with a series of companies where we're selling a product and the buyers are generally not machine learning scientists.
Right.
And even the scientific community does not have the full suite of evals and benchmarks to understand every domain well.
That's a well-known problem, but I imagine for a lot of your customers, it's not like they know how to choose a good voice.
So how do you deal with that problem?
Is it, hey, I make a clone and it sounds like me and I believe it,
so I'm going to try all of these different options? Or are you actually teaching people to do evals?
It's a great question, because I think there are two big problems.
One is, how do you benchmark the general space in audio, where, like you say, it's so dependent on the specific voice,
let alone if you are moving into interactive, where it's even more tricky.
And then the second piece is, as you are working on a specific use case, how do you select a voice?
So I'll take the second front first, which is, we have a voice sommelier, effectively.
As we work with enterprises, we deploy that person to work with them and help them navigate.
That person is like a voice coach and has an incredible voice themselves.
And now we have a team under that person that will partner with you to help you find the right voice for your
branding.
And now you have a celebrity marketplace to help you even get
iconic talent in there, like Sir Michael Caine.
That piece was important because, of course, the voice will depend on the use case
that you are trying to build, and the language.
All of that will have an impact on what's the right voice for your customer base.
So we have effectively a voice person helping those companies.
And some companies will be very opinionated on the one they want.
So they will sometimes select it themselves, sometimes give us
a brief of, hey, we want a voice that sounds professional,
neutral, is calming.
We recently had a company, one of the biggest European companies,
that gave us a brief, which was very original,
that they wanted as robotic a voice as possible.
Okay.
That's just counterintuitive.
Yeah. But for example...
And you're like, we can't do that anymore.
Almost.
But we were trying to work backwards on, how do we do that?
And I think we got a good result.
But recently we had a company
in Japan, or rather Japan and Korea, where they wanted to serve different voices depending on
the customer that's calling in.
They have an older population and a much younger population.
For the younger one, they wanted one of the famous voices in the market that's
very excitable and happy.
And for the older one, they wanted a calm, slow-speaking one.
We help a lot with that.
So that's on the voice piece, and I do think it's going to be a big one.
It's like a personalized choice and then it can even be dynamic.
Yes.
Okay.
Exactly.
Exactly.
And then maybe in the future, it's going to fully depend on your interaction.
You'll have a voice created as we understand the preferences of what people want.
So, you know, let's say it's the evening and you are tired and you want something slightly different.
Or maybe not.
Maybe that's your best focus time, so you have a voice that's giving you that energy.
And probably it's different when you wake up and it gives you the morning news of what's happening or what's the weather.
So all of those could be different.
Yesterday we had a dinner with some of our partners
and one of them, the first thing they said is like,
hey, I have a new request for you.
I want a New York voice with a Long Island accent,
which I never knew was a thing.
And it's supposedly a thing.
So we have that.
And then on the first piece,
I think it's still an unsolved problem,
where you have good benchmarks,
of course, in LLMs, and I think in the image space
they are pretty good.
In the voice space, you have, of course, the speech quality,
but then so much of whether you like the speech or not depends on the voice,
such that if you compare model A to model B
and you serve them different voices,
even if the quality is very different,
the voice itself can just make the result so different.
We've seen this. I don't know if you know the Artificial Analysis
benchmarks; I think they're pretty good.
Just switching the voice makes such a big difference.
That's so interesting.
Yeah.
And I wonder if, as you said,
this is the most dominant interaction mode
we've had for millennia, for all of human history, right?
And so, I'm biased and self-serving, but I think so.
We're just very sensitive to it.
And I think people are going to be very sensitive to their own personalization as well.
100%.
I think there's also a third piece, which maybe isn't directly related to your point,
but we've also realized that: so, you have the benchmarks,
you have, how do I find the right voice for my audience,
but even the understanding of how you describe audio data is still lagging in the industry.
When we initially started, we of course went to the traditional players for them to help us label not only what was said, so transcription, but also how it was said: what emotions were used, what accent. And most people just weren't able to do that work effectively, because you kind of need to hear it and have a little bit of a skill set of, how would I describe this specific delivery? We needed to create that ourselves. So I think there is that piece as well: how do you effectively interpret audio data
on a more qualitative basis? That's trickier.
Can you talk about what's happening on the agent platform side?
Like, what is challenging for, you know, businesses or even creators that are trying to build agents
and maybe what the surprising or high traction use cases are?
I think everybody's kind of aware of the idea of, like, agent-based customer support.
But I imagine you're doing many things beyond that.
Yeah.
So that, exactly.
Customer support is probably the one that's kicking off the quickest,
and that's the one where we see it overtaking so many use cases.
It's where we work with Cisco or Twilio or Telus Digital.
All of them are kind of elevating that to a high extent.
I think the second exciting piece within that domain,
which is happening, is the shift from effectively reactive customer support,
I have a problem,
I'm reaching out to customer support, into more of a proactive part of the
customer experience.
So to make it explicit, we work with the biggest
e-commerce shop in India, Meesho, where they started working on the customer support side,
where it's, I want a refund, I want to see the tracking of the package, and moved to actually having an agent
be a front part of the experience. So if you go to the website, you have the widget,
you can engage it by voice, and you can ask it, hey, can you help me navigate to item X,
item Y, or can you explain what's the right thing for me to give as a gift for this
time of year? And then it will actually help you based on your questions, based on what
is on offer, show you those items, navigate to the right parts of the site, maybe go all
the way through the checkout. And I think this will be a phenomenal thing, elevating the
full experience, where it's more of an assistant across the whole thing. We
kicked off our work with Square, which enables all the businesses to do that work, with exactly the same
pattern. It started with voice ordering. How can this now be part of the full discovery experience
too, where you get items shown to you and you can have a lot more explanation,
which I think will be a phenomenal piece
where it's effectively from the beginning to the end.
That's one category.
The second one is the wider shift from static to immersive media,
where there are just so many incredible stories and IP
that today exist in effectively one mode of delivery,
and now you'll be able to interact with that content
in a completely new way.
I think one of the incredible use cases was working with Epic Games.
We worked with them on bringing the voice of Darth Vader
into Fortnite,
where millions of players
could interact with Darth Vader live
in the game, where you had a
full experience of
Darth Vader in a new way.
And I think this will be a theme
across the space, whether it's talking to a book
or talking to a character that you like,
the whole space shifting.
And then I think the one
that I'm most excited about for the world
and for the shift is going to be education,
where you will just be
able to have effectively a personal tutor in your headphones. You could actually study
something in an amazing way. I'll give you two quick examples. One is we recently worked
with chess.com. I'm a huge fan of chess. I'm a huge fan. Okay, great. So you can learn chess,
but you can have Hikaru Nakamura or Magnus Carlsen be your teacher as you play,
which is amazing, or even the Botez sisters, a whole plethora of different players
that engaged with that, which I think is great. And then maybe a last one, which is MasterClass,
which we worked with to shift from, you can of course have the content and go through it step by step,
to, you can also have an interactive experience. And the best example of that was working
with Chris Voss, the FBI negotiator, one of the top negotiators, who has a MasterClass lesson,
but then you can actually call him and have a practice negotiation,
which is crazy.
Yeah, got to get that hostage out.
We'll definitely try it.
Yeah.
Can I add one more?
I think the last one, which combines all of them together, I realized just recently, and it was crazy.
So recently I went to Ukraine, where we are working with the Ministry of Digital Transformation, where they are effectively creating the first agentic government.
And the crazy thing is they have all of those.
Agentic government.
Okay.
So they want to rethink how they run
all the ministries.
Okay.
And it sounds like a big, ambitious, lofty goal.
No, I think the baseline is already here.
So actually, I'm bought in immediately.
Yeah.
And the crazy thing is, I think they are so ahead and actually doing it.
And I think there are two concrete things there.
One, they kind of combine all those use cases.
So we are looking into how they can have effectively customer support for
government, whether it's asking about benefits or employment, or about the process of how
you leave the country, with all of that run through effectively a digital app. Then two,
how you can have a proactive way of informing citizens of things that might be happening,
but then also having an education system that runs through this personal tutoring experience.
And all of that is happening.
So that was incredible to see.
And the second amazing thing was the way they've done it: they have the digital
transformation piece, but they have engineering leaders in each of the ministries who lead
those efforts and then bring them back to that one central piece.
So that is incredible to see, and I'm also proud to be able to be working with them on that shift.
And despite everything that's happening, they're so leading it.
That's really encouraging.
Can I ask you a business model question here?
Of course.
Looking at the strategic landscape, actually, I have many questions here.
One of the observations I'd make is, if I look at one of these rich voice-and-action agent experiences:
there are a lot of, let's say, Fortune 500, Global 2000 leaders who listen to the pod.
I think a lot of them are going to buy the idea of, I want this amazing,
automatic, real-time, available 24-7, every-language experience for my customer
that's consistent and high quality.
The ways I might get there include working with a Palantir or a large consulting firm,
working with ElevenLabs or a similar platform technology company, or an OpenAI or something, right? Let's talk
about that. Or working with a sort of more use-case-oriented company like Sierra, right? How do you
think about how people are making that decision, or how they should make that decision? So my past is
also in Palantir, so I started exactly from that side, and we do blend a lot of forward-deployed
engineering inside of the company too. As I think about our offering and
the customers making that choice:
If you are looking for just one pointed solution
and only that one,
then likely we aren't the best choice.
If you are looking to deploy across a plethora of different experiences,
so it might be customer support,
but then you also want internal training,
and you might want to elevate your sales part
and actually increase the top line with new experiences
of how you engage customers beyond that kind of reactive piece,
then it's a great platform to build on.
And then we effectively, as we engage with customers,
combine that platform work with our engineering resources to help those companies deploy
on it. Or, which we also see increasingly in Fortune 500s and Global 2000s, they will want
to build parts of the things themselves, because they already have a lot of the investment
in that platform, while engaging us on some of the new ones and combining those.
And I think our model, and the way it's different from a lot of the use-case-specific ones,
is that our platform is relatively open,
where you can use pieces of that platform,
and not all of them,
for those different use cases.
Palantir, of course, or some of the consulting companies,
will have a lot more resources
to go on the wider digital transformation journey.
In our case, it's very specifically conversational agents.
If you are looking for a new interface with customers,
that's the best way.
And companies like Sierra are phenomenal, of course,
in how they are thinking about the specific pointed
use case. And maybe the other piece,
the sweet spot of our work,
depends on what you are optimizing for. So
we have a lot of international partners. If you have a
wider geographic user base, great. That's what we optimize for. Our
voices, our languages, our support for integrations internationally
are just so much broader. That's frequently a piece that you will
look into. Depending on your exact scope, this will be a big factor.
But I would summarize that if you are looking for a solution across
a set of different use cases, where you want our engineering help to deploy it, then we are
the right solution and probably the best solution. I want to talk a little bit about OpenAI
and the foundation model companies. One of the reasons I call this podcast
No Priors is because we're like, okay, people are making a lot of assumptions all the time about
how the market is going to work. And lo and behold, many of those assumptions end up being
nonsense, actually. And you have to very much decide your own narrative at this point
in time. I think, correct me if I'm wrong, in 2022 and '23 you probably heard a lot of people
say, Google can do this and OpenAI can do this. And why do you get to persist
working on voice anyway as a general capability? What's the answer? That adds
another element to a couple of the previous questions, whether it's the agents work or
the creative work: to deploy the value in that work, you need a very strong product layer.
piece. But our superpower and our focus for a long time was building the foundational
models to actually make that experience seamless. And as I think about the companies in the
market, they will optimize for a lot of other things. And that will be like the differentiator.
in our case where we will make the whole experience,
especially with voice,
seamless, human, controllable in a much better way.
And so fundamentally you would argue that
the labs just aren't going to focus on this and haven't.
Exactly.
So I think for most of those companies,
and that's the thing about the long term,
it's going to be incredible research and an incredible product
that meets customers where they are
and works backwards from there.
I don't think the labs will focus
on building that product layer that's so important.
But I think part of the question that you're asking is how and why they haven't
done even the research part to the quality that we've been able to here.
I'm also biased, but we are happily beating them on benchmarks with text-to-speech or speech-to-text
or the orchestration mechanisms.
And here, credit to my co-founder and the team that they've been able to do it.
It's just mighty researchers continuing their work.
But I think the main part that is different in the audio space is that you don't need the scale as much as you need the architectural breakthroughs, the modeling breakthroughs, to really make a dent.
And we've been able to do that a couple of times.
And I think the number of people doesn't matter, but which people you have does.
We think there are maybe 50 to 100 researchers in the audio space that could do it;
we think we have probably 10 of them in the company, some of the best ones.
And I think this obsession of just those people working across the problem,
and the company giving them full focus
to actually work on that
and bring their work to production,
seeing how the users interact back, was so important.
So that's, I think, how we've been able to create models better
than some of the top companies out there.
But, you know, the truth is, to a large extent, why they weren't able to do it
is an interesting unknown. We don't know.
They have such incredible talent there too.
How do you think at the same time about open source models?
Anyone you ask in the company, I think, will say the same, and that's a narrative we think about:
in the long term, models will commoditize, or the differences between them will be negligible.
For some use cases they will still matter, but for most use cases they won't.
And they'll be broadly available.
And they will be broadly available, exactly.
And we don't know when that is,
whether it's two years, three years, four years,
but it's going to happen at some stage.
Then, of course, you will have a fine-tuning layer
that will matter a lot on top of those models,
but the base models, I think, will get pretty good.
And that's why, for us,
the product piece is so important,
from the company perspective
but also from the value perspective.
Because if you have a model, that's great,
but to actually connect your business logic and knowledge,
to be able to have the right interface
for creating an ad for your work
or a completely new material,
that's a very different exercise.
But open source models are getting there.
If I split it into two:
for more of that async content,
narration,
I think with narration, open source is great,
commercial models are great,
and the differences are getting smaller
on the out-of-the-box quality.
What most of the models haven't figured out,
and I think we have,
is how to make them controllable.
So that's kind of the narration piece.
Then there's the whole interaction piece
of how you orchestrate the components together,
whether that's the cascaded speech-to-text,
LLM, text-to-speech approach,
or whether in the future it's a fused approach
where you train them together.
I think this is good for customer support,
customer experience,
but it's still away from a conversation like the one we're having
and passing that Turing test.
So I think this is still about a year away,
within a year.
And then you'll have
real-time dubbing, a kind of variation of real-time translated conversation.
And I think that's maybe more like two years, within two years away.
You know, a belief that I feel comfortable having,
but that I think is uncommon in the market right now, is that actually most advantages in technology
could last you a year or they could last you 10,
but they're not infinitely defensible.
And if you think about that from a model quality perspective or a product
perspective, they allow you to serve the customer better and build momentum and build
scale for some period of time. And actually, that's really powerful over time. But it's not
a clean forever answer. And so I think that makes, I don't know, business people and investors
uncomfortable. And it's very true as well. The way we think about it, research is a head start. It means we can give an advantage to the customer
earlier. And it's six, twelve months of advantage. That is also a way for us to build the right
product layer for you to get the best of that research. Frequently, we do that in parallel. So the moment
the research is out there, you have the product, because we know our initiatives. We know what
the right product is. So we have research and product in parallel, and that extends it. But
the thing that will really give long-term value is the ecosystem that you create around it, whether
that's the brand and distribution, whether that's the collection of voices you can have, the
collection of integrations you can build, the workflows that you can build.
And I think that's the way we sequence it in our mind: the research, product, and
ecosystem that we build.
And research, all it is, is a head start, being able to pull the future
a little bit closer.
I think that's a really powerful insight, especially if, you know, the research team
and the company believe that as well internally.
I think the piece that was interesting for us, and I think this is
the big question for all companies
doing research and product,
is, do you wait for the research,
or do you make a product change?
Or even for companies that aren't research-product companies:
do you wait for someone else to do the research?
Because the timeline for that isn't clear.
Is it three months, six months, 12 months?
You don't know exactly what it will do,
which is the hard choice of, do I invest
in the product layer, or do I just wait more for the research?
So in our case, we internally let all the product teams know
the research initiatives so we can parallelize that work,
but we don't hold them back: if a product team thinks
we should deliver value to the customer
by doing something different, they can.
And the rough rule of thumb is three months.
If we think the research is going to take longer than three months,
we would probably build it.
If it's less than that, we probably won't.
Can you talk about some of the research that you're doing now
and then how you think about like the cadence of delivery
and what's worth working on?
We have a number of different initiatives now
across the audio space, and there are kind of two big buckets,
and roughly they relate to that creative and agent side.
On the creative side, what this means is text-to-speech models that are controllable;
then the speech-to-text model that transcribes in a highly accurate way,
across low-resource languages as well, so covering almost 100 languages;
then we created a music model, a fully licensed music model.
And as you think about the future, it's how those models will also interact with some of the visual space.
So there's a lot of effort in how you can get the best of audio
and then potentially combine that with existing video that you have
to really have the best delivery.
And then on the agent side,
it's of course how you optimize the real-time speech-to-text
and real-time text-to-speech.
We just released our speech-to-text model,
Scribe v2, which is under 150 milliseconds,
with 93.5% accuracy across the top 30 languages on FLEURS.
And it's only the top 30 here because we serve so many others,
but most people don't.
So it's beating all the models on benchmarks.
But as you think about the future,
it's also the orchestration piece
of how you bring speech-to-text,
LLM, and text-to-speech together. We will be releasing
over the next couple of months
a new orchestration mechanism
that will lower the end-to-end latency,
we think, in a great way.
But the second thing, which is what is so hard,
is it's not only going to allow you to combine those pieces,
but also add the emotional context
of the conversation. So the model can actually
respond, we think,
in a more expressive, better way.
And in the future, something we're investing in,
in parallel, is a speech-to-speech, more fused approach as well.
And of course, depending on the use case,
if you have an enterprise use case that needs reliability,
the cascaded approach is the approach for the next year to...
Has more structure, yeah.
More structure, you have more visibility
into each of the steps.
It's reliable.
You can call tools.
If you want something more expressive
and it can hallucinate, speech-to-speech might be the choice,
and maybe over time you'll see one
win over the other depending on the industry.
But that's a huge investment on our side,
and the foundation of the whole platform,
the main part that we are continually investing in,
is a platform of different models
that combine the best of audio
with some of the best of the other modalities together.
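To make the emotional-context point above concrete, here is a minimal sketch of an orchestration step that threads a "how it was said" label from the transcription through to both the response and the synthesis. All names are invented placeholders, not real ElevenLabs APIs.

```python
# Sketch: the orchestrator passes emotional context alongside the text,
# so both the LLM and TTS steps can condition on it. Names are hypothetical.

def transcribe_with_emotion(audio: bytes) -> tuple[str, str]:
    # Placeholder STT: a real model would infer the emotion from the audio.
    return "I've been waiting two weeks for this refund.", "frustrated"

def complete(text: str, emotion: str) -> tuple[str, str]:
    # Placeholder LLM: adapt both wording and delivery style to the caller.
    if emotion == "frustrated":
        return "I'm sorry about the wait. Let me fix this right now.", "calm"
    return "Happy to help with that.", "friendly"

def synthesize(text: str, style: str) -> bytes:
    # Placeholder TTS: style would steer the expressiveness of the voice.
    return f"[{style}] {text}".encode("utf-8")

def expressive_turn(audio: bytes) -> bytes:
    user_text, user_emotion = transcribe_with_emotion(audio)
    reply, style = complete(user_text, user_emotion)
    return synthesize(reply, style)

audio_out = expressive_turn(b"<caller audio>")
```

A fused speech-to-speech model would collapse these stages into one, trading the step-by-step visibility and tool calling Mati describes for more expressiveness.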
I want to take our last few minutes
and ask you a few questions about just the future
that I think you'll have a really good point of view on
since you think about voice and audio all the time.
What do you think of AI companions?
I think they will be a big thing and exist in a big way,
though not something I'm personally excited about
or something that we spend much time on.
But I think the whole line of what's an assistant, a companion, a character that you enjoy
as part of an experience will kind of blur and blend to a large extent.
It will be very common, but you're not, like, enthusiastic personally about it.
I'm more excited about
the Jarvis version of that, more of, I have a super assistant.
Versus the social version. Versus the social version. That, I think, would just be
such an incredible unlock. And it's something blending into the
personal life context. I would love to start the day with someone that understands me and
tells me what's relevant to me and opens the blinds and then tells me about
the weather and the sunshine, and plays music straight away. It's going to happen.
That I'm excited for. I think the companion use cases are often mentioned as solving loneliness
and helping in that part. I think that's one way; maybe there are different ways of engaging
people back. I do think there will be an interesting future. Even if you think about
education, where you will have a superpower with learning from AI tutors, on the
flip side of that, and this is my personal take, you will have
education with a good percent of time spent with AI tutors, but then an
explicit percent of time spent without any technology, human to human, so you can kind of
learn that part too. Yeah, I think this is the correct model, both in terms of emotional guidance
and coaching and, you know, guardrails, right, as well as pure peer time. Exactly. What do you
think about dictation, or what happens in terms of how we control technology that
isn't necessarily personified as well?
Or does it just all become personified?
I think not all of it will be personified.
I think, like, say,
communicating with an oven at home will probably stay pretty static.
Or code, maybe.
Yeah, exactly.
You probably don't need that much additional emotional input there.
But I think there are going to be huge parts where, in a way,
what I hope will happen is you will have the ability to stay more immersed in real
life, with the devices going back into the pocket,
back into some version of an attached element,
assuming that's the right setting.
And that kind of acts on your behalf.
And in many ways, like, let's say, dictation.
As Karpathy says, it's the decade of agents, let's call it a decade.
Then you'll have a decade of robots.
If you are interacting with robots, of course voice will be the input
and the output as one of the key interfaces.
So you will need that dictation as a huge part.
But similarly...
I think the robots are going to be personified.
Yeah, yeah, a hundred percent, a hundred percent.
Yeah, I think most of the use cases will be personified.
Okay, last one.
What's one thing that you've seen already exist today, or that, if you project out a few years,
will change about how we interact with content?
Maybe it's personalized voice content, or just something people are going to do
with AI voice that they don't do today, or that not everybody knows about.
I think the biggest one that hasn't yet kicked into the
system is how education will be done.
I think learning with AI, with voice, where it's
on your headphones or on a speaker, is just going to be such a big thing, where you have
your own teacher on demand who understands you, very personified, and kind of delivers
the right content throughout your life.
I think this will be one of the biggest use cases.
And I don't think it has happened yet.
We see, of course, some of the commercial partners, but for schools, universities,
how that's deployed in a safeguarded way,
in a way that supports the other part of the education,
the social part of education,
I think all of that will evolve.
And maybe there's a cool version of that
where you have Richard Feynman
or Albert Einstein
deliver those lectures,
or other teachers that you love.
It would be sick.
It's a great note to end on.
Thanks for doing this, Mati.
Thanks so much.
Find us on Twitter at @NoPriorsPod.
Subscribe to our YouTube channel
if you want to see our faces.
Follow the show on Apple Podcasts,
Spotify, or wherever you listen.
That way you get a new episode every week.
And sign up for emails or find transcripts for every episode at no-priors.com.
