The Vergecast - Everyone will be able to clone their voice in the future
Episode Date: September 14, 2021
For the next four Tuesdays, Verge senior reporter Ashley Carman will explore how artificial intelligence and machine learning are shaping the future of a variety of industries. In this episode, Ashley talks to AI companies that are working with voice synthesis to see why they are targeting the field of voice talent and podcasting and what cloning your voice can be used for in the future. This podcast was made by producer Liam James, senior audio director Andru Marino, senior reporter James Vincent, and senior reporter Ashley Carman. Learn more about your ad choices. Visit podcastchoices.com/adchoices
Transcript
Support for the show comes from Retool.
Too many companies run critical operations on duct-taped spreadsheets,
Slack workflows, and whatever else they could cobble together.
Not because they want to, but because building internal tools
means weeks of waiting on someone else's backlog.
That's where Retool comes in.
Build custom internal tools just by describing what you need.
Prompt something like,
build me a revenue dashboard on our Salesforce data.
And Retool actually builds it on your company's data,
in your cloud with enterprise security built in.
Go to retool.com slash Vergecast.
We all need to retool how we build software.
Hey, Vergecast listeners, it's Nilay.
For the next four Tuesdays in the Vergecast feed,
we're going to be doing a little mini-series we made
about the different uses of artificial intelligence
and machine learning in a whole variety of contexts.
It's all hosted by Verge Senior Reporter Ashley Carman,
who is here. Hey, Ashley.
Hello.
We're going to get to the first episode
in a second, but first, give us a broad overview. What is this series about? Yeah. So for a while now,
we've, of course, heard lots of hype over AI, what it can do for us, for our work, and that
basically a lot of tech companies are investing in it. So for this series, we really want to see
how AI could actually be implemented in interesting ways, and in industries you wouldn't immediately
think of when you think about artificial intelligence. So we're going to be looking at areas like
audio, video, text, and some other places about how they're currently
using AI and how they might use it in the future.
That sounds wide-ranging.
Where are you starting this week in this episode?
So this week, we are talking about voice clones.
Maybe you've heard a little bit about it, like the Anthony Bourdain stuff.
Yeah.
Also, I have to say that Vergecast producer Andru Marino has made a clone of his own voice,
which is terrifying.
Well, we have some surprises in this episode, too.
There might be more terrifying things than Andru's voice clone.
Did you make a voice clone of yourself?
I don't want to spoil anything.
All right. Well, I'm very excited for this episode.
Here it is, episode one of the Vergecast's AI series.
Roll the tape.
The world today often feels like it's full of digital voices.
With AI assistants like Siri, Amazon Alexa, and Google reading your messages, announcing the weather, and answering trivia.
Here's what I found on the web.
But if you think things are chatty now, just you wait.
The voices of these AI assistants used to be based on real recordings.
Voice actors spent hours talking in a studio,
and these clips would be cut up and rearranged to create synthetic speech.
But increasingly, these voices are being created using artificial intelligence.
This means we can not only create more realistic computer voices,
but clone the voices of real people much more quickly,
creating endless artificial speech at the touch of a button.
For example, it was surprisingly easy to make a synthetic version of my own voice.
In case you missed that, that was not me talking.
That was all made digitally by typing into a computer.
So why would someone want to do this besides the obvious novelty of it?
You might have guessed the reason: to make some money.
Hey listeners, what's going on? Kevin Hart here. And I want to talk to you about why.
Do we have to have mac and cheese every night?
Think about it. That's why I recommend thousands of new shows.
This is a promo from Veritone One, a company that's working on an AI product to create synthetic voices
and make them something the media industry wants to use.
So we've created a platform AI, which at the end of the day turns unstructured data into structured data.
That's Sean King, Executive Vice President at Veritone One.
So if you're thinking about audio, thinking about video, things that are typically unstructured,
and we make that searchable, discoverable through a host of different cognitive engines that are there from transcription,
speaker detection, speaker separation, and then we provide those tools to, you know, many different industries that are needing that.
And where Veritone plans on really making cash with its Marvel AI product is what they can do with audio marketing.
You know, at the end of the day, we're an ad agency and we specialize in audio and influencer media.
And with that, we're able to take these tools and be able to help provide better attribution and better efficacy to the ad performance for these campaigns.
In other words, they are making realistic voice clones for voice actors, podcasters, and other celebrities so they can spend less time in the studio recording, while the
companies that hire them can save more production time and money, especially if they need to
re-record a few sentences in a larger project.
You know, the hardest part about someone's voice and being able to use it and being able to expand
upon that is the individual's time. A person becomes the limiting factor in what we're doing.
The time aspect could make sense for voice artists, especially those who aren't able to do voice
acting full time. For example, you know those station IDs you hear on your local radio station
from famous musicians.
Hi, hi, this is Lionel Richie.
This is Billy Joel.
This is Pink.
Hey, what's up?
This is Justin Timberlake.
Those promos are actually recorded all at once in a studio somewhere and aired on a bunch of radio stations across the country.
Even though they're recording these in batches, it takes a lot of time, time that is not always available for these stars.
So with a company like Veritone, an artist would be able to create a synthetic version of their voice
and use it to create these promos, all without having to go into a recording studio.
We're able to use technology to be able to make that personalization or that localization
and really be able to still use the person's, their trusted voice that the consumer wants
and is accustomed to hearing, but not having to have that person's time to be the limiting factor.
This could also potentially be used for actors who can't use their voice anymore.
Recently, a British company called Sonantic created a model of actor Val Kilmer's voice,
which he lost in 2014 after a tracheotomy due to throat cancer.
But now I can express myself again.
I can bring these dreams to you.
Show you this part of myself once more.
That's definitely impressive.
And for voice actors who already have an archive of recordings of their voice,
it could end up being a career saver.
But we wanted to hear directly from voiceover talent ourselves.
Could voice synthesis create a more fruitful business for them?
And do they want this?
The folks who are consummate professionals do not buy that, and no one is talking about this in an excited way.
We talked to Andia Winslow, a working voice artist and narrator in the VO talent industry.
I do commercial, promos, in-show narration, video games, animation, Voice of God, live events, and audiobooks.
So seven different genres.
Andia says the potential for voice synthesis is talked about a lot in the industry, but she doesn't really see it as useful in her line of work.
Folks who look at voice acting or voiceover narration as a money thing, perhaps they do, but people who look at it as an art, they do not.
When it comes to artistic expression and natural performance, this technology might not be applicable to an industry that relies so deeply on the human element.
For big stuff, things that need breath and life, it's not going to go that way because partly these brands like working with the celebrities they hire, for example.
I can't see this being something like cameo where people, you know, have their voice bank read to adoring fans who pay.
I don't see that happening.
Her take is that voice synthesis would really only work for mass quantity projects.
E-learning, corporate intake, mass transit communication.
That work will probably go ultimately, probably three to 10 years to AI because it's easier to create.
You're not looking for a human element.
You're not looking for humanity in the voice.
You're just looking for the dissemination of information.
And I think those types of jobs, those will be automated and people will lose that market share.
That seems spot on.
It'll likely be cheaper to rent a voice,
especially for industries like transportation or education
that need to update what they're saying regularly.
This is already being used in train systems in Sweden.
But Veritone also makes the case that voice synthesis
could fill in in situations that are too difficult and expensive to do with humans,
like when trying to overcome language barriers.
Right now we're hearing this podcast and everyone's going to hear us in English,
because we're both sitting here and we're both speaking and we're talking about it.
And so the person that's sitting in Italy is going to hear the same thing that we're talking about today.
Well, there is a future here where we can have our voices and the person who downloads in Italy can hear our voices,
but actually hear us speaking Italian and being able to kind of personalize this podcast and localize it specifically to the user.
And if we're able to do that, what does it mean to the success of the podcast globally?
How many new people are we going to be able to engage with as a result of that?
This could also work for TV and movies.
Perhaps you could watch a James Bond movie in Thai
with Daniel Craig speaking Thai in his own voice because of voice synthesis.
Or you could listen to this podcast in Spanish.
If you're living in an area where you have a high degree of people
that are speaking English, that are speaking Spanish, that are speaking Mandarin,
whatever the different ones are, you know, to be able to take an important statement
and something that you want to share in important news
and be able to have that versioning out there
so that it's more inclusive to everyone that's in your community
is another great opportunity.
And I think at the end of the day,
we're just beginning to scratch the surface.
Andia remains unconvinced, though,
and warns that we should remain skeptical of technology
that intends to disrupt an industry.
Is disruption always a good thing?
So, like, automation in the workforce,
does automation in voice acting,
what does it do to the creative and the collaborative process
and, like, the production ecosystem?
What about all my friends who are mixers and producers
and artistic directors and copywriters?
What about them?
There's no clear answers right now.
Everyone is going to have a different opinion.
But that doesn't change the fact
that this technology is being developed
and becoming more widely available.
But how easy is it to create a synthetic voice right now
and how realistic do they actually sound?
We wanted to try this out ourselves with Veritone
and walk you through the whole process with us.
I'm ready to be voice cloned.
First, of course, we need to give verbal consent to Veritone to be able to use my voice to create a synthetic version.
I, Ashley Carman, am aware that recording...
Next, we need to give the AI data.
In this case, that's a bunch of audio of myself speaking, ideally with a consistent audio setup.
Luckily for me, I've been podcasting at The Verge for years, so we submitted a compilation of about 90 minutes' worth of audio from various episodes of my show, Why'd You Push That Button?
If you ask me, how are you doing today?
And I'm just like, I'm good.
That just comes out of my mouth.
I don't know.
Then we sent that over to Veritone.
That is effectively the training data in which we are feeding into the cognitive engines and the neural networks that are then using the sounds, using their utterances to be able to train the model to be able to create those two different modalities of that person's voice.
There are two different methods we can use to control this voice.
One is a text to speech, which you might be familiar with.
It's where someone can type a statement on my behalf and render it
in my voice. What I'm saying next here is totally based off text. What I'm saying here is
totally based off text. How do I sound? Will I be able to use this voice in my podcast from now on?
I could use a vacation. Perhaps I can go to the Galapagos Islands to see the giant tortoises,
or the finches or iguanas. I'm sure I would love that. Some of it sounds like what I would
probably say, but then there's phrases like Galapagos Islands that just throw off the entire
sentence. Galapagos Islands, I do not think this could be used as a podcast narrator. I mean,
it could. It just would not be a very good one. It would not make for a good show. Maybe my synthetic
voice can do the podcast advertisements for me. Veritone also offers a speech-to-speech mode. In this method,
someone can record themselves saying what they would want their talent to say, and the AI would
mimic the ways that person speaks, the pauses, the intonation, the emphasis, sort of like a
computerized voice changer. This theoretically would be a more realistic sounding render
of voice synthesis, whereas the text to speech can sound more monotonous and stale or robotic.
Hey, who are you calling stale? In order to train the AI for speech-to-speech mode, it would need
not only the voice talent recordings, but also recordings of the user who would be speaking
in their voice. We intended to test this technology
for the podcast using my voice and our director Andru's.
But instead, Veritone was only able to demo it with me speaking in the voice of EA Sports
and professional hockey announcer Randy Hahn.
The voices of these AI assistants used to be based on real recordings.
Voice actors spent hours talking in a studio and these clips would be cut up and rearranged
to create synthetic speech.
But increasingly, these voices are being created using artificial intelligence.
This means we can not only
create more realistic computer voices, but clone the voices of real people much more quickly,
creating endless artificial speech at the touch of a button.
This sounds definitely more believable than the text to speech, but it's still not totally right.
So the tech right now probably isn't going to work really well for full advertisements or movie scripts.
Instead, it might be more useful to replace only a few words.
And that's how the company Descript has
been implementing this technology, specifically in the podcast editing space.
We had a pretty basic idea, which was make an audio editor that works like a word processor,
where you can just edit the audio by editing text.
That voice is Andrew Mason, the CEO of Descript, whose audio editing app is available
for anyone on the internet to use right now.
We're trying to create a new kind of audio video editor that not only can live across both
of those mediums, but is also much easier to master.
Descript's editor has a variety of features that help with editing audio in a text-based program.
Once you upload your audio to Descript, the program will automatically transcribe everything that's been said in the recording.
From there, you can start editing.
If you need to take out a bunch of filler words like ums or uhs, or parts of a conversation you want to trim down,
in Descript, you can just highlight the text corresponding to that audio section and delete it.
The same goes for pasting certain sections of audio,
the same way you would in Microsoft Word or Google Docs.
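The text-based editing described above can be sketched roughly as follows. This is a minimal illustration, not Descript's actual implementation: it assumes the transcription step yields a start and end time for every word, and the names `Word`, `keep_ranges`, and `cut_audio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds into the recording
    end: float


def keep_ranges(words, deleted):
    """Return merged (start, end) time ranges for the words that survive the edit.

    `deleted` is a set of indexes into `words` that the user removed in the text.
    """
    ranges = []
    for i, w in enumerate(words):
        if i in deleted:
            continue
        # Extend the previous range when this word starts where the last kept word ended.
        if ranges and ranges[-1][1] == w.start:
            ranges[-1] = (ranges[-1][0], w.end)
        else:
            ranges.append((w.start, w.end))
    return ranges


def cut_audio(samples, sample_rate, ranges):
    """Concatenate only the sample spans that fall inside the kept time ranges."""
    out = []
    for start, end in ranges:
        out.extend(samples[int(start * sample_rate):int(end * sample_rate)])
    return out


# Hypothetical transcript: the filler word "um" (index 1) is deleted in the text.
words = [Word("So", 0.0, 0.5), Word("um", 0.5, 1.0), Word("hello", 1.0, 2.0)]
ranges = keep_ranges(words, deleted={1})
samples = list(range(20))  # stand-in for 2 seconds of audio at 10 samples per second
edited = cut_audio(samples, 10, ranges)
```

Deleting a word in the text simply drops its time span from the list of ranges that get stitched back together, which is why the edit feels like working in a word processor.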
And then there's the feature called Overdub.
Overdub completes the equation by also letting people type.
And what I mean by that is you can not only delete words in Descript and have it delete the audio,
you can type words and it will generate audio in your voice.
That's where AI comes in, generating your voice to fill in any gaps.
All you have to do is type.
Like Veritone, Descript requires you,
or the person who is in the audio recording,
to record 30 to 90 minutes of spoken word content
to train the AI to make a synthetic version of the voice
to insert into your project.
For this, I had to read a specific script
from the show Planet Earth.
A hundred years ago, there were one and a half billion people on Earth.
Now, once that audio is uploaded to Descript,
optimized, and active (in our experience,
this only took a couple of hours),
it's ready for you to use.
All right, so let's do a demo of Descript's Overdub feature.
We're going to play a game with you, the audience, to see if you can tell what word we overdubbed in this sentence.
The world today often feels like it's full of digital voices, with AI assistants like Siri, Amazon Alexa, and Google delivering your messages, announcing the weather, and answering trivia.
So which word do you think is synthesized?
Or, better said, which word do you think we just typed into an app and it generated my voice for?
It was the word delivering.
So we're going to play this clip one more time and pay attention to that word.
The world today often feels like it's full of digital voices,
with AI assistants like Siri, Amazon Alexa, and Google delivering your messages,
announcing the weather, and answering trivia.
Now that I pointed it out to you, you probably can hear how this sounds a little funky,
but would you have noticed if I didn't point it out to you?
I'm curious what someone who works and edits audio all day thinks about this.
Like, what could they do to maybe make the sound even more believable?
So we're going to bring in Andru, our podcast director.
Hey, Andru.
Hello.
To hear what he thinks.
What do you think?
So what's impressive about it is the quality of audio that it generates.
It's pretty high fidelity.
And it kind of has to be so it can match with the high quality audio recordings of a podcast.
Otherwise, like from other voice synthesis that we've seen, it's kind of low-fidelity
audio. And if you were to splice that in, it would sound a little off, kind of like a lo-fi
MP3 in a high-fidelity audio recording. This is Ashley speaking solely with Descript's
Overdub feature. Okay, so I guess then would you use this? I would try to use it, but I am still
skeptical of using it. In my scenario, we have a studio, we can go to a studio anytime we want
and re-record something.
But in a situation where someone is not able to go into a studio or they have to pay extra
for a studio time, and this is kind of like the only option they have, I think it would work
pretty well.
So we're just using Descript's app for this.
Like that's just like quote unquote, like raw audio.
We haven't edited that clip.
But I'm wondering if in whatever audio editing app you use, if you would be able to actually
make that sound a little bit better somehow.
If I were using this in my podcast, I would be editing it in another program eventually.
So in my case, listening to it, I would try to massage it a little bit to make it sound a little more smooth and unnoticeable.
But I think in Descript's scenario, they want Descript to be kind of all-in-the-box software, where you wouldn't be exporting this to another software and massaging it.
So it's not all the way there yet, but it's super impressive.
Right now, we aren't seeing a ton of this technology in use today the way we laid out here.
But when we do, it tends to be pretty controversial.
Recently, a documentary about the late TV personality and chef Anthony Bourdain attracted criticism after it was revealed that the film used a synthetic version of Bourdain's voice.
You were successful, and I am successful, and I'm wondering, are you happy?
The director later confirmed that this was made with AI from old recordings of Bourdain
before he died. This brought up a continued discussion around the ethics of voice synthesis and when
it's okay to use. That's why a lot of these companies working in voice synthesis have really tried to
make sure that the person whose voice you're synthesizing knows it's being synthesized and has given
the okay. We've created a pretty bright line on what you're allowed to do using Descript. You can
only copy your own voice. Now, that's mostly just to keep us out of the debate, because the fact
of the matter is that anybody that wants to can go out on the internet and relatively easily
find ways to clone people's voices using other technology out there.
Veritone also stresses consent first before they render anyone's voice. They've even developed
a way to watermark the audio so there's a lesser chance anyone gets fooled or misled. Or if they
are, Veritone can definitively say whether the audio is legitimate or not.
And we partnered with groups like the Open Voice Network, who is part of the Linux
Foundation to help bring better awareness and rules of engagement around synthetic content,
specifically synthetic voice. At the end of the day, the consumer or the end listener shouldn't
feel or be tricked in any way, if that makes sense. But it gets trickier if someone is no longer
around to object to how their voice is being used. We have been approached by many people
working on projects like that. But for us, we need to understand and how we begin to work through
those is who's the executor of that individual's estate? Is it their widow or widower? Is it
their estate manager, the executor of the estate, the legal team? It really, again, comes down to
who has the authority of that person's consent and to give that consent. Veritone and Descript
might have a consent-first approach, but not all companies have to operate like that, especially
as this technology becomes more democratized and affordable. A similar conversation with
consent started recently with TikTok's text-to-speech
feature. How text messages go with my younger brother, yo sis, yo bro. There was a case of Bev
Standing and TikTok, which is owned by ByteDance, as you know, in which an assign bought her bank
of recorded audio that she had made years prior for the Chinese Institute of Acoustics. And then they
repurposed her text-to-speech to the social media app TikTok without any notice, without compensation.
So that was a big surprise. Everyone started sending me these videos going, this is you, this is you.
And that's how I found out.
This case was not about AI, but power dynamics and data rights, a problem that AI could exacerbate.
Once it's easy to make a voice clone of someone, how might that be used in the future?
Could smaller voice stars be forced to sign away their rights to their voice in perpetuity, for example?
These individuals might not have the resources to fight big companies that misuse their voice,
while celebrities would have the agents and lawyers to argue on their behalf.
The union players are going to be protected from this use and misuse,
But folks who are in the margins or not quite ready to or able or disinterested in joining the union,
they're going to face challenges like Bev Standing did for her.
So I think it matters also where you are in your career.
We're left with a lot of questions, though, about how we're going to use this technology.
Maybe someone can approve the use of their voice to be used after they pass,
but who knows what their voice could be used for in the future,
to spread misinformation, endorse a product they went against morally.
And what about the audience?
How and when do we indicate what you're listening to is a synthetic voice and not the real thing.
The field is still new and we're all still figuring it out together.
But at this point, I want to take what we've learned here to discuss it further with a colleague of mine,
James Vincent, the Verge's London-based reporter who writes about artificial intelligence and machine learning,
which includes, of course, voice synthesis.
We're going to take a break, but when we come back, I'll talk to James and we'll chat about the potential of the synthesized voice.
Support for the show comes from Framer.
Framer is an enterprise-grade, no-code website builder,
used by teams at companies like Perplexity and Miro to move faster.
With real-time collaboration and a robust CMS,
with everything you need for great SEO,
not to mention advanced analytics that include integrated A-B testing,
your designers and marketers are empowered to build and maximize your dot-com from day one.
So whether you want to launch a new site,
test a few landing pages, or migrate your full dot-com,
Framer has programs for startups,
scale-ups, and large enterprises to make going from idea to live site as easy and fast as possible.
Learn how you can get more out of your dot com from a Framer specialist
or get started building for free today at framer.com slash verge for 30% off a Framer pro annual plan.
That's Framer.com slash verge for 30% off.
Framer.com slash verge.
Rules and restrictions may apply.
And we're back.
We're here with James Vincent, senior reporter at The Verge, whose specialty is AI and machine learning.
Hello, James.
Hello, Ashley.
How are you doing today?
I'm great.
It's always a treat to see you.
Thank you.
So obviously, you have been reporting on AI and machine learning here at The
Verge for years. I trust you. You're going to give us the real take. You're going to give us
skepticism, that wry British wit. I'm ready for it. Obviously in this episode, we're talking about
voice synthesis. And I wanted to hear just from you, there's a lot of hype around this right now,
specifically because of the Anthony Bourdain documentary. We're hearing a lot about it. People are
writing about it. Do you think this industry is something that we need to be paying attention to?
We've just done a whole podcast episode, so I hope the answer is yes. Or do you think this is kind of
overhyped, maybe something that's not going to play as big of an importance in the world going
forward. So, yeah, I mean, I am not as skeptical as you might be expecting me to be. I genuinely think
the technology is here. I think the technology is impressive. And unlike some applications we see in
AI and machine learning, it's much closer to just being out there. You know, you've been speaking to
Veritone. You know, they have a product. It's being used. It's ready to go. And that's quite unusual sometimes
in AI. What I think is overhyped is, when we think about the potential impact this will have,
I think the reason that the, for example, the Anthony Bourdain documentary caused such a huge
discussion, obviously it's the novelty of it, and it's bringing with it a lot of ethical
questions that we've not dealt with. But I think once those have gone past, the actual impact
on the industry will be smaller than we're currently thinking now. But, as I say, the technology
it's here. It's very exciting. I'm super into it. Yeah. And I mean, obviously, whenever
we talk a lot about new technology, again, here at The Verge, we tend to sort of look at the potential
future misuse of it. So having kind of your ear to the ground on the reporting here,
do you think there's enough discussion going on around ways this technology could be misused
and enough forethought going into how to prevent that? I don't know in terms of forethought that's
difficult to say. So one of the big uses for this is going to be fraud. We've already had
reported accounts, only a couple, but they've been trickling out about fraud cases to do with
banks, to do with financial transfers, where someone has created an AI fake of a CEO's voice and
said, yes, I authorise you to send me, you know, 300,000 euros over the wire. And they,
they just believed that and it just happened. And that's wild to me. But I don't think that
necessarily creates a completely new threat model. If you found someone who could do a good
impersonation of your CEO or you convinced them that you were speaking over a crackly phone line and
that's why they sounded weird. That's just social engineering. That happens a lot anyway. So I don't
think this makes a completely new threat out there in the world, but it will make the access to
that sort of attack much easier. And I know, for example, it's something that's a huge problem in the
US with spam calls. And if you start getting, you know, if your parents start getting spam calls,
which sound, I don't know, a little like their daughter or their son.
That's going to be super freaky.
And that's something that could really plausibly happen.
So I think it's something that people need to be aware of.
Yeah, that was my parents' first reaction.
When I played them my synthetic voices, they were freaked out.
Because it's happened to my grandfather where he got, like, a scary phone call from someone crying, claiming to be my brother.
And he was like, asking for money.
That's crazy.
Imagine that was actually my brother's voice.
I mean, it already pretty much duped my grandpa, like he called my mom.
But, like, still, if it had actually been my brother's voice or my voice, that's terrifying.
Yeah.
And it's going to be one of those things where we start to rethink what information about us is available online.
So I think it's something over the past couple of years.
We're now all quite aware of the fact that, hey, if you're on Facebook and you've got a lot of public photos of you about, then someone could use those for mischief.
They could create a fake account that pretends to be you.
And I think in the future, we're going to now start thinking, oh, is there quite a bit of
audio of me online that someone could use to create a fake? And now for most of the listeners,
that's probably not going to be a huge problem. For you, Ashley Carman, host of popular
podcast on The Verge, that's actually a huge problem. I mean, are you, does it worry you?
Yes, it does. I'm sorry. I already am anxiety prone, so it's not exactly ideal. But, no,
I do think about that because obviously we have tons of videos of us at the Verge.
Like if you wanted to clone me in any way whatsoever, the data is ripe for the taking.
So enjoy.
Right.
And I think that creates this new level of threat for people who have perhaps, let's say, a semi-public profile.
I don't know how you'd categorize yourself in that.
But obviously, you know, I think as journalists, we do have that.
We're not famous, obviously, but we have information about us that's out there in a way that it isn't for everyone.
And I think it does create a new threat for that sort of individual.
And obviously it's not just journalists.
It's say you're a company CEO.
You know, that's the fraud example I use.
If you have an earnings call, then there's lots of audio of you out there.
Every time you've done an earnings call,
there's going to be recordings of that accessible online transcripts.
And someone can scrape that data very easily and turn that into a new type of attack.
Whether it's something that's being talked about enough, I don't know.
But I think it's like it's one of those problems that as soon as we start seeing more cases of it
that get public discussion like the Anthony Bourdain
thing, then we're going to start seeing reactions to this from these companies.
That's why I think it's great to be talking about this stuff now, because the more people
know about it, the less of a threat it is.
I'm curious also about the economics of this, like whether this will be something that
will be democratized for everybody, that like anyone and their mom, if they are willing to
record 90 minutes of audio, could theoretically make a voice clone of themselves, or if this
is going to kind of stay at the higher end of cost, where
you have to kind of be willing to dedicate the time and money and also maybe just use it for economic gains,
like for advertisement reading like we've talked about.
I think it'll really come down in terms of cost and training data needed.
I know, you know, you've had to record 60 minutes, 90 minutes of audio to get your personal clones,
but I think that'll come down and we'll probably see it with 10 minutes of audio or something like that.
Oh, actually, you just have a phone call and it gets enough.
I think that'll definitely happen.
I don't think it'll be economically useful to everyone,
but I think it will be interesting and fun.
You know, creating a voice clone that, for example, you know, say you're playing a video game
and you design your character at the beginning of it and maybe you make them look like you
and maybe you record five minutes of audio to make it sound like you.
So when you're out in the video game world, your character speaks with your voice.
And I feel applications like that could become quite common and quite accessible, you know,
within five, ten years or something like that.
But, you know, it'll take that long to trickle in.
I think that's a very plausible time frame. But another one would be a sort of story time app for children
in which a parent would create a voice clone of themselves. And then they could feed that into a
little box, a little app that then reads all their child's favorite stories in their voice.
So if they are traveling in another country, if they are unavailable in some way, then they will
still be able to speak to their child. And you could actually have that with a sort of voice assistant
where your child gets to speak to a voice assistant that sounds like mum and dad. And that won't
be everyone's cup of tea. Definitely not, but I can imagine some people who would like that. Yeah,
so I think there are lots of these little use cases that will come out there. I think the big
impact is going to be in the world of celebrity and in the world of media and entertainment,
which you've obviously discussed already. But I think, as ever with this tech, there are always
unexpected ways that it shows up in the world. And I'm really, I'm really interested to see how
this one does show up. It's interesting to me because in this episode we talk about how some of these
celebrities or voiceover talent are going to have to protect themselves just through contracts.
Like they have lawyers, agents, everybody who's willing to kind of make sure that their voice is
protected. And I'm just wondering if like the practical person might have to start thinking about
these things like when they die. If in their will they're going to be like the rights to my voice,
not that they're famous or anything, but just purely like, you know, you can leave your Facebook
page to your family. I'm going to leave my voice recordings to my family to not do what they wish
with it. Absolutely. I mean, I feel there's
already companies that, as you say, that look after your digital assets. And I think the voice
will be added to that pile. And there may be some people who are quite happy with saying,
you know what, I'm not going to get to speak to my great grandkids, but
my children might want to create a voice clone of me so they can speak to their great
grandpa. But there might be some people who are uncomfortable with that. And yeah, they want to
include that sort of provision in their will that they are not to be reanimated using this AI
technology. I can absolutely see that happening.
So I've obviously made some voice clones of myself.
Are you thinking about, do you want to make your own voice clone?
Have you done it already?
I very much want to make my own voice clone.
I'm actually in the process of making one now.
And then I can have conversations with myself every day of the week.
Because actually my concern here is, I'm like, no diss to Veritone.
But I'm like, okay, after this episode's done, I got to email them and tell them to delete my data ASAP.
So that doesn't concern you as much.
Okay, now you've said it.
Yeah, it really concerns me.
You know, yeah, look, I would like a clone, and I would keep it in a little box on my desktop,
and no one would ever be able to use it but me.
But you're totally right.
Do I want it to be out there?
Would I be comfortable if there was a website that said anyone can talk like James Vincent?
God knows why they'd want to.
But anyone can do it, and they can type in it and make me say whatever.
That would make me hugely uncomfortable.
No, I would, thank you.
I'm going to go delete all that data.
Exactly.
It's the internet-connected part of it.
I think that's a little scarier.
Like, again, if you just had this little effect in some Adobe program where you could just be like,
today I'm going to turn on my voice.
Okay, that's sort of fun.
But the idea that it might have to live elsewhere is the scary part, I guess.
Yeah.
It's when it's outside your control, that's when it becomes a threat.
Right.
Well, what a lovely positive note to end on.
No, but this has been a great discussion.
I really, really appreciate you coming on.
And everyone who's listening to this episode is going to hear you in our future episodes.
So hopefully they tune in to hear more of your thoughts.
Thank you so much for helping out and being on and giving us, like I said, the real take.
No problem, Ashley.
Absolute pleasure to talk.
And I look forward to speaking again in the future.
Or me or my clone, who knows.
Thanks for listening to the first episode of the Vergecast AI series.
This podcast was made by producer Liam James, senior audio director Andru Marino,
senior reporter James Vincent, and me, senior reporter Ashley Carman.
See you next week.
