The Vergecast - Everyone will be able to clone their voice in the future

Starting point is 00:00:00 Support for the show comes from Retool. Too many companies run critical operations on duct-taped spreadsheets, Slack workflows, and whatever else they could cobble together. Not because they want to, but because building internal tools means weeks of waiting on someone else's backlog. That's where Retool comes in. Build custom internal tools just by describing what you need. Prompt something like,

Starting point is 00:00:22 Build me a revenue dashboard on our Salesforce data. And Retool actually builds it on your company's data, in your cloud with enterprise security built in. Go to retool.com slash Vergecast. We all need to retool how we build software. Hey, Vergecast listeners, it's Nilay. For the next four Tuesdays in the Vergecast feed, we're going to be doing a little mini-series we made

Starting point is 00:00:48 about the different uses of artificial intelligence and machine learning in a whole variety of contexts. It's all hosted by Verge Senior Reporter Ashley Carman, who is here. Hey, Ashley. Hello. We're going to get to the first episode in a second, but first, give us a broad overview. What is this series about? Yeah. So for a while now, we've, of course, heard lots of hype over AI, what it can do for us, for our work, and that

Starting point is 00:01:11 basically a lot of tech companies are investing in it. So for this series, we really want to see how AI could actually be implemented in interesting ways, and in industries you wouldn't immediately think of when you think about artificial intelligence. So we're going to be looking at areas like audio, video, text, and some other places about how they're currently using AI and how they might use it in the future. That sounds wide-ranging. Where are you starting this week in this episode? So this week, we are talking about voice clones.

Starting point is 00:01:41 Maybe you've heard a little bit about it, like the Anthony Bourdain stuff. Yeah. Also, I have to say that Vergecast producer Andrew Marino has made a clone of his own voice, which is terrifying. Well, we have some surprises in this episode, too. There might be more terrifying things than Andrew's voice clone. Did you make a voice clone of yourself? I don't want to spoil anything.

Starting point is 00:02:02 All right. Well, I'm very excited for this episode. Here it is, episode one of the Vergecast's AI series. Roll the tape. The world today often feels like it's full of digital voices, with AI assistants like Siri, Amazon Alexa, and Google reading your messages, announcing the weather, and answering trivia. Here's what I found on the web. But if you think things are chatty now, just you wait. The voices of these AI assistants used to be based on real recordings.

Starting point is 00:02:31 Voice actors spent hours talking in a studio, and these clips would be cut up and rearranged to create synthetic speech. But increasingly, these voices are being created using artificial intelligence. This means we can not only create more realistic computer voices, but clone the voices of real people much more quickly, creating endless artificial speech at the touch of a button. For example, it was surprisingly easy to make a synthetic version of my own voice. In case you missed that, that was not a single.

Starting point is 00:03:03 me talking that was all made digitally by typing into a computer. So why would someone want to do this besides the obvious novelty of it? You might have guessed a reason: to make some money. Hey listeners, what's going on? Kevin Hart here. And I want to talk to you about why. Do we have to have mac and cheese every night? Think about it. That's why I recommend thousands of new shows. This is a promo from Veritone One, a company that's working on an AI product to create synthetic voices and make them something the media industry wants to use.

Starting point is 00:03:35 So we've created a platform AI, which at the end of the day turns unstructured data into structured data. That's Sean King, Executive Vice President at Veritone One. So if you're thinking about audio, thinking about video, things that are typically unstructured, and we make that searchable, discoverable through a host of different cognitive engines that are there from transcription, speaker detection, speaker separation, and then we provide those tools to, you know, many different industries that are needing that. And where Veritone plans on really making cash with its Marvel.ai product is what they can do with audio marketing. You know, at the end of the day, we're an ad agency and we specialize in audio and influencer media. And with that, we're able to take these tools and be able to help provide better attribution and better efficacy to the ad performance for these campaigns.

Starting point is 00:04:22 In other words, they are making realistic voice clones for voice actors, podcasters, and other celebrities so they can spend less time in the studio recording, while the companies that hire them can save more production time and money, especially if they need to re-record a few sentences in a larger project. You know, the hardest part about someone's voice and being able to use it and being able to expand upon that is the individual's time. A person becomes the limiting factor in what we're doing. The time aspect could make sense for voice artists, especially those who aren't able to do voice acting full time. For example, you know those station IDs you hear on your local radio station from famous musicians?

Starting point is 00:05:01 Hi, hi, this is Lionel Richie. This is Billy Joel. This is Pink. Hey, what's up? This is Justin Timberlake. Those promos are actually recorded all at once in a studio somewhere and aired on a bunch of radio stations across the country. Even though they're recording these in batches, it takes a lot of time, time that is not always available for these stars. So with a company like Veritone, an artist would be able to create a synthetic version of their voice

Starting point is 00:05:24 and use it to create these promos, all without having to go into a recording studio. We're able to use technology to be able to make that personalization or that localization and really be able to still use the person's trusted voice that the consumer wants and is accustomed to hearing, but not having to have that person's time be the limiting factor. This could also potentially be used for actors who can't use their voice anymore. Recently, a British company called Sonantic created a model of actor Val Kilmer's voice, which he lost in 2014 after a tracheotomy due to throat cancer. But now I can express myself again.

Starting point is 00:06:03 I can bring these dreams to you. Show you this part of myself once more. That's definitely impressive. And for voice actors who already have an archive of recordings of their voice, it could end up being a career saver. But we wanted to hear directly from voiceover talent ourselves. Could voice synthesis create a more fruitful business for them? And do they want this?

Starting point is 00:06:25 The folks who are consummate professionals do not buy that, and no one is talking about this in an excited way. We talked to Andia Winslow, a working voice artist and narrator in the VO talent industry. I do commercial, promos, in-show narration, video games, animation, Voice of God, live events, and audiobooks. So seven different genres. Andia says the potential for voice synthesis is talked about a lot in the industry, but she doesn't really see it as useful in her line of work. Folks who look at voice acting or voiceover narration as a money thing, perhaps they do, but people who look at it as an art, they do not. When it comes to artistic expression and natural performance, this technology might not be applicable to an industry that relies so deeply on the human element. For big stuff, things that need breath and life, it's not going to go that way because partly these brands like working with the celebrities they hire, for example.

Starting point is 00:07:15 I can't see this being something like Cameo where people, you know, have their voice bank read to adoring fans who pay. I don't see that happening. Her take is that voice synthesis would really only work for mass quantity projects. E-learning, corporate intake, mass transit communication. That work will probably go ultimately, probably three to 10 years, to AI because it's easier to create. You're not looking for a human element. You're not looking for humanity in the voice. You're just looking for the dissemination of information.

Starting point is 00:07:42 And I think those types of jobs, those will be automated and people will lose that market share. That seems spot on. It'll likely be cheaper to rent a voice, especially for industries like transportation or education that need to update what they're saying regularly. This is already being used in train systems in Sweden. But Veritone also makes the case that voice synthesis could fill in in situations that are too difficult and expensive to do with humans,

Starting point is 00:08:14 like when trying to overcome language barriers. Right now we're hearing this podcast and everyone's going to hear us in English, because we're both sitting here and we're both speaking and we're talking about it. And so the person that's sitting in Italy is going to hear the same thing that we're talking about today. Well, there is a future here where we can have our voices and the person who downloads in Italy can hear our voices, but actually hear us speaking Italian and being able to kind of personalize this podcast and localize it specifically to the user. And if we're able to do that, what does it mean to the success of the podcast globally? How many new people are we going to be able to engage with as a result of that?

Starting point is 00:08:50 This could also work for TV and movies. Perhaps you could watch a James Bond movie in Thai with Daniel Craig speaking Thai in his own voice because of voice synthesis. Or you could listen to this podcast in Spanish. If you're living in an area where you have a high degree of people that are speaking English, that are speaking Spanish, that are speaking Mandarin, whatever the different ones are, you know, to be able to take an important statement and something that you want to share in important news

Starting point is 00:09:21 and be able to have that versioning out there so that it's more inclusive to everyone that's in your community is another great opportunity. And I think at the end of the day, we're just beginning to scratch the surface. Andia remains unconvinced, though, and warns that we should remain skeptical of technology that intends to disrupt an industry.

Starting point is 00:09:42 Is disruption always a good thing? So, like, automation in the workforce, does automation in voice acting, what does it do to the creative and the collaborative process, and, like, the production ecosystem? What about all my friends who are mixers and producers and artistic directors and copywriters? What about them?

Starting point is 00:09:59 There's no clear answers right now. Everyone is going to have a different opinion. But that doesn't change the fact that this technology is being developed and becoming more widely available. But how easy is it to create a synthetic voice right now and how realistic do they actually sound? We wanted to try this out ourselves with Veritone

Starting point is 00:10:16 and walk you through the whole process with us. I'm ready to be voice cloned. First, of course, we need to give verbal consent to Veritone to be able to use my voice to create a synthetic version. I, Ashley Carman, am aware that recording... Next, we need to give the AI data. In this case, that's a bunch of audio of myself speaking, ideally with a consistent audio setup. Luckily for me, I've been podcasting at The Verge for years, so we submitted a compilation of about 90 minutes worth of audio from various episodes of my show, Why'd You Push That Button? If you ask me, how are you doing today?

Starting point is 00:10:50 And I'm just like, I'm good. That just comes out of my mouth. I don't know. Then we sent that over to Veritone. That is effectively the training data which we are feeding into the cognitive engines and the neural networks that are then using the sounds, using their utterances to be able to train the model to be able to create those two different modalities of that person's voice. There are two different methods we can use to control this voice. One is text-to-speech, which you might be familiar with. It's where someone can type a statement on my behalf and render it

Starting point is 00:11:20 in my voice. What I'm saying next here is totally based off text. What I'm saying here is totally based off text. How do I sound? Will I be able to use this voice in my podcast from now on? I could use a vacation. Perhaps I can go to the Galapagos Islands to see the giant tortoises, or the finches or iguanas. I'm sure I would love that. Some of it sounds like what I would probably say, but then there's phrases like Galapagos Islands that just throw off the entire sentence. Galapagos Islands. I do not think this could be used as a podcast narrator. I mean, it could. It just would not be a very good one. It would not make for a good show. Maybe my synthetic voice can do the podcast advertisements for me. Veritone also offers a speech-to-speech mode. In this method,

Starting point is 00:12:12 someone can record themselves saying what they would want their talent to say, and the AI would mimic the ways that person speaks, the pauses, the intonation, the emphasis, sort of like a computerized voice changer. This theoretically would be a more realistic sounding render of voice synthesis, whereas the text-to-speech can sound more monotonous and stale or robotic. Hey, who are you calling stale? In order to train the AI for speech-to-speech mode, it would need not only the voice talent recordings, but also recordings of the user who would be speaking in their voice. We intended to test this technology for the podcast using my voice and our director Andrew's.

Starting point is 00:12:50 But instead, Veritone was only able to demo it with me speaking in the voice of EA Sports and professional hockey announcer Randy Hahn. The voices of these AI assistants used to be based on real recordings. Voice actors spent hours talking in a studio and these clips would be cut up and rearranged to create synthetic speech. But increasingly, these voices are being created using artificial intelligence. This means we can not only create more realistic computer voices, but clone the voices of real people much more quickly,

Starting point is 00:13:22 creating endless artificial speech at the touch of a button. This sounds definitely more believable than the text-to-speech, but it's still not totally right. So the tech right now probably isn't going to work really well for full advertisements or movie scripts. Instead, it might be more useful to replace only a few words. And that's how the company Descript has been implementing this technology, specifically in the podcast editing space. We had a pretty basic idea, which was make an audio editor that works like a word processor, where you can just edit the audio by editing text.

Starting point is 00:14:01 That voice is Andrew Mason, the CEO of Descript, whose audio editing app is available for anyone on the internet to use right now. We're trying to create a new kind of audio video editor that not only can live across both of those mediums, but is also much easier to master. Descript's editor has a variety of features that help with editing audio in a text-based program. Once you upload your audio to Descript, the program will automatically transcribe everything that's been said in the recording. From there, you can start editing. If you need to take out a bunch of filler words like ums or uhs, or parts of a conversation you want to trim down,

Starting point is 00:14:39 in Descript, you can just highlight the text corresponding to that audio section and delete it. The same goes for pasting certain sections of audio, the same way you would in Microsoft Word or Google Docs. And then there's the feature called Overdub. Overdub completes the equation by also letting people type. And what I mean by that is you can not only delete words in Descript and have it delete the audio, you can type words and it will generate audio in your voice. That's where AI comes in, generating your voice to fill in any gaps.

Starting point is 00:15:12 All you have to do is type. Like Veritone, Descript requires you, or the person who is in the audio recording, to record 30 to 90 minutes of spoken word content to train the AI to make a synthetic version of the voice to insert into your project. For this, I had to read a specific script from the show Planet Earth.

Starting point is 00:15:32 A hundred years ago, there were one and a half billion people on Earth. Now, once that audio is uploaded to Descript, optimized, and active (in our experience, this only took a couple of hours), it's ready for you to use. All right, so let's do a demo of Descript's Overdub feature. We're going to play a game with you, the audience, to see if you can tell what word we overdubbed in this sentence. The world today often feels like it's full of digital voices, with AI assistants like Siri, Amazon Alexa, and Google delivering your messages, announcing the weather, and answering trivia.

Starting point is 00:16:10 So which word do you think is synthesized? Or, better said, which word do you think we just typed into an app and it generated my voice for? It was the word delivering. So we're going to play this clip one more time and pay attention to that word. The world today often feels like it's full of digital voices, with AI assistants like Siri, Amazon Alexa, and Google delivering your messages, announcing the weather, and answering trivia. Now that I pointed it out to you, you probably can hear how this sounds a little funky,

Starting point is 00:16:39 but would you have noticed if I didn't point it out to you? I'm curious what someone who works and edits audio all day thinks about this. Like, what could they do to maybe make the sound even more believable? So we're going to bring in Andrew, our podcast director. Hey, Andrew. Hello. To hear what he thinks. What do you think?

Starting point is 00:16:57 So what's impressive about it is the quality of audio that it generates. It's pretty high fidelity. And it kind of has to be so it can match with the high quality audio recordings of a podcast. Otherwise, like from other voice synthesis that we've seen, it's kind of low-fidelity audio. And if you were to splice that in, it would sound a little off, kind of like a lo-fi MP3 in a high-fidelity audio recording. This is Ashley speaking solely with Descript's Overdub feature. Okay, so I guess then would you use this? I would try to use it, but I am still skeptical of using it. In my scenario, we have a studio, we can go to a studio anytime we want

Starting point is 00:17:42 and re-record something. But in a situation where someone is not able to go into a studio or they have to pay extra for studio time, and this is kind of like the only option they have, I think it would work pretty well. So we're just using Descript's app for this. Like that's just like quote unquote, like raw audio. We haven't edited that clip. But I'm wondering if in whatever audio editing app you use, if you would be able to actually

Starting point is 00:18:07 make that sound a little bit better somehow. If I were using this in my podcast, I would be editing it in another program eventually. So in my case, listening to it, I would try to massage it a little bit to make it sound a little more smooth and unnoticeable. But I think in Descript's scenario, they want Descript to be kind of all-in-the-box software, so that you wouldn't be exporting this to another software and massaging it. So it's not all the way there yet, but it's super impressive. Right now, we aren't seeing a ton of this technology in use today the way we laid out here. But when we do, it tends to be pretty controversial. Recently, a documentary about the late TV personality and chef Anthony Bourdain attracted criticism after it was revealed that the film used a synthetic version of Bourdain's voice.

Starting point is 00:19:06 You were successful, and I am successful, and I'm wondering, are you happy? The director later confirmed that this was made with AI from old recordings of Bourdain before he died. This brought up a continued discussion around the ethics of voice synthesis and when it's okay to use. That's why a lot of these companies working in voice synthesis have really tried to make sure that the person whose voice you're synthesizing knows it's being synthesized and has given the okay. We've created a pretty bright line on what you're allowed to do using Descript. You can only copy your own voice. Now, that's mostly just to keep us out of the debate, because the fact of the matter is that anybody that wants to can go out on the internet and relatively easily

Starting point is 00:19:52 find ways to clone people's voices using other technology out there. Veritone also stresses consent first before they render anyone's voice. They've even developed a way to watermark the audio so there's a lesser chance anyone gets fooled or misled. Or if they are, Veritone can definitively say whether the audio is legitimate or not. And we partnered with groups like the Open Voice Network, who is part of the Linux Foundation, to help bring better awareness and rules of engagement around synthetic content, specifically synthetic voice. At the end of the day, the consumer or the end listener shouldn't feel or be tricked in any way, if that makes sense. But it gets trickier if someone is no longer

Starting point is 00:20:35 around to object to how their voice is being used. We have been approached by many people working on projects like that. But for us, what we need to understand, and how we begin to work through those, is who's the executor of that individual's estate? Is it their widow or widower? Is it their estate manager, the executor of the estate, the legal team? It really, again, comes down to who has the authority of that person's consent and to give that consent. Veritone and Descript might have a consent-first approach, but not all companies have to operate like that, especially as this technology becomes more democratized and affordable. A similar conversation about consent started recently with TikTok's text-to-speech

Starting point is 00:21:16 feature. How text messages go with my younger brother, yo sis, yo bro. There was a case of Bev Standing and TikTok, which is owned by ByteDance, as you know, in which an assign bought her bank of recorded audio that she had made years prior for the Chinese Institute of Acoustics. And then they repurposed her text-to-speech to the social media app TikTok without any notice, without compensation. So that was a big surprise. Everyone started sending me these videos going, this is you, this is you. And that's how I found out. This was not about AI, but power dynamics and data rights, a problem that AI could exacerbate. Once it's easy to make a voice clone of someone, how might that be used in the future?

Starting point is 00:21:57 Could smaller voice stars be forced to sign away their rights to their voice in perpetuity, for example? These individuals might not have the resources to fight big companies that misuse their voice, while celebrities would have the agents and lawyers to argue on their behalf. The union players are going to be protected from this use and misuse, But folks who are in the margins or not quite ready to or able or disinterested in joining the union, they're going to face challenges like Bev Standing did for her. So I think it matters also where you are in your career. We're left with a lot of questions, though, about how we're going to use this technology.

Starting point is 00:22:32 Maybe someone can approve the use of their voice to be used after they pass, but who knows what their voice could be used for in the future, to spread misinformation, endorse a product they went against morally. And what about the audience? How and when do we indicate what you're listening to is a synthetic voice and not the real thing. The field is still new and we're all still figuring it out together. But at this point, I want to take what we've learned here to discuss it further with a colleague of mine, James Vincent, the Verge's London-based reporter who writes about artificial intelligence and machine learning,

Starting point is 00:23:07 which includes, of course, voice synthesis. We're going to take a break, but when we come back, I'll talk to James and we'll chat about the potential of the synthesized voice. Support for the show comes from Framer. Framer is an enterprise-grade, no-code website builder, used by teams at companies like Perplexity and Miro to move faster. With real-time collaboration and a robust CMS, with everything you need for great SEO, not to mention advanced analytics that include integrated A-B testing,

Starting point is 00:23:45 your designers and marketers are empowered to build and maximize your dot-com from day one. So whether you want to launch a new site, test a few landing pages, or migrate your full dot-com, Framer has programs for startups, scale-ups, and large enterprises to make going from idea to live site as easy and fast as possible. Learn how you can get more out of your dot-com from a Framer specialist or get started building for free today at framer.com slash verge for 30% off a Framer Pro annual plan. That's framer.com slash verge for 30% off.

Starting point is 00:24:25 framer.com slash verge. Rules and restrictions may apply. And we're back. We're here with James Vincent, senior reporter at The Verge, whose specialty is AI and machine learning. Hello, James. Hello, Ashley. How are you doing today? I'm great.

Starting point is 00:24:54 It's always a treat to see you. Thank you. So obviously, you have been reporting on AI and machine learning here at The Verge for years. I trust you. You're going to give us the real take. You're going to give us skepticism, that wry British wit. I'm ready for it. Obviously in this episode, we're talking about voice synthesis. And I wanted to hear just from you, there's a lot of hype around this right now, specifically because of the Anthony Bourdain documentary. We're hearing a lot about it. People are writing about it. Do you think this industry is something that we need to be paying attention to?

Starting point is 00:25:25 We've just done a whole podcast episode, so I hope the answer is yes. Or do you think this is kind of overhyped, maybe something that's not going to play as big of a role in the world going forward? So, yeah, I mean, I am not as skeptical as you might be expecting me to be. I genuinely think the technology is here. I think the technology is impressive. And unlike some applications we see in AI and machine learning, it's much closer to just being out there. You know, you've been speaking to Veritone. You know, they have a product. It's being used. It's ready to go. And that's quite unusual sometimes in AI. What I think is overhyped is, when we think about the potential impact this will have, I think the reason that the, for example, the Anthony Bourdain documentary caused such a huge

Starting point is 00:26:09 discussion, obviously it's the novelty of it, and it's bringing with it a lot of ethical questions that we've not dealt with. But I think once those have gone past, the actual impact on the industry will be smaller than we're currently thinking now. But, as I say, the technology, it's here. It's very exciting. I'm super into it. Yeah. And I mean, obviously, whenever we talk a lot about new technology, again, here at The Verge, we tend to sort of look at the potential future misuse of it. So having kind of your ear to the ground on the reporting here, do you think there's enough discussion going on around ways this technology could be misused and enough forethought going into how to prevent that? I don't know in terms of forethought, that's

Starting point is 00:26:49 difficult to say. So one of the big uses for this is going to be fraud. We've already had reported accounts, only a couple, but they've been trickling out about fraud cases to do with banks, to do with financial transfers, where someone has created an AI fake of a CEO's voice and said, yes, I authorise you to send me, you know, 300,000 euros over the wire. And they, they just believed that and it just happened. And that's wild to me. But I don't think that necessarily creates a completely new threat model. If you found someone who could do a good impersonation of your CEO or you convinced them that you were speaking over a crackly phone line and that's why they sounded weird. That's just social engineering. That happens a lot anyway. So I don't

Starting point is 00:27:29 think this makes a completely new threat out there in the world, but it will make the access to that sort of attack much easier. And I know, for example, it's something that's a huge problem in the US with spam calls. And if you start getting, you know, if your parents start getting spam calls, which sound, I don't know, a little like their daughter or their son. That's going to be super freaky. And that's something that could really plausibly happen. So I think it's something that people need to be aware of. Yeah, that was my parents' first reaction.

Starting point is 00:27:58 When I played them my synthetic voices, they were freaked out. Because it's happened to my grandfather where he got, like, a scary phone call from someone crying, claiming to be my brother. And he was, like, asking for money. That's crazy. Imagine that was actually my brother's voice. I mean, it already pretty much duped my grandpa, like he called my mom. But, like, still, if it had actually been my brother's voice or my voice, that's terrifying. Yeah.

Starting point is 00:28:20 And it's going to be one of those things where we start to rethink what information about us is available online. So I think it's something where, over the past couple of years, we're now all quite aware of the fact that, hey, if you're on Facebook and you've got a lot of public photos of you about, then someone could use those for mischief. They could create a fake account that pretends to be you. And I think in the future, we're going to now start thinking, oh, is there quite a bit of audio of me online that someone could use to create a fake? And now for most of the listeners, that's probably not going to be a huge problem. For you, Ashley Carman, host of popular podcasts on The Verge, that's actually a huge problem. I mean, are you, does it worry you?

Starting point is 00:28:59 Yes, it does. I'm sorry. I already am anxiety prone, so it's not exactly ideal. But, no, I do think about that because obviously we have tons of videos of us at The Verge. Like if you wanted to clone me in any way whatsoever, the data is ripe for the taking. So enjoy. Right. And I think that creates this new level of threat for people who have perhaps, let's say, a semi-public profile. I don't know how you'd categorize yourself in that. But obviously, you know, I think as journalists, we do have that.

Starting point is 00:29:30 We're not famous, obviously, but we have information about us that's out there in a way that it isn't for everyone. And I think it does create a new threat for that sort of individual. And obviously it's not just journalists. It's say you're a company CEO. You know, that's the fraud example I use. If you have an earnings call, then there's lots of audio of you out there. Every time you've done an earnings call, there's going to be recordings of that accessible online transcripts.

Starting point is 00:29:54 And someone can scrape that data very easily and turn that into a new type of attack. Whether it's something that's being talked about enough, I don't know. But I think it's like it's one of those problems that as soon as we start seeing more cases of it that get public discussion like the Anthony Bourdain thing, then we're going to start seeing reactions to this from these companies. That's why I think it's great to be talking about this stuff now, because the more people know about it, the less of a threat it is. I'm curious also about the economics of this, like whether this will be something that

Starting point is 00:30:27 will be democratized for everybody that like anyone and their mom, if they are willing to record 90 minutes of audio, could theoretically make a voice clone of themselves, or if this is going to kind of stay at the higher end of cost where you have to kind of be willing to dedicate the time and money and also maybe just use it for economic gains, like for advertisement reading like we've talked about. I think it'll really come down in terms of cost and training data needed. I know, you know, you've had to record 60 minutes, 90 minutes of audio to get your personal clones, but I think that'll come down and probably we'll see it with 10 minutes of audio or something like that.

Starting point is 00:31:02 Oh, actually, you just have a phone call and it gets enough. I think that'll definitely happen. I don't think it'll be economically useful to everyone, but I think it will be interesting and fun. You know, creating a voice clone that, for example, you know, say you're playing a video game and you design your character at the beginning of it and maybe you make them look like you and maybe you record five minutes of audio to make it sound like you. So when you're out in the video game world, your character speaks with your voice.

Starting point is 00:31:29 And I feel applications like that could become quite common and quite accessible, you know, within five, ten years or something like that. But, you know, it'll take that long to trickle in. I think that's a very plausible time frame. But another one would be a sort of story time app for children in which a parent would create a voice clone of themselves. And then they could feed that into a little box, a little app that then reads all their child's favorite stories in their voice. So if they are traveling in another country, if they are unavailable in some way, then they will still be able to speak to their child. And you could actually have that with a sort of voice assistant

Starting point is 00:32:03 where your child gets to speak to a voice assistant that sounds like mum and dad. And that won't be everyone's cup of tea. Definitely not, but I can imagine some people who would like that. Yeah, so I think there are lots of these little use cases that will come out there. I think the big impact is going to be in the world of celebrity and in the world of media and entertainment, which you've obviously discussed already. But I think, as ever with this tech, there are always unexpected ways that it shows up in the world. And I'm really, I'm really interested to see how this one does show up. It's interesting to me because in this episode we talk about how some of these celebrities or voiceover talent are going to have to protect themselves just through contracts.

Starting point is 00:32:42 Like they have lawyers, agents, everybody who's willing to kind of make sure that their voice is protected. And I'm just wondering if like the practical person might have to start thinking about these things like when they die. If in their will they're going to be like the rights to my voice, not that they're famous or anything, but just purely like, you know, you can leave your Facebook page to your family. I'm going to leave my voice recordings to my family to not do what they wish with it. Absolutely. I mean, I feel there's already companies that, as you say, that look after your digital assets. And I think the voice will be added to that pile. And there may be some people who are quite happy with saying,

Starting point is 00:33:16 you know what, I'm not going to get to speak to my great grandkids, but they might want to create, my children might want to create a voice clone of me so they can speak to their great grandpa. But there might be some people who are uncomfortable with that. And yeah, they want to include that sort of provision in their will that they are not to be reanimated using this AI technology. I can absolutely see that happening. So I've obviously made some voice clones of myself. Are you thinking about, do you want to make your own voice clone? Have you done it already?

Starting point is 00:33:45 I very much want to make my own voice clone. I'm actually in the process of making one now. And then I can have conversations with myself every day of the week. Because actually my concern here is, I'm like, no diss to Veritone. But I'm like, okay, after this episode's done, I got to email them and tell them to delete my data ASAP. So that doesn't concern you as much. Okay, now you've said it. Yeah, it really concerns me.

Starting point is 00:34:08 You know, yeah, look, I would like a clone, and I would keep it in a little box on my desktop, and no one would ever be able to use it but me. But you're totally right. Do I want it to be out there? Would I be comfortable if there was a website that said anyone can talk like James Vincent? God knows why they'd want to. But anyone can do it, and they can type in it and make me say whatever. That would make me hugely uncomfortable.

Starting point is 00:34:29 No, I would, thank you. I'm going to go delete all that data. Exactly. It's the internet-connected part of it. I think that's a little scarier. Like, again, if you just had this little effect in some Adobe program where you could just be like, today I'm going to turn on my voice. Okay, that's sort of fun.

Starting point is 00:34:44 But the idea that it might have to live elsewhere is the scary part, I guess. Yeah. It's when it's outside your control, that's when it becomes a threat. Right. Well, what a lovely positive note to end on. No, but this has been a great discussion. I really, really appreciate you coming on. And everyone who's listening to this episode is going to hear you in our future episodes.

Starting point is 00:35:04 So hopefully they tune in to hear more of your thoughts. Thank you so much for helping out and being on and giving us, like I said, the real take. No problem, Ashley. Absolute pleasure to talk. And I look forward to speaking again in the future. Or me or my clone, who knows. Thanks for listening to the first episode of the Vergecast AI series. This podcast was made by producer Liam James, senior audio director, Andrew Marino,

Starting point is 00:35:36 senior reporter James Vincent, and me, senior reporter Ashley Carman. See you next week.
