The Vergecast - AI voices are taking over the internet

Episode Date: September 11, 2023

In part one of The Vergecast's AI mini series, David Pierce dives into the boom of voice synthesis and artificially generated speech. The process is a lot more accessible for everyone, but how realistic can it sound? Further reading: "AI voices are taking over the internet"; "Everyone will be able to clone their voice in the future". Email us at vergecast@theverge.com or call us at 866-VERGE11, we'd love to hear from you. Learn more about your ad choices. Visit podcastchoices.com/adchoices

Transcript
Starting point is 00:00:00 Welcome to the Vergecast, the flagship podcast of Neurological Engines for Overdubbing. I'm David Pierce, and hang on one second. Actually, I just got to finish something. Once upon a time, a white snake and a green snake, living in a remote mountain, became immortal and obtained superpowers after centuries of practice. Okay, sorry, I'm back. And before you worry that I just, like, had a stroke or something live on the microphone, I should tell you that I'm actually in the middle of training an AI to mimic
Starting point is 00:00:30 the sound of my voice. This one is on the iPhone. It's called Personal Voice. And products like this are what we're going to talk about today. The idea of being able to create your own AI voice clone has been around for a while. We actually talked a bunch about it on this show in 2021. I'll link to it in the show notes. A lot of it holds up really well. But over the last couple of years, it's gotten both drastically easier to make a vocal AI and the results have gotten drastically better. You can even do it on your phone like I'm doing now, with just a few minutes of really awkward reading of sentences. Here's what mine sounds like.
Starting point is 00:01:06 Hi, I'm David Pierce's AI iPhone Voice. I'm kind of like David, but kind of not. So today on the show, we're going to dive into the boom of voice AI. And then we're going to try to figure out if I can actually make something that sounds like me. This is the Vergecast. We'll be right back. Support for the show comes from Retool. Too many companies run critical operations on duct-taped spreadsheets, Slack workflows, and whatever else they could cobble together.
Starting point is 00:01:39 Not because they want to, but because building internal tools means weeks of waiting on someone else's backlog. That's where Retool comes in. Build custom internal tools just by describing what you need. Prompt something like, build me a revenue dashboard on our Salesforce data. And Retool actually builds it on your company's data and your cloud with enterprise security built in. Go to Retool.com slash Vergecast. We all need to retool how we build software. What's up, y'all. I'm Skylar Diggins, seven-time WNBA All-Star, Olympic gold medalist, and mom.
Starting point is 00:02:15 And I'm Cassidy Hubbard, host and reporter for nearly 20 years covering the biggest names and stories in sports. And mom. And this is Am Mom, a community for athletes, game changers, and moms of all kinds. Dropping May 14th. Tap in with us. All right, we're back. Before we get too deep into the world of AI voices, let's try and quickly understand why this is such a big thing right now.
Starting point is 00:02:42 And as far as I can tell, there are basically three reasons this kind of tech is booming. The first is because audio in general is booming, with podcasts and voice messages and those generated spoken captions you hear on TikTok and all sorts of other things. If you think about it, you probably hear the Internet a lot more than you used to. And like any other creative feature on the Internet, a lot of tools exist to help you make it. One app we use is called Descript.
Starting point is 00:03:10 It's an app that a lot of people use for editing audio and video. It has this feature called Overdub Voices. And one of Descript's coolest features in general is that you can edit audio and video basically by editing text. You import a file, it gives you a transcript, and if you delete the word um in the transcript, it'll also try and seamlessly delete the um from the actual audio file. It's not perfect, but it works pretty well, and it kind of feels like magic to use it. With overdub, Descript can go even further.
Starting point is 00:03:39 Let's say you forgot something or you stumbled on a word or transition. You can now make an AI copy of your own voice and insert new audio just by typing the text you want to appear. So let's say I say the sentence, the iPhone came out in 2007 when Steve Jobs announced it as three things, a widescreen iPod, a revolutionary mobile phone, and a breakthrough internet communicator. All right, let's just hear that sentence back. The iPhone came out in 2007 when Steve Jobs announced it as three things, a widescreen iPod, a revolutionary mobile phone, and a breakthrough internet communicator.
Starting point is 00:04:14 Wait, sorry, I got that slightly wrong. He called it a breakthrough internet communications device. I could re-record that whole thing, or I could just go into Descript, re-type the transcript, and here's what I get. The iPhone came out in 2007 when Steve Jobs announced it as three things, a widescreen iPod, a revolutionary mobile phone, and a breakthrough internet communications device. It's not bad. I wouldn't want to listen to a whole hour of that voice, I don't think, but in small bits and especially in the context of something larger, I'm not even sure you'd always notice it there. There are other apps out there like Podcastle doing the same kind of thing, and I suspect you're going to see tools like this show up anywhere that people make audio. Okay, so that's the first use case. The second is kind of the flip side.
Starting point is 00:05:01 There are also a bunch of tools out there using AI voices to read written stories out loud. The Atlantic, for instance, is working with a company called ElevenLabs to have an AI narrator read some of the stories on the website. For years, the American approach to protein has been a never-ending quest for more. On average, each person in the United States puts away roughly 300 pounds of meat a year. Again, it's not perfect, and I don't know that it always sounds like a person, but I kind of can't believe how good it is. It wasn't that long ago, by the way, that these generated voices sounded like flat, toneless robots. Like, here's Sophia, the robot from 2016 that was considered to be one of the most advanced robots ever created, on The Tonight Show. I traveled to over 25 countries,
Starting point is 00:05:48 appeared on the cover of Cosmopolitan magazine, met German Chancellor Angela Merkel, and the actor Will Smith, and became Twitter friends with Chrissy Teigen. And here is that same thing Sophia said, which I just typed into the generator on the ElevenLabs website. I picked a voice named Grace, put this in there, clicked generate, and after about 10 seconds, this is what came out. I traveled to over 25 countries, appeared on the cover of Cosmopolitan magazine, met German Chancellor Angela Merkel and the actor Will Smith, and became Twitter friends with Chrissy Teigen. I mean, it's night and day, right?
Starting point is 00:06:24 I think you're going to start to see this everywhere. Articles, whole websites, entire books, all read aloud, all using generated AI. And the product itself is actually starting to be pretty good. It is also, of course, a huge ethical and legal disaster. All the way back in 2019, again, before the tech was nearly as good as it is now, a bunch of publishers sued Audible over a feature called Audible Captions, which would read a book aloud to you as you looked at the page.
Starting point is 00:06:54 Seems like a normal, useful feature, right? Also seems like an existential threat to the entire idea and industry of audiobooks. Audible and the publishers settled in 2020, but that was only the beginning of the bigger questions here. Some audiobook narrators have worried that their voices are being used to train algorithms that might someday replace them, and they're not really wrong. All this is not at all theoretical. If you go into the Apple Books app and search for AI narration, you'll find a bunch of audiobooks that say they are narrated by Apple Books. Apple says that that means that they are, quote, narrated by a digital voice based on a human narrator. Here's just a sample from a book called Language of Love by Kristen Etheridge.
Starting point is 00:07:34 A lot of these AI-narrated books are romance novels, by the way, for whatever reason. And this one sounds to my ears, shockingly, like a human-read audiobook. He raised his fist and rapped on the solid wood. After about 30 seconds of silence, the distinct sound of the lock turning broke through. A woman of average height stepped into the sliver of an opening. The other version of this that you might have heard about, or even encountered without knowing, is celebrity AI voices. Like, there was a pretty big backlash a couple of years ago
Starting point is 00:08:04 when a documentary about Anthony Bourdain called Roadrunner, which came out after Bourdain's death, trained an AI model on his voice and then used it to generate narration for the film. The director, Morgan Neville, said that he only used the AI to say words that Bourdain himself had written, which was an ethical choice for him, and I guess I can see where he's coming from.
Starting point is 00:08:24 I still don't know whether any of that feels okay to me or awful. It's all really complicated. And examples like this exist everywhere. AI helped Val Kilmer speak after he lost his voice due to throat cancer. Lots of celebrities trained AI to do things like give you Waze directions. All this, too, is pretty controversial. One of the things Hollywood is on strike about right now is AI's potential to scan their likeness so that they never need to be actually used in films again.
Starting point is 00:08:50 Imagine an AI trained on Morgan Freeman's voice that could narrate every documentary ever without paying Freeman a dime. This stuff all gets really messy, really fast. Okay, and then we have the third and probably most newly mainstream use case here, accessibility. Apple launched a new feature this year in iOS 17 called Live Speech, which you can use to type something and have it said out loud in phone calls or even for in-person conversation. And when you pair it with Personal Voice, another new feature this year, the one I was testing up at the beginning of the show, you can create an AI version of your own voice
Starting point is 00:09:25 just by recording yourself talking into your phone and then use that to generate your live speech. It's all a little like the incredibly powerful system that the late Stephen Hawking had, which let him speak through a computer. It can offer a list of predictions based on an analysis of the English language in my previous usage. Okay, again, not to keep belaboring this point, but that video is from eight years ago. Think about how much better a system like Hawking's would sound today. Although, I have to say, I do love how much Hawking embraced that robotic sound and made it his own.
Starting point is 00:09:58 It has become my trademark, and I wouldn't change it for a more natural voice with a British accent. Samsung is building a similar feature with Bixby, so that you can now speak with your own voice through your Galaxy phone. Works kind of the same way. And on a similar line, lots of people have used screen readers for years, which are able to speak aloud whatever's on a screen. Those are also getting vastly better, both because the voices are improving and because AI systems are getting much better at actually understanding the contents of webpages and apps and anything else you're looking at. All of that is super exciting. And I'm also really into the idea of being able to use machine translation in these voices to be able to speak simultaneously in lots of languages. Someday, not that far from now, this podcast with my voice
Starting point is 00:10:44 could be available in basically any language on earth. That's really cool. It's also a really hard problem and we're definitely not there yet. But in general, being able to speak with your own voice even when you can't do that is a big deal. It's complicated, morally and legally and in so many other ways. But it's a big deal nonetheless. All right, we need to take a quick break. And then when we come back, we're going to investigate what it takes to actually make an AI voice
Starting point is 00:11:11 and see if it's really possible to do it well. We'll be right back. Support for this show comes from Shopify. Every thriving, successful business has to start somewhere. A good place to start is a relatively simple question. What if, given the right tools, I really put my all into this. One tool that can help grow your sprouting business to new heights is Shopify.
Starting point is 00:11:48 Millions of businesses around the world rely on Shopify for e-commerce. They offer a host of helpful tools you can take advantage of, from payment processing to analytics to website design. Their design studio includes hundreds of templates to help you create the exact website you've been envisioning for your business. If you're wondering, what if I need help? Then no worries, because you're never left to fend for yourself. Shopify's award-winning customer support is available 24-7.
Starting point is 00:12:16 It's time to turn those what-ifs into a thriving business with Shopify today. Sign up for your $1 per month trial today at Shopify.com slash vergecast. Go to Shopify.com slash vergecast. That's Shopify.com slash vergecast. Support for the show comes from Upwork. The days of doing it all, all by yourself, are over. There's no romance in burning out while you're trying to scale. Instead, you can check out Upwork. Upwork helps grow your business by giving you fast access to specialized talent across
Starting point is 00:12:56 more than 125 categories so you can fill skill gaps, launch projects faster, and scale without committing to full-time headcount. And finding the right talent is easy. You can browse profiles, review past work, and get help scoping the role so you can get started quickly. Seriously, you could connect with the right freelancer in just a few hours, especially when you sign up with Business Plus. Their AI-powered shortlisting pairs you with the top 1% of talent in under six hours. No endless searches required. You can visit upwork.com right now to post your job for free.
Starting point is 00:13:33 That's upwork.com to connect with top talent ready to help your business grow. That's U-P-W-O-R-K dot com. Upwork.com. Welcome back. Let's make some AI voices, shall we? The idea with most of these systems is basically the same, because the way you train an AI model in general is just to give it lots and lots and lots of data and just kind of watch it churn through and see what it learns. But in the systems I've been trying, there is one important distinction.
Starting point is 00:14:07 Some tools, like Descript's, just ask for a huge batch of audio. They'll give you a script if you want it, but really the goal is just to upload hours and hours and hours of the sound of your voice and see what happens. Others go one step further and will ask you to record yourself saying a series of specific and often weird and often thoroughly random things. So like when I opened up Podcastle to create my digital AI voice, it had a lot of really specific instructions. Okay, now, time to do 70 sentences. Here we go. Everything seems better in summer. I asked my dad if he could help me.
Starting point is 00:14:44 Look at that lovely cat. There you have it. 70 sentences later, I waited a while, and the next day I got an email saying my digital voice was ready. You want to hear it? Hi, I'm David Pierce. Except not really. I'm an AI bot, but I've been trained to sound like David Pierce. Is this convincing?
Starting point is 00:15:05 That one's not great, not super impressed. But let's give it a little more to work with and see how we do. While we're on the subject of ethically dubious things, I'm going to grab the text of one of my favorite TV moments ever. It's a Dwight Schrute speech from The Office. What is my perfect crime? I break into Tiffany's at midnight. Do I go for the vault? No, I go for the chandelier.
Starting point is 00:15:27 It's priceless. As I'm taking it down, a woman catches me. She tells me to stop. It's her father's business. She's Tiffany. I say no. We make love all night. In the morning, the cops come and I escape in one of their uniforms.
Starting point is 00:15:39 I tell her to meet me in Mexico, but I go to Canada. I don't trust her. Besides, I like the cold. 30 years later, I get a postcard. I have a son, and he's the chief of police. This is where the story gets interesting. I tell Tiffany to meet me in Paris by the Trocadero. She's been waiting for me all these years.
Starting point is 00:15:56 She's never taken another lover. I don't care. I don't show up. I go to Berlin. That's where I stashed the chandelier. I mean, it's just like 60 perfect seconds. I love it so much. Let's have AI David take a run at that speech.
Starting point is 00:16:08 Here goes. What is my perfect crime? I break into Tiffany's at midnight. Do I go for the fall? No, I go for the chandelier. It's priceless. As I'm taking it down, a woman catches me. She tells me to stop.
Starting point is 00:16:22 It's her father's business. She's Tiffany. I say no. We make love all night. Okay, so I hear that. And it's like, yeah, that sounds like me. But it also doesn't sound like a human, if that makes sense. In general, Podcastle was really easy to use, but I'm not terribly impressed with the outcome.
Starting point is 00:16:40 So now let's try Descript, which is, I think, in general, a significantly more sophisticated piece of audio software. It too is a process. So I go to voices. Yeah, we're creating a new voice. All right, we'll do a few recent Friday Vergecasts. How about that? Preparing and uploading. Okay, it finally uploaded all my stuff. Let's go. We click submit training data. And now it says, ready to create your overdub voice, record your voice ID, press record, and read the statement below. All right, let's do it. I hit stop. We submit it. We are uploading. It says putting the finishing touches on your training project. Your voice is now training. We'll email you when it's done. Here we go. I ended up submitting about four hours of my own voice to make this happen because luckily
Starting point is 00:17:24 I already have hours of my voice recorded from just being on the Vergecast. And like with Podcastle, it took a while to process everything and then I got an email that my voice was ready, which is a very funny email to receive. Here's what it sounded like. Hi, I'm David Pierce, the AI David Pierce, the Descript version of AI David Pierce. How do I sound? All I hear in that is that I feel like that's what I might sound like if I'd gone to like a really fancy New England boarding school and also had a really, really nasty head cold. But I don't think in general that one sounds like me at all, really. But let's try it again with our Dwight Schrute speech. What is my perfect crime?
Starting point is 00:18:02 I break into Tiffany's at midnight. Do I go for the vault? No, I go for the chandelier. It's priceless. I have a son and he's the chief of police. This is where the story gets interesting. I tell Tiffany to meet me in Paris by the Trocadero. She's been waiting for me all these years.
Starting point is 00:18:16 She's never taken another lover. I don't show up. I go to Berlin. That's where I stashed the chambolier. The strange thing about this one is the intonation. The kind of ebb and flow of the sentences here. It's really not bad. It's a little stilted, but it does move more or less like you would expect a human to talk.
Starting point is 00:18:34 It just doesn't sound right. And it seems to skip a bunch of words and sentences when it doesn't quite know what to do. My takeaway is basically, Descript is fine for those little filler words like we were doing earlier, but that's about it. I would say in general so far, my takeaway is that these things aren't amazing, but they're decent, and honestly, it's really easy to make them, like, much easier than I expected. So let's keep going.
Starting point is 00:18:58 Let's do a couple more. ElevenLabs, the company we've talked about a bunch so far, has the simplest process of any that I've seen. You just sign up, upload a few minutes of audio. It actually explicitly says you only need about five minutes and that anything more is just overkill, and you're off and running. So I added some Vergecast stuff, about 15 minutes in all because, you know, I'm an overachiever, and then just waited a while.
Starting point is 00:19:21 This one only took a couple of minutes, and we were up and running. Hi, it's AI David Pierce again. This time I'm made by ElevenLabs, but I'm still me. Sort of, I think, you know what I mean. I'm not going to lie. That one kind of gave me goosebumps. It goes a little fast, like I don't think that's how you'd say that sentence, but this is way better than anything else I've tried or even heard. And it took a grand total of about 90 seconds to put together. What's weird, though, is that it's not always this good. I clicked generate again with the same text, and it spit back something subtly different and I think slightly worse. Hi, it's AI David Pierce again. This time I'm made by ElevenLabs, but I'm still me,
Starting point is 00:20:01 sort of, I think. You know what I mean? Again, really good, better than anything else we've tried, but not quite as good as that first one, which is odd. It's just that pause in the first one, right before the word sort of, is like exactly how I would have said that in real life. I still kind of can't get over it. It freaked me out. Anyway, let's try this model with our Dwight Schrute speech. What is my perfect crime?
Starting point is 00:20:23 I break into Tiffany's at midnight. Do I go for the vault? No, I go for the chandelier. It's priceless. As I'm taking it down, a woman catches me. She tells me to stop. It's her father's business. She's Tiffany.
Starting point is 00:20:36 I say no. We make love all night. In the morning, the cops come, and I escape in one of their uniforms. I tell her to meet me in Mexico, but I go to Canada. I don't trust her. Besides, I like the cold. 30 years later, I get a postcard.
Starting point is 00:20:51 I have a son, and he's the chief of police. This is where the story gets interesting. I tell Tiffany to meet me in Paris by the Trocadero. She's been waiting for me all these years. She's never taken another lover. I don't care. I don't show up. I go to Berlin.
Starting point is 00:21:06 That's where I stash the chandelier. That one's not perfect, and it seems to me that it got a little worse as it went on. The cadence got a little less human and a little more just kind of robot, monotone, everything takes the same time to say. You know what I mean? But I bet I could use that voice on almost anyone for a minute or two and get away with it. What is my perfect crime? Okay, let's try one more.
Starting point is 00:21:28 This is the Apple Personal Voice feature, the one that's going to come to lots of people's iPhones. I suspect a lot of people are going to set this up pretty soon. This one took, by the way, the longest by far of any to set up. So the first thing I have to do is decide, do I want to share across devices? I'm sure.
Starting point is 00:21:44 Do I want to allow apps to request to use? Why would I want that? Creating my personal voice. I'm going to read 150 phrases aloud, which may take about 15 minutes. Then it's going to generate it, and then we'll go from there. All right, let's see.
Starting point is 00:21:56 That was the best movie I've ever seen. Are you still hungry? It's a beautiful day today. It's an extension of the nearby sea. Her style of painting shows the influence of the French artists. Then, once it was set up, it also took the longest to finish. Your phone has to be charging and not in use because all the training happens on your device
Starting point is 00:22:17 and takes up a huge amount of energy. I love that it happens on device. That's good for privacy reasons. It's good for lots of reasons, but it does take a while. So it took a couple of days, but I eventually had my voice ready to go. Hi, I'm David Pierce's AI iPhone voice. I'm kind of like David, but kind of not. Are we our phones?
Starting point is 00:22:34 Are our phones us? Ordinarily, I think I would have been impressed by that, but after hearing that ElevenLabs one, I'm kind of meh on this one. Let's try the Dwight Schrute test. 30 years later, I get a postcard. I have a son, and he's the chief of police. This is where the story gets interesting.
Starting point is 00:22:52 I tell Tiffany to meet me in Paris by the Trocadero. She's been waiting for me all these years. She's never taken another lover. I don't care. I don't show up. I go to Berlin. That's where I stashed the chandelier. Still pretty okay, right? It kind of works, but nobody's going to confuse this with human David.
Starting point is 00:23:10 And in general, I actually think that might be okay. AI voices are one of those things where the better they get, the stranger they get. Seriously, that feeling I got listening to the first time ElevenLabs spit out that thing saying I'm David Pierce was genuinely kind of disconcerting. It raises all these big questions that, like with so many things about AI, we've really only begun dealing with. What does it mean that I can create a replica this good and that they're only going to get better and easier over time?
Starting point is 00:23:42 What responsibilities do I have as the person who made it and is using it, even though it's my voice? What responsibilities do other people have? What responsibilities do the services who make these voices for me have now that they have this incredibly personal thing of mine on their servers? We're having a lot of debates over AI music right now, obviously, as artists' voices are being used to train models that can make pretty convincing songs in just about anyone's voice. You go on YouTube and you can hear AI Taylor Swift sing almost anything.
Starting point is 00:24:12 You can hear AI Patrick from SpongeBob sing almost anything. All of that is going to spawn like a decade of interesting court cases and ethical debates, but those same issues are coming just for you and me in our everyday lives. How do we use these tools? How do we talk about the fact that they exist and how we're using them? Is it even possible to get the good, helpful, democratizing things from them without all the deepfakes and downsides? I don't know, but I do know it's well past time we started talking about it, because the tech is really good right now, and it's getting better really fast. That is enough AI talk for one day.
Starting point is 00:24:52 We're going to be back next week to talk even more about AI music, because I think that is one of the most interesting things in this space right now, not just because of the big heady debates about it all, but because I think there are really interesting ways that AI can both help people make music and totally change all of our ideas about what music even is. It's going to be fascinating. We will also be back in this feed on Wednesday, too, with a big episode all about this week's Apple event. But until then, AI, David, you want to do the credits? This show is produced by Andrew Marino and Liam James. Brooke Minters is our editorial director of audio. The Vergecast is a Verge production and part of the Vox Media Podcast Network. If you have thoughts, questions, ideas, or anything else, you can email Vergecasttheverge.com or call the Verge Hotline at 866 Verge 11. We'll be back on Wednesday with a special show all about the Apple event
Starting point is 00:25:45 and with all the rest of this week's news on Friday. We'll see you then. Rock and roll. Okay, not bad. Pretty good AI David for your first try, but it's Vergecast at theverge.com for the emails and we really got to work on the sign-offs. Say it with me. Rock and roll.
