TED Talks Daily - How AI models steal creative work — and what to do about it | Ed Newton-Rex
Episode Date: March 14, 2025
Generative AI is built on three key resources: people, compute and data. While companies invest heavily in the first two, they often use unlicensed creative work as training data without permission or payment, a practice that pits AI against the very creators it relies on. AI expert Ed Newton-Rex has a solution: licensing. He unpacks the dark side of today's AI models and outlines a plan to ensure that both AI companies and creators can thrive together.
Transcript
What if everything you thought you knew about your own child was a lie?
Ellen Pompeo and Mark Duplass star in Good American Family.
In this all-new limited series, a couple adopt what they believe is an 8-year-old girl, but
when concerns soon arise, it forces them to question her actual identity.
Told from multiple perspectives, Good American Family is inspired by the true events behind
the disturbing story of Natalia Grace, and unpacks the case
that spiraled from private suspicion to public spectacle.
Streaming March 19th, only on Disney+.
Support for this show comes from Airbnb.
Last summer my family and I had an amazing Airbnb stay while adventuring in Playa del
Carmen.
It was so much fun to bounce around in ATVs,
explore cool caves, and snorkel in subterranean rivers. Vacations like these
are never long enough, but perhaps I could take advantage of my empty home by
hosting it on Airbnb while I'm away. And then I could use the extra income to
stay a few more days on my next Mexico trip. It seems like a smart thing to do
since my house sits empty while I'm away. We could zipline into
even more cenotes on our next visit to Mexico. Your home might be worth more
than you think. Find out how much at airbnb.ca slash host. This episode is
sponsored by Cozey. You know how daunting it can be to transform your living space.
Well, there's this Canadian furniture company called Cozey that's aiming to make that process
a whole lot easier.
Cozey is all about blending style with practicality.
Their furniture is customizable, so people can start small and add pieces as they go.
And get this, they've got this AR feature that lets you see how the furniture looks
in your space before you buy.
Pretty cool, right?
They've also launched the new Mistral Outdoor Dining Collection.
It's designed for creating the ultimate patio setup with powder-coated aluminum furniture
that's both durable and easy to store.
Cozey offers free swatches and quick two-to-five-day shipping.
Seems like they're really trying to simplify the whole furniture buying process.
So if you are thinking about giving your space a makeover, you might want to check it out.
Transform your living space today with Cozey.
Visit Cozey.ca, spelled C-O-Z-E-Y,
to start customizing your furniture.
You're listening to TED Talks Daily, where we bring you new ideas to spark your curiosity every day.
I'm your host, Elise Hu.
As a writer, I spend countless hours on the work I put out into the world, and of course,
I want to be credited for it.
Bylines matter.
That's why this talk by Ed Newton-Rex hits close to home.
He's a generative AI expert, and he shares what he considers to be
the dark side of AI today.
AI models training on, or, as he puts it, stealing, the unlicensed work of millions of creators.
He lays out his vision for a world where AI
and creative industries can exist symbiotically.
Coming up.
The technology and vision behind generative AI is amazing.
But stealing the work of the world's creators to build it is not.
There are three key things that AI companies need to build their models.
Three key resources. People, compute, and data.
That is, engineers to build the models, GPUs to run the
training process, and data to train the models on.
AI companies spend vast sums on the first two, sometimes a
million dollars per engineer and up to a billion dollars per model.
But they expect to take the third resource, training data, for free.
Right now, many AI companies train on creative work
they haven't paid for or even asked permission to use.
This is unfair and unsustainable.
But if we reset and license our training data, we can build a
better generative AI ecosystem that works for everyone, both the AI companies
themselves and the creators, without whose work these models would not exist.
Most AI companies today do not license the majority of their training data.
They use web scrapers to find, download, and train on
as much content as they can gather.
They're often pretty secretive about what they do train on,
but what's clear is that training on copyrighted work
without a license is rife.
For instance, when the Mozilla Foundation looked at 47 large language models
released between 2019 and 2023,
they found that 64% of them were trained in part
on Common Crawl, a data set that includes copyrighted works such as newspaper articles
from major publications, and a further 21% didn't reveal enough information to know either
way.
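For context, Common Crawl is a freely downloadable web archive, and anyone can query its public URL index to check whether a given site's pages appear in a crawl. Here is a minimal Python sketch of such a query, not from the talk; the crawl ID and domain are illustrative, and the endpoint may rate-limit or return no matches.

```python
import json
import urllib.request

# A hedged sketch: query Common Crawl's public URL index to see whether a
# domain's pages appear in a given crawl. The crawl ID and domain below are
# illustrative; a query with no matches returns HTTP 404.
crawl_id = "CC-MAIN-2023-50"
domain = "example.com"
url = f"https://index.commoncrawl.org/{crawl_id}-index?url={domain}/*&output=json"

with urllib.request.urlopen(url) as resp:
    # The index returns one JSON object per matching capture, one per line.
    for line in resp.read().decode("utf-8").splitlines()[:5]:
        record = json.loads(line)
        print(record["timestamp"], record["url"])
```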
Training on copyrighted work without a license has rapidly become standard across much of the generative AI industry.
But this training, this unlicensed training on creative work, has serious negative consequences for the people behind that work.
And this is for the simple reason that generative AI competes with its training data.
This is not the narrative that AI companies like to portray.
We like to talk
about democratisation, about letting more people be creative, but the fact that AI competes
with its training data is inescapable. A large language model trained on short stories can
create competing short stories. An AI image model trained on stock images can create competing
stock images. An AI music model trained on music that's licensed to TV shows can create competing music to license to TV shows.
These models, however imperfect, are so quick and
easy to use that this competition is inevitable.
And this isn't just theoretical.
Generative AI is still pretty new, but we're already seeing
exactly the sort of effects you'd expect in a world in which generative AI competes with its training data.
For instance, the well-known filmmaker Ram Gopal Varma
recently said that he'll use AI music in all his projects going forward.
Indeed, there are multiple reports of people starting to listen to AI music
in place of human-produced music, and recently an AI song hit number 48 in the German charts.
In all these cases, AI music is competing with the songs it was trained on.
Or take Kelly McKernan. Kelly is an artist from Nashville.
For 10 years, they made enough money selling their work for art to be their full-time income. But in 2022, a data set that included their works
was used to train a popular AI image model.
Their name was one of many used by huge numbers of people to create art in the style of specific human artists.
Kelly's income fell by 33% almost overnight.
Illustrators around the world report similar stories of being outcompeted by AI models they have reason to believe were trained on their work.
The freelance platform Upwork wrote a white paper
in which they looked at the effects that they've seen
on the job market of Generative AI.
They looked at how job postings on their platform have changed
since the introduction of ChatGPT.
And sure enough, they found exactly what you'd expect: that generative AI has reduced the demand for freelance writing tasks by 8%, a figure that increases to 18% if you look only at what they term lower-value tasks.
So the initial data we have, plus the individual stories we hear, all align with the logical assumption.
Generative AI competes with the work it's trained on.
It's so quick and easy to use, it's inevitable,
and it competes with the people behind that work.
Now, creators argue this training is illegal. The legal framework of
copyright affords creators the exclusive right to authorize copies of their work, and AI
training involves copying. Here in the US, many AI companies argue that training AI falls
under the fair use copyright exception, which allows unlicensed copying in a limited set of circumstances, such as creating parodies of a work. Creators and rights holders strongly disagree, saying there's no way this narrow exception can be used to legitimize the mass exploitation of creative work to create automated competitors to that work. And for the record, I entirely agree.
Of course, this question is as yet untested in the courts, and there are currently around
30 ongoing lawsuits brought by rights holders against AI companies, which will help to address
this question.
But this will take time, and creators are suffering from what they see as unjust competition
right now.
So they propose a solution that has been used before and has worked.
Licensing.
If a commercial entity wants to use copyrighted work,
be it for merchandise manufacturing or building a streaming service,
they license that work.
Now, AI companies have a bunch of reasons why this shouldn't apply to them.
There's the fair use legal exception that I've already mentioned.
There's also the argument that since humans can train on copyrighted work without a license,
AI should be allowed to too.
But this is a very hard claim to justify.
Artists have been learning from each other for centuries.
When you create, you expect other people to learn from you.
You learn from a range of sources, from other art to textbooks to taking lessons, much of which you or someone else paid for, supporting the entire ecosystem.
In generative AI, commercial entities valued at millions or billions of dollars
scrape as much content as they can,
often against creators' will, without payment,
making multiple copies along the way,
which are subject to copyright law,
to create a highly scalable competitor to what they're copying.
So scalable, in fact, that there are AI image generators,
estimated to be making 2.5 million images a day,
and AI song generators, outputting 10 songs a second.
To argue that human learning and AI training are the same
and should be treated the same is preposterous.
AI companies also argue that licensing their training data
would be impractical.
They use so much training data, they say,
that individual payments to each creator behind the data
would be small. But this is true of many content licensing markets. Creators still want to
get paid, even if the payments are small. AI companies also argue that they simply use
too much data for licensing to even be feasible. But this is harder and harder to believe in
a world in which there is such a range of data sets that you can access with permission.
You can license data from media companies.
There have been 27 major deals between AI companies and rights holders in the last year alone, and that's to say nothing of the smaller ones that don't get reported.
There are marketplaces of training data where you can get more data.
You can expand this with data that's in the public domain,
that is, in which no copyright exists,
like the 500 billion word data set Common Corpus.
You can expand this further with synthetic data, that is, data that's itself created by an AI model, in which usually no copyright exists.
So there are multiple options available to you
if you want to build your model without infringing copyright.
But the strongest evidence that it's possible to license all your data is that there are multiple companies doing it already.
I know because I've done it myself. I've worked in what we now call generative AI for over a decade, and last September my team at Stability AI released an AI music model that was trained on licensed music.
A number of other companies have done the same thing, and I founded Fairly Trained in order to highlight this fact and these companies.
Fairly Trained is a non-profit that certifies generative AI companies that don't train on copyrighted work without a license.
We launched in January of this year and we've already certified 18 companies.
Now these companies take a variety of approaches to licensing their training data.
We have an AI voice model that's trained on individual voices it has licensed.
We have an AI music model that has licensed more than 40 music catalogs.
We have a large language model that's trained only on data
in the public domain, mostly from government documents
and records.
We have companies who have paid upfront fees for their data.
We have companies who share their revenue with their data
providers.
There is no one answer to the exact specifics
of how one of these licensing deals has to work.
The beauty of licensing is that the two parties can come together and figure out what works
for them.
And this is happening more and more.
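Since the talk stresses that there is no single template for these deals, here is a purely hypothetical Python sketch of one arrangement it mentions, a revenue share split pro rata among data providers. The function, the usage metric, and the numbers are all invented for illustration.

```python
# Hypothetical sketch of a revenue-share licensing structure: a revenue pool
# is split among data providers in proportion to how much each contributed.
def revenue_share(total_revenue: float, usage: dict[str, float]) -> dict[str, float]:
    """Split a revenue pool among providers pro rata by usage share."""
    total_usage = sum(usage.values())
    return {name: total_revenue * u / total_usage for name, u in usage.items()}

# e.g. minutes of licensed audio each catalog contributed to training
payouts = revenue_share(
    100_000.0, {"catalog_a": 600.0, "catalog_b": 300.0, "catalog_c": 100.0}
)
print(payouts)  # {'catalog_a': 60000.0, 'catalog_b': 30000.0, 'catalog_c': 10000.0}
```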
Now you will hear that a requirement to license training data somehow stifles innovation.
That it's only the big AI companies that can afford these huge upfront licensing fees.
But in reality, it's the smaller startups who are
bothering to license all their data, and they're doing so often without hefty
upfront licensing fees, but using models such as revenue shares.
And there's another major upside to licensing your training data.
All of this training on copyrighted work is forcing publishers to shut off access to their content.
The Data Provenance Initiative looked
at 14,000 websites commonly used in AI training sets.
And they found that, over the course of a single year, looking only at the domains of the highest value for AI training, the share restricted via opt-outs or terms of service increased from 3% to between 20% and 33%.
The web is being gradually closed due to unlicensed training.
Now, this is bad for new AI models and for new entrants to the market, but also for everyone who benefits from an open internet: researchers, consumers, and more.
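To make the opt-out mechanism concrete: one of the main ways publishers restrict crawlers is the robots.txt convention. Below is a minimal Python sketch, not from the talk, of how a crawler could check a site's robots.txt before fetching a page for a training set; the user-agent string "ExampleAIBot" and the URLs are hypothetical.

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt (assumes the host is reachable).
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

url = "https://example.com/articles/some-article"
if parser.can_fetch("ExampleAIBot", url):
    print("Allowed to crawl:", url)
else:
    print("Publisher has opted out; skipping:", url)
```

Honoring these signals is voluntary, which is part of why the study tracks terms-of-service restrictions as well.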
It should come as no surprise that the general public do not agree with AI companies about
what they can train their models on.
One poll from the AI Policy Institute in April asked people about the common policy among
AI companies of training on publicly available data.
This is data that is openly available online, which of course includes a lot of copyrighted work, like news articles and often pirated media.
Sixty percent of people said this should not be allowed,
versus only 19 percent who said it should.
The same poll went on to ask
whether AI companies should compensate data providers.
Seventy-four percent said yes, and only 9 percent said no.
Time and time again, when we ask the public these questions, they show support
for requirements around permission and payment and a rejection of the notion
that something being publicly available somehow makes it fair game.
And the people who make the art that society consumes feel the same way.
Today we launched a statement on AI training, a short, simple open letter which reads: "The unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted."
This has already been signed by 11,000 creators around the world, and counting, including Nobel Prize-winning authors, Academy Award-winning actors and Oscar-winning composers.
And if you agree with this sentiment, I encourage you to sign it today at aitrainingstatement.org.
What this statement and previous ones like it make abundantly clear is that these artists, these creators,
view the unlicensed training on their work by generative AI models
as totally unjust and potentially catastrophic to their professions.
So if you are an advocate for unlicensed AI training,
just remember that the people who wrote the music that you're listening to
and the books you're reading probably disagree.
So where does this leave us?
Well, right now, many of the world's artists, writers,
musicians, creators, straight up hate generative AI.
And we know from their own words that one of the reasons for this
is that we're training on their work without asking them.
But it doesn't have to be this way.
The AI industry and the creative industries can be
and should be mutually beneficial.
But for this mutually beneficial relationship to emerge,
we have to start from a position of respect
for the value of the works being trained on
and the rights of the people who made them.
I'm not arguing that all AI development should be halted. I'm not arguing that AI should not exist.
What I'm arguing is that the resources used to build generative AI
should be paid for.
Licensing is hard work.
It will slow you down in the short term,
but you'll ultimately reach exactly the same point,
models that are just as capable, just as powerful,
and you'll do so without forcing the world's publishers to batten down
the hatches and destroy the commons,
and without pitting the world's creators against you.
So I hope that more AI companies will follow the example set
by those we've certified at Fairly Trained
and license all their training data.
I hope that employees at these companies
will demand this of their employers.
And I hope that everyone who uses generative AI
will ask what their favorite models were trained on.
There is a future in which generative AI and human creativity
can coexist, not just peacefully, but symbiotically.
It's been a rough start, but it's not too late to change course.
Thank you.
If you're curious about TED's curation, find out more at ted.com slash curation guidelines. And that's it for today's show.
TED Talks Daily is part of the TED Audio Collective.
This episode was produced and edited by our team,
Martha Estefanos, Oliver Friedman, Brian Green,
Lucy Little, Alejandra Salazar, and Tansica Sunkamaneevongse.
It was mixed by Christopher Faisy-Bogan, with additional support from Emma Taubner and Daniela Balarezo.
I'm Elise Hu.
I'll be back tomorrow with a fresh idea for your feed.
Thanks for listening.