TED Talks Daily - How AI models steal creative work — and what to do about it | Ed Newton-Rex
Episode Date: March 14, 2025
Generative AI is built on three key resources: people, compute and data. While companies invest heavily in the first two, they often use unlicensed creative work as training data without permission or payment, a practice that pits AI against the very creators it relies on. AI expert Ed Newton-Rex has a solution: licensing. He unpacks the dark side of today's AI models and outlines a plan to ensure that both AI companies and creators can thrive together.
Transcript
What if everything you thought you knew about your own child was a lie?
Ellen Pompeo and Mark Duplass star in Good American Family.
In this all-new limited series, a couple adopt what they believe is an 8-year-old girl, but
when concerns soon arise, it forces them to question her actual identity.
Told from multiple perspectives, Good American Family is inspired by the true events behind
the disturbing story of Natalia Grace, and unpacks the case
that spiraled from private suspicion to public spectacle.
Streaming March 19th, only on Disney+.
Support for this show comes from Airbnb.
Last summer my family and I had an amazing Airbnb stay while adventuring in Playa del
Carmen.
It was so much fun to bounce around in ATVs,
explore cool caves, and snorkel in subterranean rivers. Vacations like these
are never long enough, but perhaps I could take advantage of my empty home by
hosting it on Airbnb while I'm away. And then I could use the extra income to
stay a few more days on my next Mexico trip. It seems like a smart thing to do
since my house sits empty while I'm away. We could zipline into
even more cenotes on our next visit to Mexico. Your home might be worth more
than you think. Find out how much at airbnb.ca slash host. This episode is
sponsored by Cozey. You know how daunting it can be to transform your living space.
Well, there's this Canadian furniture company called Cozey that's aiming to make that process
a whole lot easier.
Cozey is all about blending style with practicality.
Their furniture is customizable, so people can start small and add pieces as they go.
And get this, they've got this AR feature that lets you see how the furniture looks
in your space before you buy.
Pretty cool, right?
They've also launched the new Mistral Outdoor Dining Collection.
It's designed for creating the ultimate patio setup with powder-coated aluminum furniture
that's both durable and easy to store.
Cozey offers free swatches and quick two-to-five-day shipping.
Seems like they're really trying to simplify the whole furniture buying process.
So if you are thinking about giving your space a makeover, you might want to check it out.
Transform your living space today with Cozey.
Visit Cozey.ca, spelled C-O-Z-E-Y,
to start customizing your furniture.
You're listening to TED Talks Daily, where we bring you new ideas to spark your curiosity every day.
I'm your host, Elise Hu.
As a writer, I spend countless hours on the work I put out into the world, and of course,
I want to be credited for it.
Bylines matter.
That's why this talk by Ed Newton-Rex hits close to home.
He's a generative AI expert, and he shares what he considers to be
the dark side of AI today.
AI models training on, or, as he puts it, stealing, the unlicensed work of millions of creators.
He lays out his vision for a world where AI
and creative industries can exist symbiotically.
Coming up.
The technology and vision behind generative AI is amazing.
But stealing the work of the world's creators to build it is not.
There are three key things that AI companies need to build their models.
Three key resources. People, compute, and data.
That is, engineers to build the models, GPUs to run the
training process, and data to train the models on.
AI companies spend vast sums on the first two, sometimes a
million dollars per engineer and up to a billion dollars per model.
But they expect to take the third resource, training data, for free.
Right now, many AI companies train on creative work
they haven't paid for or even asked permission to use.
This is unfair and unsustainable.
But if we reset and license our training data, we can build a
better generative AI ecosystem that works for everyone, both the AI companies
themselves and the creators, without whose work these models would not exist.
Most AI companies today do not license the majority of their training data.
They use web scrapers to find, download, and train on
as much content as they can gather.
They're often pretty secretive about what they do train on,
but what's clear is that training on copyrighted work
without a license is rife.
For instance, when the Mozilla Foundation looked at 47 large language models
released between 2019 and 2023,
they found that 64% of them were trained in part
on Common Crawl, a data set that includes copyrighted works such as newspaper articles
from major publications, and a further 21% didn't reveal enough information to know either
way.
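For context, Common Crawl is a freely downloadable web archive, and anyone can query its public URL index to check whether a given site's pages appear in a crawl. Here is a minimal Python sketch of such a query, not from the talk; the crawl ID and domain are illustrative, and the endpoint may rate-limit or return no matches.

```python
import json
import urllib.request

# A hedged sketch: query Common Crawl's public URL index to see whether a
# domain's pages appear in a given crawl. The crawl ID and domain below are
# illustrative; a query with no matches returns HTTP 404.
crawl_id = "CC-MAIN-2023-50"
domain = "example.com"
url = f"https://index.commoncrawl.org/{crawl_id}-index?url={domain}/*&output=json"

with urllib.request.urlopen(url) as resp:
    # The index returns one JSON object per matching capture, one per line.
    for line in resp.read().decode("utf-8").splitlines()[:5]:
        record = json.loads(line)
        print(record["timestamp"], record["url"])
```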
Training on copyrighted work without a license has rapidly become standard across much of the generative AI industry.
But this training, this unlicensed training on creative work, has serious negative consequences for the people behind that work.
And this is for the simple reason that generative AI competes with its training data.
This is not the narrative that AI companies like to portray.
We like to talk
about democratisation, about letting more people be creative, but the fact that AI competes
with its training data is inescapable. A large language model trained on short stories can
create competing short stories. An AI image model trained on stock images can create competing
stock images. An AI music model trained on music that's licensed to TV shows can create competing music to license to TV shows.
These models, however imperfect, are so quick and
easy to use that this competition is inevitable.
And this isn't just theoretical.
Generative AI is still pretty new, but we're already seeing
exactly the sort of effects you'd expect in a world in which generative AI competes with its training data.
For instance, the well-known filmmaker Ram Gopal Varma
recently said that he'll use AI music in all his projects going forward.
Indeed, there are multiple reports of people starting to listen to AI music
in place of human-produced music, and recently an AI song hit number 48 in the German charts.
In all these cases, AI music is competing with the songs it was trained on.
Or take Kelly McKernan. Kelly is an artist from Nashville.
For 10 years, they made enough money selling their work for art to be their full-time income. But in 2022, a data set that included their works
was used to train a popular AI image model.
Their name was one of many used by huge numbers of people to create art in the style of specific human artists.
Kelly's income fell by 33% almost overnight.
Illustrators around the world report similar stories of being outcompeted by AI models they have reason to believe were trained on their work.
The freelance platform Upwork wrote a white paper
in which they looked at the effects that they've seen
on the job market of Generative AI.
They looked at how job postings on their platform have changed
since the introduction of ChatGPT.
And sure enough, they found exactly what you'd expect: that generative AI has reduced the demand for freelance writing tasks by 8%, a figure that increases to 18% if you look only at what they term lower-value tasks.
So the initial data we have, plus the individual stories we hear, all align with the logical assumption.
Generative AI competes with the work it's trained on.
It's so quick and easy to use, it's inevitable,
and it competes with the people behind that work.
Now, creators argue this training is illegal. The legal framework of
copyright affords creators the exclusive right to authorize copies of their work, and AI
training involves copying. Here in the US, many AI companies argue that training AI falls
under the fair use copyright exception, which allows unlicensed copying in a limited set of circumstances, such as creating parodies of a work. Creators and rights holders strongly disagree, saying there's no way this narrow exception can be used to legitimize the mass exploitation of creative work to create automated competitors to that work. And for the record, I entirely agree.
Of course, this question is as yet untested in the courts, and there are currently around
30 ongoing lawsuits brought by rights holders against AI companies, which will help to address
this question.
But this will take time, and creators are suffering from what they see as unjust competition
right now.
So they propose a solution that has been used before and has worked.
Licensing.
If a commercial entity wants to use copyrighted work,
be it for merchandise manufacturing or building a streaming service,
they license that work.
Now, AI companies have a bunch of reasons why this shouldn't apply to them.
There's the fair use legal exception that I've already mentioned.
There's also the argument that since humans can train on copyrighted work without a license,
AI should be allowed to too.
But this is a very hard claim to justify.
Artists have been learning from each other for centuries.
When you create, you expect other people to learn from you.
You learn from a range of sources, from other art to textbooks to taking lessons, much of which you or someone else paid for, supporting the entire ecosystem.
In generative AI, commercial entities valued at millions or billions of dollars
scrape as much content as they can,
often against creators' will, without payment,
making multiple copies along the way,
which are subject to copyright law,
to create a highly scalable competitor to what they're copying.
So scalable, in fact, that there are AI image generators,
estimated to be making 2.5 million images a day,
and AI song generators, outputting 10 songs a second.
To argue that human learning and AI training are the same
and should be treated the same is preposterous.
AI companies also argue that licensing their training data
would be impractical.
They use so much training data, they say,
that individual payments to each creator behind the data
would be small. But this is true of many content licensing markets. Creators still want to
get paid, even if the payments are small. AI companies also argue that they simply use
too much data for licensing to even be feasible. But this is harder and harder to believe in
a world in which there is such a range of data sets that you can access with permission.
You can license data from media companies.
There have been 27 major deals between AI companies and rights holders in the last year alone, and that's to say nothing of the smaller ones that don't get reported.
There are marketplaces of training data where you can get more data.
You can expand this with data that's in the public domain,
that is, in which no copyright exists,
like the 500 billion word data set Common Corpus.
You can expand this further with synthetic data, that is, data that's itself created by an AI model, in which usually no copyright exists.
So there are multiple options available to you
if you want to build your model without infringing copyright.
But the strongest evidence that it's possible to license all your data is that there are multiple companies doing it already.
I know because I've done it myself. I've worked in what we now call generative AI for over a decade, and last September my team at Stability AI released an AI music model that was trained on licensed music.
A number of other companies have done the same thing, and I founded Fairly Trained in order to highlight this fact and these companies.
Fairly Trained is a non-profit that certifies generative AI companies that don't train on copyrighted work without a license.
We launched in January of this year and we've already certified 18 companies.
Now these companies take a variety of approaches to licensing their training data.
We have an AI voice model that's trained on individual voices it has licensed.
We have an AI music model that has licensed more than 40 music catalogs.
We have a large language model that's trained only on data
in the public domain, mostly from government documents
and records.
We have companies who have paid upfront fees for their data.
We have companies who share their revenue with their data
providers.
There is no one answer to the exact specifics
of how one of these licensing deals has to work.
The beauty of licensing is that the two parties can come together and figure out what works
for them.
And this is happening more and more.
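Since the talk stresses that there is no single template for these deals, here is a purely hypothetical Python sketch of one arrangement it mentions, a revenue share split pro rata among data providers. The function, the usage metric, and the numbers are all invented for illustration.

```python
# Hypothetical sketch of a revenue-share licensing structure: a revenue pool
# is split among data providers in proportion to how much each contributed.
def revenue_share(total_revenue: float, usage: dict[str, float]) -> dict[str, float]:
    """Split a revenue pool among providers pro rata by usage share."""
    total_usage = sum(usage.values())
    return {name: total_revenue * u / total_usage for name, u in usage.items()}

# e.g. minutes of licensed audio each catalog contributed to training
payouts = revenue_share(
    100_000.0, {"catalog_a": 600.0, "catalog_b": 300.0, "catalog_c": 100.0}
)
print(payouts)  # {'catalog_a': 60000.0, 'catalog_b': 30000.0, 'catalog_c': 10000.0}
```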
Now you will hear that a requirement to license training data somehow stifles innovation.
That it's only the big AI companies that can afford these huge upfront licensing fees.
But in reality, it's the smaller startups who are
bothering to license all their data, and they're doing so often without hefty
upfront licensing fees, but using models such as revenue shares.
And there's another major upside to licensing your training data.
All of this training on copyrighted work is forcing publishers to shut off access to their content.
The Data Provenance Initiative looked
at 14,000 websites commonly used in AI training sets.
And they found that, over the course of a single year, looking only at the domains of the highest value for AI training, the share restricted via opt-outs or terms of service increased from 3% to between 20% and 33%.
The web is being gradually closed due to unlicensed training.
Now, this is bad for new AI models and for new entrants to the market, but also for everyone who benefits from an open internet: researchers, consumers, and more.
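To make the opt-out mechanism concrete: one of the main ways publishers restrict crawlers is the robots.txt convention. Below is a minimal Python sketch, not from the talk, of how a crawler could check a site's robots.txt before fetching a page for a training set; the user-agent string "ExampleAIBot" and the URLs are hypothetical.

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt (assumes the host is reachable).
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

url = "https://example.com/articles/some-article"
if parser.can_fetch("ExampleAIBot", url):
    print("Allowed to crawl:", url)
else:
    print("Publisher has opted out; skipping:", url)
```

Honoring these signals is voluntary, which is part of why the study tracks terms-of-service restrictions as well.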
It should come as no surprise that the general public do not agree with AI companies about
what they can train their models on.
One poll from the AI Policy Institute in April asked people about the common policy among
AI companies of training on publicly available data.
This is data that is openly available online, which of course includes a lot of copyrighted work, like news articles and often pirated media.
Sixty percent of people said this should not be allowed,
versus only 19 percent who said it should.
The same poll went on to ask
whether AI companies should compensate data providers.
Seventy-four percent said yes, and only 9 percent said no.
Time and time again, when we ask the public these questions, they show support
for requirements around permission and payment and a rejection of the notion
that something being publicly available somehow makes it fair game.
And the people who make the art that society consumes feel the same way.
Today we launched a statement on AI training, a short, simple open letter which reads: "The unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted."
This has already been signed by 11,000 creators around the world, and counting, including Nobel Prize-winning authors, Academy Award-winning actors and Oscar-winning composers.
And if you agree with this sentiment, I encourage you to sign it today at aitrainingstatement.org.
What this statement and previous ones like it make abundantly clear is that these artists, these creators,
view the unlicensed training on their work by generative AI models
as totally unjust and potentially catastrophic to their professions.
So if you are an advocate for unlicensed AI training,
just remember that the people who wrote the music that you're listening to
and the books you're reading probably disagree.
So where does this leave us?
Well, right now, many of the world's artists, writers,
musicians, creators, straight up hate generative AI.
And we know from their own words that one of the reasons for this
is that we're training on their work without asking them.
But it doesn't have to be this way.
The AI industry and the creative industries can be
and should be mutually beneficial.
But for this mutually beneficial relationship to emerge,
we have to start from a position of respect
for the value of the works being trained on
and the rights of the people who made them.
I'm not arguing that all AI development should be halted. I'm not arguing that AI should not exist.
What I'm arguing is that the resources used to build generative AI
should be paid for.
Licensing is hard work.
It will slow you down in the short term,
but you'll ultimately reach exactly the same point,
models that are just as capable, just as powerful,
and you'll do so without forcing the world's publishers to batten down
the hatches and destroy the commons,
and without pitting the world's creators against you.
So I hope that more AI companies will follow the example set
by those we've certified at Fairly Trained
and license all their training data.
I hope that employees at these companies
will demand this of their employers.
And I hope that everyone who uses generative AI
will ask what their favorite models were trained on.
There is a future in which generative AI and human creativity
can coexist, not just peacefully, but symbiotically.
It's been a rough start, but it's not too late to change course.
Thank you.
If you're curious about TED's curation, find out more at ted.com slash curation guidelines. And that's it for today's show.
TED Talks Daily is part of the TED Audio Collective.
This episode was produced and edited by our team,
Martha Estefanos, Oliver Friedman, Brian Green,
Lucy Little, Alejandra Salazar, and Tansica Sunkamaneevongse.
It was mixed by Christopher Faisy-Bogan, with additional support from Emma Taubner and Daniela Balarezo.
I'm Elise Hu.
I'll be back tomorrow with a fresh idea for your feed.
Thanks for listening.