Planet Money - The alleged theft at the heart of ChatGPT
Episode Date: November 10, 2023

When best-selling thriller writer Douglas Preston began playing around with OpenAI's new chatbot, ChatGPT, he was, at first, impressed. But then he realized how much in-depth knowledge GPT had of the books he had written. When prompted, it supplied detailed plot summaries and descriptions of even minor characters. He was convinced it could only pull that off if it had read his books.

Large language models, the kind of artificial intelligence underlying programs like ChatGPT, do not come into the world fully formed. They first have to be trained on incredibly large amounts of text. Douglas Preston and 16 other authors, including George R.R. Martin, Jodi Picoult, and Jonathan Franzen, were convinced that their novels had been used to train GPT without their permission. So, in September, they sued OpenAI for copyright infringement.

This sort of thing seems to be happening a lot lately: one giant tech company or another "moves fast and breaks things," exploring the edges of what might or might not be allowed without first asking permission. On today's show, we try to make sense of what OpenAI allegedly did by training its AI on massive amounts of copyrighted material. Was that good? Was it bad? Was it legal?

Help support Planet Money and get bonus episodes by subscribing to Planet Money+ in Apple Podcasts or at plus.npr.org/planetmoney.

Learn more about sponsor message choices: podcastchoices.com/adchoices

NPR Privacy Policy
Transcript
Before we start, this episode discusses Google and Spotify, which are both corporate sponsors of NPR.
We also discuss OpenAI.
One of OpenAI's major investors is Microsoft, which is also a corporate sponsor of NPR.
Here's the show.
This is Planet Money from NPR.
Douglas Preston got his big break as a writer when he and his co-author published their first novel, Relic, in 1995.
Relic is about a brain-eating monster loose in a museum, hunting down and killing people and eating part of their brains.
So it's, you know, you will not see my name on the list of Nobel laureates, that's for sure.
No Nobel, maybe.
But the book was a bestseller, the first of many.
And how many books have you written altogether?
I'm not sure.
I think about 40.
Douglas also somehow finds time to write all these articles and books about paleontology and archaeology.
He's got a lot of interests.
He's a curious guy.
And one day, that curiosity led him to start playing around with the tech world's shiny new thing, artificial intelligence. Specifically, OpenAI's chatbot, ChatGPT.
Douglas got himself an account and started seeing what this fancy new AI chatbot could do.
While we talked, he scrolled back through his
history and read me some of his earliest queries. I had it write a paragraph about the execution of Socrates. Please discuss Chopin's Piano Concerto No. 1. Discuss the transcendental number e.
Okay, so it appeared to know some math and some history and some music. And it didn't take long
for Douglas to wonder, does it know me?
Specifically, did ChatGPT know anything about the books he had written? So he starts testing it.
Are you familiar with a character called Whittlesey in the novel Relic? Yes, Dr. Whittlesey is one of
the characters in the prologue of the book. He's part of the expedition team that travels to the
Amazon rainforest and makes
a significant discovery which sets the stage for the events that unfold in the story.
Is that answer correct?
Yes.
And Douglas was like, how does it know all that stuff?
The Wikipedia entry on Relic doesn't have this kind of detail.
And Relic was reviewed, but the reviews were never fine-grained like that. The only way
I would know that is if it had ingested the book. Douglas kept going. He asked about other books he
had written. ChatGPT knew that his character, Agent Aloysius Pendergast, had platinum hair,
and how Corey Swanson was a headstrong forensics expert. It was regurgitating everything.
It knew my characters.
It knew their names.
It knew the settings.
It knew everything.
So yeah, it certainly seemed like ChatGPT had access to his full books.
Maybe legitimate digital copies.
Maybe pirated PDFs floating around the internet.
Who knows?
But either way, Douglas owned the copyright to all of his books.
And no one from OpenAI had asked him whether they could use them.
Which raised the question, can they do that?
Hello and welcome to Planet Money.
I'm Keith Romer.
And I'm Erika Beras.
What happened to Douglas Preston feels a little like a thing that keeps happening to all of us.
One giant tech company or another swoops in and just does a bunch of stuff without our permission,
like keeping track of the websites we visit.
Google, I see you.
Or showing up to a city and setting up a new unregulated kind of taxi service,
even though the city says you can't do that.
Hi, Uber.
It's like the famous Mark Zuckerberg line, move fast and break things. Tech companies have been doing a lot of that.
And the latest example is OpenAI and all those other new AI companies,
hoovering up every last piece of human creativity to build their incredibly powerful computer
programs. Today on the show, we try to get our heads around what OpenAI is up to.
Is it good? Is it bad? Is it legal?
And we'll look back at these two formative legal cases that are super fascinating on their own,
but also offer us a glimpse of how things with OpenAI might turn out.
Okay, so we should maybe start by talking a little bit about how ChatGPT works.
It's an interface built on a kind of artificial intelligence called a large language model.
And what that AI does is essentially predict what the next word in a sentence will be,
like autocomplete, but on the grandest scale you can imagine. And to train the AI to do that, computer programmers have to feed it just massive, massive amounts of coherent writing.
The technology is only possible because of all that text that it gobbles up.
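To make "autocomplete on the grandest scale" concrete, here is a toy sketch of next-word prediction. To be clear, this is heavily simplified: real large language models use neural networks trained on billions of pages of text, not word counts, and the little training text below is made up purely for illustration.

```python
from collections import Counter, defaultdict

# Toy illustration (not OpenAI's actual method) of the core idea behind a
# large language model: given the words so far, predict the likely next word.
# The "training data" here is a few invented sentences, not real books.
training_text = (
    "the monster ate the brain . "
    "the monster hunted the museum . "
    "the monster ate the scientist ."
).split()

# Count how often each word follows each other word in the training text.
next_word_counts = defaultdict(Counter)
for current, following in zip(training_text, training_text[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the training text."""
    return next_word_counts[word].most_common(1)[0][0]

# "ate" follows "monster" twice in the toy text, "hunted" only once.
print(predict_next("monster"))  # ate
```

The point of the sketch is the one the episode makes: the predictions are only as good as the text the model has gobbled up, which is why training data matters so much.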
What the author Douglas Preston suspected was that a lot of that text came from copyrighted material.
His books and other people's books.
I'll never forget a conversation I had with my friend George R.R. Martin.
And he was really upset.
He said, somebody used ChatGPT to write the final book in my Game of Thrones series.
It's my characters, my settings,
even my voice as an author,
they somehow were able to duplicate using that program.
Douglas and George RR,
they got together with 15 other authors
and decided to sue OpenAI.
Their lawsuit is a class action.
They're suing on behalf of themselves
and any other professional fiction
writers whose work may have been eaten up to create ChatGPT. What evidence do we have
that OpenAI was using copyrighted books in its training sets? Right, that is a really good
question. That is Mary Rasenberger. She is a copyright lawyer and the CEO of the Authors Guild.
Douglas and George R.R. and the other authors wound up partnering with the Authors Guild for their lawsuit against OpenAI. They alleged
copyright infringements on an industrial scale. So we do not know because OpenAI, even though
they say they're open, they're quite the contrary. They are about as closed as can be in terms of what their training data sets are.
Is Jonathan Franzen's book The Corrections in the training data?
What about My Sister's Keeper by Jodi Picoult or Lincoln in the Bardo by George Saunders?
Those authors, by the way, are all plaintiffs on this lawsuit.
To start building their case, the authors and their lawyers went looking for concrete evidence.
And if the humans at OpenAI wouldn't disclose their training data,
maybe there was a way to trick OpenAI's computer program into giving up its sources.
Some of the lawyers working with the Authors Guild got to work trying to coax ChatGPT into revealing what it knows.
They asked it questions to see how much specific
information it can offer up about any particular book. And of course, when you could get it to
give you back exact text, clearly it had memorized the book. That seems like a strong sign if it can
give you the actual chapter. Yes, yes, yes. Fair. Because of the court case, Mary was a little cagey
about giving exact details here, but other researchers have managed to get ChatGPT to spit up an entire Dr. Seuss book, full chapters of Harry Potter.
Still, to really make the case that OpenAI had, in fact, used all these thousands and thousands of books to train its AI, what the authors really needed was access to the company's records, which Mary says was another reason to sue.
In a lawsuit, you get discovery and presumably will find out what the training data set is
and what was ingested. So, okay, this lawsuit was just filed in September. And so this is kind of
where the authors' story pauses for now, because it could take literally years for this to work out.
But like we said before, there are precedents for what happens when a giant tech company snatches up heaps of copyrighted stuff.
Two cases really stand out here. Case number one, the time Google decided to scan, like, all of the books and put them on the internet. And case number two,
the time Spotify decided to go ahead and put, you know, all of the songs on the internet.
Okay, let's start with the first case, the one about Google and the books. In some ways,
this is kind of the law's first big brush with the problem of how much copyrighted material
a tech company can scoop up. It is a case that Mary from the Authors Guild
remembers well. So the Google Books case was filed in 2005. Google wanted to create what
some people refer to as a digital library of Alexandria. Yeah, they made all of these deals
with big university libraries around the country that let them come in and add all these books to their giant
searchable Google databases. They had ingested, copied millions of books. They literally were
just taking truckloads of books out of libraries and scanning them. Google had permission from the
libraries to scan the books, but they did not ask permission from the authors. And around 80% of those books
were still protected by copyright. So authors and publishers sued. Now, everyone agreed Google had
copied lots of copyrighted material without permission from the authors. But copyright,
it's not absolute. There are some exceptions. Yeah, copyright law is trying to balance these two interests.
On the one hand, a desire for authors to be allowed to make money from what they've created.
But on the other hand, a desire for the rest of society to sometimes be allowed to borrow and remix and play around with the work of those authors.
The fancy legal name for the kind of copying that the law says is okay is fair use. So the traditional fair uses are things like quoting, quoting from another book
in your book or from a speech commentary. So when you do a critique of a play or a book,
you're going to include perhaps some of the text from it. Copying a song to write a parody, like Weird Al Yankovic style, is also usually fine.
Same with photocopying a couple pages of a novel to teach in an English class.
But what about what Google was doing? Scanning millions and millions of books to create a
searchable database? No one had ever seen anything like that before.
Now, there is no hard and fast rule for what counts as fair use and what doesn't. There are
these four different factors that a judge is supposed to look at to decide whether a certain
act of copying is permissible. Yeah. Is someone going to make money off of it, or are they just
doing it for the sake of doing it? Will it hurt the market for the original work? Is it a big important chunk that is copied or a small one?
And is the thing that was copied transformed somehow into something new?
I will say that the test can be somewhat subjective.
And, you know, the great minds can come out differently sometimes on fair use.
The great mind in the Google case, a judge named Pierre Leval, he weighed all those fair use factors and decided that all that copying Google had done was fair use. It all came down to
the end product Google had created, a giant database of books that people could search
directly that would give them back relevant chunks of these books. It was a way for people
all around the world to access books that
otherwise might have just gathered dust in the basement of a big library somewhere.
Judge Leval thought that was valuable enough to society that it made all that copying legally okay.
And it's worth pointing out this kind of weird thing about copyright here. Fair use is not this
cut and dry thing. So when a company like Google wants to
play around at the edges of copyright, it has to just dive in without knowing for sure whether or
not the thing they're doing will turn out to be legal. You can't always predict the outcome,
let me say it that way. It is wild that these companies are in some ways incentivized to
take a risk of some amount and see if it works out,
because the courts will decide one way or another eventually.
Well, that's what the tech companies like to do. You know, they like to ask permission later.
Just do, don't tell anyone what you're doing, and then just see what happens.
And just to bring this back to where we started this episode,
that is certainly what it appears OpenAI has done with ChatGPT.
By the way, we reached out to OpenAI.
They declined to comment.
But in court filings, they've made it pretty clear
that they think what they did to train their AI, that was fair use.
Right. Getting a Google Books-type ruling would be a great outcome for OpenAI.
Mary, who is part of the suit against OpenAI, she does not see it that way.
This case is very different than that case because here the harm is so visible.
It's so clear that the marketplace for creators' works will be harmed by generative AI. Which, remember, that is one of
the four factors a judge is supposed to look at in a fair use case. How much will the owners of
the copyright be financially hurt by the copying of their books? It's the commercial use of the works to develop these machines that will spit out very quickly massive quantities of text
that will compete with what they were trained on. That's the issue here. All the Dan Brown novels I
could ever want. Yeah. So, okay. If some judge decides that it is fair use for OpenAI to train ChatGPT on copyrighted material, then like Google Books, that's it.
Sorry, authors.
But what about the other end of the spectrum?
What if a judge says all that copying was against the law?
Thousands of authors with dozens and dozens of books, and each one is a copyright
violation. After the break, we do the math on how much that could cost open AI. And look at the most
likely scenario for how all this plays out. So in the last few years, tech companies have been
basically vacuuming up all of human knowledge and culture to train their AIs.
And lately, some of the creators of all that human knowledge and culture have started pushing back.
Yeah. In addition to the author's lawsuit against OpenAI, there are at least eight other lawsuits brought by songwriters and visual artists and other authors against a bunch of AI companies, all alleging copyright infringement.
And like we talked about before, it's possible that legally all of this is fine,
that some court may decide this is fair use. But it's also possible that they won't.
So in that world, the judge tells OpenAI, your AI is illegal. Shut it down. Well, the thing is, it's not like OpenAI can simply remove the selected works of Douglas Preston and George R.R. Martin from their AI's brain.
The company would have to basically start from scratch and completely retrain their AI.
And then there's the money. So let's run a little back-of-the-envelope math here.
The statutory damages for a single act of copyright infringement can reach as high as $150,000 per work.
Figure, you know, 10,000 authors, 10 books per author.
You know what that multiplies out to?
$15 billion.
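That multiplication checks out. As a sketch, with the episode's own ballpark guesses plugged in (the author and book counts are the hosts' rough figures, not real data):

```python
# Back-of-the-envelope damages math from the episode. The $150,000 figure is
# the statutory maximum per infringed work; the author and book counts are
# the hosts' ballpark guesses, not actual numbers.
max_statutory_damages = 150_000   # dollars per infringed work
authors = 10_000                  # rough guess
books_per_author = 10             # rough guess

total = max_statutory_damages * authors * books_per_author
print(f"${total:,}")  # $15,000,000,000
```

Fifteen billion dollars, as the episode says, if every work drew the maximum award.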
However, it is very unlikely that that will happen, which we will show you through
case number two. Yeah, the time Spotify decided to stream all the songs. This one shows how
sometimes a gigantic lawsuit can actually be a good thing for the tech company getting sued.
To help explain this one, we reached out to UCLA law professor Xiyin Tang. I guess I would say
that I wanted to be a copyright lawyer from the time I was 16, which sounds really weird.
Let's say unusual. We don't have to say weird. Yes, it's very unusual. Before she was a professor,
Xiyin worked for a few big law firms. I worked on a Red Bull class action where the claim was like,
you know, Red Bull gives you wings, but it actually doesn't give you wings. There's no more caffeine in it than a cup of coffee. Or like, you know, I bought
this anti-aging product because I thought it would turn back time. And I, you know, I'm 40,
but I thought I would look 18 and I don't. And now I'm suing for it on behalf of myself in a class.
And Xiyin was one of the lawyers on Spotify's defense team during the big case we're going to talk about.
Right. Spotify had been streaming millions and millions of songs, but they hadn't gotten licenses for all of those songs.
The two main plaintiffs in the lawsuits that then eventually got consolidated into one lawsuit, one was filed by a songwriter named Melissa Ferrick, and another was filed by a
songwriter named David Lowery. He was in a band, a couple of bands that, you know, I think a lot
of people are familiar with. One was Camper Van Beethoven. One was Cracker.
Erika, are you more of a Low fan or more of a What the World Needs Now fan?
I actually don't know either of these bands.
No, no Cracker songs? All right. I'll stay over here on Gen X Island by myself.
I'm Gen Y.
I'm the secret, secret generation that lasted one year after Gen X.
Okay, in any event, the lawsuit basically came down to this.
90% of the songs that Spotify wanted to stream in the U.S.
were managed by a handful of big companies.
And Spotify had signed licensing deals with those companies.
But that left this last 10% of songs that Spotify also wanted to stream.
Spotify hired an outside company to get deals with the copyright holders for those songs.
But someone somewhere along the line dropped the ball.
And even though they didn't end up getting licenses for all those songs,
Spotify went ahead and streamed them anyway.
And so Spotify tried and wanted to do everything right by the books. But the reality is that it's
the music publishers themselves that have really bad data that makes it like near impossible for
someone to figure out who to pay. But that feels like an argument that I would be sympathetic
hearing from my nine-year-old daughter
in terms of, like, I tried to do the right thing, but I couldn't.
But legally, would that hold any water in terms of,
it's not our fault, we couldn't do it, we tried?
So, you know, I think there's a couple parts to your question.
One is, legally, would it hold water?
No. I mean, legally it wouldn't hold
water. Do they have a point? I think they did. And this is where the Spotify case gets really
interesting because Xiyin says getting sued by those two songwriters was kind of fantastic news
for Spotify. I'm definitely not speaking for Spotify here when I say it's almost a blessing, but it does almost feel like a relief to be able to say, oh, now we have this class that's
established with all these people in it. Let's pay some amount of money that's not going to
bankrupt the business and allow us to say, hey, we're actually paying all these people now,
whereas the allegation was that we weren't before, and we can keep operating.
So, I mean, it sounds a little like Spotify's essential problem was not having an opposite side to negotiate with.
And the class action essentially gave them somebody to negotiate with.
Yes.
It's like, you know, yeah, exactly.
We didn't know who to even go out to and talk to about this.
And now these people are popping up out of the woodwork and saying, hey, it's me.
And, you know, I'm thinking about Taylor Swift.
I'm the problem.
It's me.
My daughter listens to Taylor Swift 24 hours a day.
And I was you said those words and I was like, yep, that song's in my head now.
Right.
Yeah, I'm the problem.
Actually, you know, I'm the legal problem.
Negotiate with me.
I mean, put yourself in Spotify's shoes.
There's this 10% of songs that they wanted to license,
but tracking down every indie artist and every indie indie artist
and unspooling the knot of publishing rights, it wasn't going to happen.
And then, one day, these two musicians show up and say,
we represent that entire 10 percent.
Like that is kind of great for Spotify.
In the end, the class action didn't go to trial.
The company and the folks who had songs in that tricky 10 percent ended up reaching a deal.
Spotify agreed to pay them for all its past copyright infringements and set up a system to pay for streaming royalties going forward. And you know, if we were looking for examples
of how the class action by the authors
against OpenAI might play out,
there's a really good chance this is it.
No giant dramatic trial,
just two sides working out a deal.
Xiyin has looked pretty deeply
into the history of these kinds of cases.
I did a study where I looked at
every single class action that was filed from basically the advent of the class action mechanism, you know, a century ago, up to the point where the article came out, which was, I think, last year.
In over 100 copyright class actions, only one ever went all the way to a full trial.
And yet they keep being filed.
And they keep being filed.
And, you know, that's why I say it's almost like it's an invitation to settlement, I think.
Uh-huh.
So essentially we have this whole legal theater, which is just the beginning of a negotiation.
Yes, correct.
So if you think about the authors' lawsuit from OpenAI's perspective, maybe the lawsuit isn't the worst thing.
The company has used all of this copyrighted material, allegedly, hundreds of thousands of books.
There is no good way to unfeed all of those books to their AI.
But also, it would be a huge pain to track down every single author and work out a licensing deal for those books. So maybe this lawsuit will let them do it all in one fell swoop
by negotiating with this handy group of thousands of authors who have collectively sued them.
This episode was produced by Willa Rubin and Sam Yellow Horse Kessler.
It was edited by Kenny Malone and fact-checked by Sierra Juarez.
Engineering by Robert Rodriguez.
Alex Goldmark is our executive producer.
Coming up next week on Planet Money, China's economy is on the brink of a crisis.
And we're going to figure out how they got there. Quick hint, it's real estate. You know, I was in that game. So if, you know,
you're not taking a maximum risk to expand your business empire, next, you know, next year,
you look at your peers and say, like, damn, you know, I only built 10,000 apartments. They already
are selling 15. I'm behind.
That's next week on Planet Money from NPR.
Special thanks today to Danielle Gervais,
Dawa Keeler,
and Douglas Preston's co-author, Lincoln Child.
I'm Keith Romer.
And I'm Erika Beras.
This is NPR.
Thanks for listening.
And a special thanks to our funder, the Alfred P. Sloan Foundation, for helping to support this podcast.