The Daily - A.I.’s Original Sin

Episode Date: April 16, 2024

A Times investigation shows how the country's biggest technology companies, as they raced to build powerful new artificial intelligence systems, bent and broke the rules from the start. Cade Metz, a technology reporter for The Times, explains what he uncovered.

Guest: Cade Metz, a technology reporter for The New York Times.

Background reading:
How tech giants cut corners to harvest data for A.I.
What to know about tech companies using A.I. to teach their own A.I.

For more information on today's episode, visit nytimes.com/thedaily. Transcripts of each episode will be made available by the next workday.

Transcript
Starting point is 00:00:01 From The New York Times, I'm Michael Barbaro. This is The Daily. Today, a Times investigation shows how as the country's biggest technology companies raced to build powerful new artificial intelligence systems, they bent and broke the rules from the start. My colleague Cade Metz on what he uncovered. It's Tuesday, April 16th. Cade, when we think about all the artificial intelligence products released over the past couple of years, including, of course, these chatbots we've talked a lot about on the show, we so frequently talk about their future, their future capabilities, their influence on
Starting point is 00:01:11 society, jobs, our lives. But you recently decided to go back in time to AI's past, to its origins, to understand the decisions that were made basically at the birth of this technology. So why did you decide to do that? Because if you're thinking about the future of these chatbots, that is defined by their past. The thing you have to realize is that these chatbots learn their skills by analyzing enormous amounts of digital data. So what my colleagues and I wanted to do with our investigation was really focus on that effort to gather more data. We wanted to look at the type of data these companies were collecting, how they were gathering it, and how they were feeding it into their systems. And when you all undertake this line of reporting,
Starting point is 00:02:11 what do you end up finding? We found that three major players in this race, OpenAI, Google, and Meta, as they were locked into this competition to develop better and better artificial intelligence, they were willing to do almost anything to get their hands on this data, including ignoring and in some cases violating corporate rules and wading into a legal gray area as they gathered this data. Basically, cutting corners. Cutting corners left and right. Okay, let's start with OpenAI, the flashiest player of all. The most interesting thing we found is that in late 2021, as OpenAI, the startup in San Francisco that built ChatGPT, as they were pulling together the fundamental technology that would power that chatbot, they ran out of data, essentially.
Starting point is 00:03:14 They had used just about all the respectable English language text on the internet to build this system. And just let that sink in for a bit. I mean, I'm trying to let that sink in. They basically, like a Pac-Man on an old game, just consumed almost all the English words on the internet, which is kind of unfathomable. Wikipedia articles by the thousands, news articles, Reddit threads, digital books by the millions. We're talking about hundreds of billions, even trillions of words. Wow. So by the end of 2021, OpenAI had no more English language text that they could feed into these systems.
Starting point is 00:04:04 But their ambitions are such that they wanted even more. So here, we should remember that if you're gathering up all the English language text on the internet, a large portion of that is going to be copyrighted. Right. So if you're one of these companies gathering data at that scale, you are absolutely gathering copyrighted data as well. Which suggests that from the very beginning, these companies, a company like OpenAI with ChatGPT,
Starting point is 00:04:41 is starting to break, bend the rules. Yes, they are determined to build this technology. Thus, they are willing to venture into what is a legal gray area. So given that, what does OpenAI do once it, as you had said, runs out of English language words to mop up and feed into this system? So they get together and they say, all right, so what are our other options here? And they say, well, what about all the audio and video on the internet? We could transcribe all the audio and video, turn it into text, and feed that into their
Starting point is 00:05:22 systems. Interesting. Hmm. So OpenAI created a speech recognition technology called Whisper that could transcribe audio into text with high accuracy. And then they gathered up all sorts of audio files from across the internet, including audio books, podcasts, and most importantly, YouTube videos. Of which there's a seemingly endless supply, right? Fair to say maybe tens of millions of videos. According to my reporting, we're talking about at least a million hours of YouTube videos that were scraped off of that video sharing site, fed into this speech recognition system in order to produce new text for training OpenAI's chatbot.
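For a concrete sense of that kind of pipeline, here is a minimal sketch in Python using the open-source openai-whisper package, a speech recognition model OpenAI released publicly. The audio file name is hypothetical, and this is only an illustration of the general transcription technique, not the company's internal system.

    # Minimal sketch: turn one audio file into text for a training corpus.
    # Requires: pip install openai-whisper (ffmpeg must also be installed).
    import whisper

    model = whisper.load_model("base")              # load a pretrained speech model
    result = model.transcribe("episode_audio.mp3")  # hypothetical local audio file
    print(result["text"])                           # the transcribed text

Scaled across many hours of audio, the output of a loop like this becomes a large new body of text for training a language model.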
Starting point is 00:06:56 And YouTube's terms of service do not allow a company like OpenAI to do this. YouTube, which is owned by Google, explicitly says you are not allowed to, in internet parlance, scrape videos en masse from across YouTube and use those videos to build a new application. That is exactly what OpenAI did. According to my reporting, employees at the company knew that it broke YouTube's terms of service, but they resolved to do it anyway. So, Cade, this makes me want to understand what's going on over at Google, which, as we have talked about in the past on the show, is itself thinking about and developing its own artificial intelligence model and product. Well, as OpenAI scrapes up all these YouTube videos and starts to use them to build its chatbot, some employees at Google, at the very least, are aware that this is happening, according to my reporting. They are? Yes.
Starting point is 00:07:35 Now, when we went to the company about this, a Google spokesman said it did not know that OpenAI was scraping YouTube content and said the company takes legal action over this kind of thing when there's a clear reason to do so. But according to my reporting, at least some Google employees turned a blind eye to OpenAI's activities because Google was also using YouTube content to train its AI. Wow. So if they raise a stink about what OpenAI is doing, they end up shining a spotlight on themselves, and they don't want to do that. I guess I want to understand what Google's relationship is to YouTube,
Starting point is 00:08:17 because, of course, Google owns YouTube. So what is it allowed or not allowed to do when it comes to feeding YouTube data into Google's AI models? It's an important distinction. Because Google owns YouTube, it defines what can be done with that data. And Google argues that it has a right to that data, that its terms of service allow it to use that data. However, because of that copyright issue, because the copyrights to those videos belong to you and me, lawyers I've spoken to say people could take Google to court and try to determine
Starting point is 00:09:01 whether or not those terms of service really allow Google to do this. There's another legal gray area here where, although Google argues that it's okay, others may argue it's not. Of course, what makes this all so interesting is you essentially have one tech company, Google, keeping another tech company's, OpenAI's, dirty little secret about basically stealing from YouTube, because it doesn't want people to know that it too is taking from YouTube. And so these companies are essentially enabling each other as they simultaneously seem to be bending or breaking the rules. What this shows is that there is this belief, and it has been there for years within these
Starting point is 00:09:48 companies among their researchers, that they have a right to this data because they're on a larger mission to build a technology that they believe will transform the world. And if you really want to understand this attitude, you can look at our reporting from inside Meta. And so what does Meta end up doing, according to your reporting? Well, like Google and other companies, Meta had to scramble to build artificial intelligence that could compete with OpenAI. Mark Zuckerberg is calling engineers and executives at all hours, pushing them to acquire this data that is needed to improve the chatbot.
Starting point is 00:10:47 And at one point, my colleagues and I got hold of recordings of these Meta executives and engineers discussing this problem, how they could get their hands on more data, where they should try to find it. And they explored all sorts of options. They talked about licensing books one by one at $10 a pop and feeding those into the model. They even discussed acquiring the book publisher Simon & Schuster and feeding its entire library into their AI model. But ultimately, they decided all that was just too cumbersome, too time-consuming. And on the recordings of these meetings, you can hear executives talk about how they were willing to run roughshod over copyright law and ignore the legal concerns and go ahead and scrape the internet and feed this stuff into their models. They acknowledged that they might be sued over this, but they talked about how OpenAI had done this before them,
Starting point is 00:11:46 that they, Meta, were just following what they saw as a market precedent. Interesting. So they go from having conversations like, should we buy a publisher that has tons of copyrighted material, suggesting that they're very conscious of the kind of legal terrain
Starting point is 00:12:02 and what's right and what's wrong, and instead say, nah, let's just follow the OpenAI model, that blueprint, and just do what we want to do, do what we think we have a right to do, which is to kind of just gobble up all this material across the internet. It's a snapshot of that Silicon Valley attitude that we talked about. Because they believe they are building this transformative technology, because they are in this intensely competitive situation where money and power is at stake, they are willing to go there.
Starting point is 00:12:43 But what that means is that there is, at the birth of this technology, a kind of original sin that can't really be erased. It can't be erased, and people are beginning to notice, and they are beginning to sue these companies over it. These companies have to have this copyrighted data to build their systems. It is fundamental to their creation. If a lawsuit bars them from using that copyrighted data, that could bring down this technology. We'll be right back. So Cade, walk us through these lawsuits that are being filed against these AI companies based on the decisions they made early on to gather data as they did, and the chances that it could result in these companies
Starting point is 00:13:50 not being able to get the data they so desperately say they need. These suits are coming from a wide range of places. They're coming from computer programmers who are concerned that their computer programs have been fed into these systems. They're coming from book authors who have seen their books being used. They're coming from publishing companies. They're coming from news corporations like the New York Times, incidentally, which has filed a lawsuit against OpenAI and Microsoft,
Starting point is 00:14:45 news organizations that are concerned over their news articles being used to build these systems. And we should say that you and this show are entirely separate from that lawsuit. That lawsuit was filed by the business side of the New York Times, by people who are not involved in your reporting or in this Daily episode, just to get that out of the way. Exactly. I'm assuming that you have spoken to many lawyers about this, and I wonder if there's some insight
Starting point is 00:15:00 that you can shed on the basic legal terrain. I mean, do the companies seem to have a strong case that they have a right to this information, or do companies like the Times who are suing them seem to have a pretty strong case that, no, that decision violates their copyrighted materials? Like so many legal questions, this is incredibly complicated. It comes down to what's called fair use, which is a part of copyright law that determines whether companies can use copyrighted data to build new things. And there are many factors that go into this. There are good arguments on the OpenAI side. There are good arguments on the
Starting point is 00:15:45 New York Times side. Copyright law says that you can't take my work and reproduce it and sell it to someone. That's not allowed. But what's called fair use does allow companies and individuals to use copyrighted works in part. They can take snippets of it. They can take the copyrighted works and transform them into something new. That is what OpenAI and others are arguing they're doing. But there are other things to consider. Does that transformative work compete with the individuals and companies that supply the data, that own the copyrights? Interesting.
Starting point is 00:16:34 And here, the suit between The New York Times Company and OpenAI is illustrative. If the New York Times creates articles that are then used to build a chatbot, does that chatbot end up competing with the New York Times? Do people end up going to that chatbot for their information rather than going to the Times website and actually reading the article? That is one of the questions that will end up deciding this case and cases like it. So what would it mean for these AI companies, for some or even all of these lawsuits to succeed? Well, if these tech companies are required to license the copyrighted data that goes into their systems, if they're required to pay for it, that becomes a problem for these
Starting point is 00:17:35 companies. We're talking about digital data the size of the entire internet. Licensing all that copyrighted data is not necessarily feasible. We quote the venture capital firm Andreessen Horowitz in our story, where one of their lawyers says that it does not work for these companies to license that data. It's too expensive. It's on too large a scale. It would essentially make this technology economically impractical. Exactly. So a jury or a judge ruling against OpenAI could fundamentally change the way
Starting point is 00:18:21 this technology is built. The extreme case is these companies are no longer allowed to use copyrighted material in building these chatbots, and that means they have to start from scratch. They have to rebuild everything they've built. So this is something that not only imperils what they have today, it imperils what they want to build in the future. And conversely, what happens if the courts rule in favor of these companies and say, you know what, this is fair use.
Starting point is 00:18:55 You were fine to have scraped this material and to keep borrowing this material into the future, free of charge? Well, one significant roadblock drops for these companies, and they can continue to gather up all that extra data, including images and sounds and videos, and build increasingly powerful systems. But the thing is, even if they can access as much copyrighted material as they want, these companies may still run into a problem. Pretty soon, they're going to run out of digital data on the internet.
Starting point is 00:19:36 That human-created data they rely on is going to dry up. They're using up this data faster than humans create it. One research organization estimates that by 2026, these companies will run out of viable data on the internet. Wow. Well, in that case, what would these tech companies do? I mean, where are they going to go if they've already scraped YouTube,
Starting point is 00:20:04 if they've already scraped podcasts, if they've already gobbled up the internet, and that altogether is not sufficient? What many people inside these companies will tell you, including Sam Altman, the chief executive of OpenAI, they'll tell you that what they will turn to is what's called synthetic data. And what is that? That is data generated by an AI model that is then used to build a better AI model. It's AI helping to build better AI. That is the vision ultimately they have for the future, that they won't need all this human-generated text. They'll just have the AI
Starting point is 00:20:55 build the text that will feed future versions of AI. So they will feed the AI systems the material that the AI systems themselves create. But is that really a workable, solid plan? Is that considered high-quality data? Is that good enough? If you do this on a large scale, you quickly run into problems. As we all know, as we've discussed on this podcast, these systems make mistakes. They hallucinate. They make stuff up.
Starting point is 00:21:35 They show biases that they've learned from internet data. And if you start using the data generated by the AI to build new AI, those mistakes start to reinforce themselves. The systems start to get trapped in these cul-de-sacs where they end up not getting better, but getting worse. What you're really saying is these AI machines need the unique perfection of the human creative mind. Well, what these companies will argue is this, that if you have an AI system that is sufficiently powerful, if you make a copy of it, if you have two of these AI models, one can produce new data and the other one can judge that data. It can curate that data as a human would. It can provide the human judgment, so to speak. So as one model produces the data, the other one can judge it, discard the bad data, and keep the good data. And that's how they ultimately see these systems
Starting point is 00:22:53 creating viable synthetic data. But that has not happened yet, and it's unclear whether it will work.
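To make that generate-and-judge idea concrete, here is a minimal Python sketch. The generator and judge interfaces are hypothetical stand-ins, not any company's actual system; it only illustrates the curation loop described above.

    # Minimal sketch of a synthetic-data loop: one model writes, another filters.
    # `generator` and `judge` are hypothetical objects with the methods shown.
    def build_synthetic_corpus(generator, judge, prompts, threshold=0.8):
        corpus = []
        for prompt in prompts:
            candidate = generator.generate(prompt)  # model 1 produces new data
            score = judge.score(candidate)          # model 2 judges its quality
            if score >= threshold:                  # keep the good data,
                corpus.append(candidate)            # discard the rest
        return corpus                               # text for training the next model

The open question, as Metz notes, is whether the judge model can really stand in for human judgment; if it shares the generator's blind spots, the loop reinforces mistakes instead of filtering them out.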
Starting point is 00:23:38 It feels like the real lesson of your investigation is that if you have to allegedly steal data to feed your AI model and make it economically feasible, then maybe you have a pretty broken model. And that if you need to create fake data as a result, which, as you just said, kind of undermines AI's goal of mimicking human thinking and language, then maybe you really have a broken model. And so that makes me wonder if the folks you talk to, the companies that we're focused on here, ever ask themselves the question, could we do this differently? Could we create an AI model that just needs a lot less data? They have thought about other models for decades. The thing to realize here is that that is much easier said than done. We're talking about creating systems that can mimic the human brain. That is an incredibly ambitious task. And after struggling with that for decades, these companies have finally stumbled on something that they feel works, that is a path to that incredibly ambitious goal. And they're going to continue to push in that direction.
Starting point is 00:24:22 Yes, they're exploring other options, but those other options aren't working. What works is more data and more data and more data. And because they see a path there, they're going to continue down that path. And if there are roadblocks there and they think they can knock them down, they're going to knock them down. But what if the tech companies never get enough or make enough data to get where they think they want to go, even as they're knocking down walls along the way? That does seem like a real possibility. If these companies can't get their hands on more data, then these technologies, as they're built today, stop improving. We will see their limitations. We will see how difficult it really is to build a system that can match, let alone surpass, the human brain.
Starting point is 00:25:18 These companies will be forced to look for other options technically, and we will see the limitations of these grandiose visions that they have for the future of artificial intelligence. Okay. Thank you very much. We appreciate that. Glad to be here. We'll be right back. Here's what else you need to know today. Israeli leaders spent Monday debating whether and how to retaliate against Iran's
Starting point is 00:26:08 missile and drone attack over the weekend. Herzi Halevi, Israel's military chief of staff, declared that the attack will be responded to. In Washington, a spokesman for the U.S. State Department, Matthew Miller, reiterated
Starting point is 00:26:24 American calls for restraint. Of course, we continue to make clear to everyone that we talk to that we want to see de-escalation, that we don't want to see a wider regional war. But he emphasized that a final call about retaliation was up to Israel. Israel is a sovereign country. They have to make their own decisions about how best to defend themselves. And the first criminal trial of a former U.S. president officially got underway on Monday in a Manhattan courtroom. Donald Trump, on trial for allegedly falsifying documents
Starting point is 00:27:00 to cover up a sex scandal involving a porn star, watched as jury selection began. The initial pool of 96 jurors quickly dwindled. More than half of them were dismissed after indicating that they did not believe that they could be impartial. The day ended without a single juror being chosen. Today's episode was produced by Stella Tan, Michael Simon Johnson, Mooj Zadie, and Rikki
Starting point is 00:27:32 Novetsky. It was edited by Marc Georges and Liz O. Baylen, contains original music by Diane Wong, Dan Powell, and Pat McCusker, and was engineered by Chris Wood. Our theme music is by Jim Brunberg and Ben Landsverk of Wonderly. That's it for The Daily. I'm Michael Barbaro. See you tomorrow.
