Endless Thread - Introducing On Point: The Internet Archive is in danger

Episode Date: February 21, 2025

We’re coming to you with a special offering today. It’s an episode about the internet… from our friends just a few cubicles over here at WBUR: On Point. Hosted by Meghna Chakrabarti, On Point is... a unique, curiosity-driven combination of original reporting, newsmaker interviews, first-person stories, and in-depth analysis, making the world more intelligible and humane. When the world is more complicated than ever, we aim to make sense of it together. We loved their recent episode about one of our favorite pieces of how the internet gets recorded and remembered, and we thought you might love it too. So kick back and take a listen. We’ll bring you the usual shenanigans next week.

More than 900 billion webpages are preserved on The Wayback Machine, a history of humanity online. Now, copyright lawsuits could wipe it out.

Guests

Brewster Kahle, founder and director of the Internet Archive. Digital librarian and computer engineer.

James Grimmelmann, professor of digital and information law at Cornell Tech and Cornell Law School. Studies how laws regulating software affect freedom, wealth, and power.

Transcript
Starting point is 00:00:00 Support for Endless Thread comes from MathWorks, creator of MATLAB and Simulink software, to design and develop engineered systems, accelerating the pace of discovery in engineering and science. Learn more at MathWorks.com. Support for WBUR comes from Is Business Broken?, a podcast from the Mehrotra Institute at Boston University that explores questions like, why is innovation in healthcare so hard? Is ESG just greenwashing? And, of course, is business broken? Listen wherever you get your podcasts.
Starting point is 00:00:35 What's up, Endless Threaders? It's your buddy, Ben Brock Johnson. We're coming to you with a special offering today. It is an episode about the internet from our friends just a few cubicles over here at WBUR, On Point. We listened to their recent episode about one of our favorite pieces of how the internet gets recorded and remembered.
Starting point is 00:00:57 And we thought you guys might love it. So kick back, take a listen, and we'll bring you the usual shenanigans next week. Also, if you like it, do yourself a favor and subscribe to On Point. Here it is. This is On Point. I'm Meghna Chakrabarti. Republican Congresswoman Elise Stefanik represents New York's 21st Congressional District. She is one of President-elect Donald Trump's most loyal supporters,
Starting point is 00:01:30 and that loyalty has been rewarded. Trump has picked Stefanik to be the next U.S. ambassador to the United Nations when his new administration takes office in 13 days. The certification of Trump's win went smoothly. Unlike 2021, when Trump's supporters violently attacked police officers, defecated in the halls of Congress, and forced the halt of the peaceful certification of a free and fair election. On that day, January 6th,
Starting point is 00:02:00 Trump, in his first presidency, sat in the Oval Office and watched the entire attack unfold on television. So this brings me back to Representative Stefanik. Last year, she was on NBC's Meet the Press. I have concerns about the treatment of January 6 hostages. I have concerns. We have a role in Congress of oversight. And I believe that we're seeing the weaponization of the federal government against not just President Trump, but we're seeing it against conservatives. We're seeing it against Catholics.
Starting point is 00:02:34 Now, this is quite different from what she said on the day of the attack four years ago. On January 6, 2021, Stefanik took to the House floor and made this statement. Americans will always have the freedom of speech and the constitutional right to protest, but violence in any form is absolutely unacceptable. It is anti-American and must be prosecuted to the fullest extent of the law. Now, we got that clip from C-SPAN. Stefanik also published a written statement on January 6, 2021, and posted it on her congressional website. That statement reads in part, quote,
Starting point is 00:03:15 I fully condemn the dangerous violence and destruction that occurred today at the United States Capitol. Violence in any form is absolutely unacceptable and anti-American, end quote. That was on her website. January 6th, 2021. But you know what? I'm going to pull up my computer here. Here I go. Here it is right here.
Starting point is 00:03:37 If you look for that statement today, and if you happen to be near a computer and you want to do this, I'm telling you, you will not find that statement. Here is the URL for where that statement once was, okay? And you can try it. I'm going to do it right here. H-T-T-P-S. Okay, secure, right?
Starting point is 00:03:55 And then you go Stefanik, which is S-T-E-F-A-N-I-K. Dot house.gov, slash 2021, slash 1, slash Stefanik. Again. Dash statement, dash violence,
Starting point is 00:04:17 dash united, dash states, dash capitol. Love those SEO URLs. Okay, so then hit go, enter. I wonder if you got what I got. I got a website that says, error. That statement on Congresswoman Stefanik's website has been taken down. In the 21st century, we use the internet as our history
Starting point is 00:04:44 book, our entire storage and filing system to document the story of ourselves and our nation. So what happens when that story can be so easily erased? Well, there is one place where that record remains preserved. It's called the Wayback Machine, and it's run by the nonprofit Internet Archive. That is how I found Stefanik's 2021 statement. So what happens if the Internet Archive, the Wayback Machine itself, ceased to exist? Well, due to some very important court cases active right now, in reality, a world without the Internet Archive is not impossible to imagine. Joining me now is Brewster Kahle. He's the founder of the Internet Archive and a digital librarian and computer engineer. Brewster Kahle, welcome to On Point. Oh, great to be here. Thank you, Meghna.
Starting point is 00:05:39 What do you imagine the world might be like if the Internet Archive or the Wayback Machine ceased to exist? The Internet Archive is about the only real public record of the broad World Wide Web, but also all sorts of other things like old television, old books, that are all a cooperative effort of thousands of libraries to build a record of our time and make it as publicly available as we can. That's what the Internet Archive is. Archive.org is a free service. It's used by millions of people a day. It's about the 200th most popular website. People want old stuff. What would we do if we didn't have access to that old stuff that has only ever existed in this digital format? I mean, do you dare imagine what that would be like?
Starting point is 00:06:29 It would mean that people wouldn't be able to be held as accountable for what it is they said. But I think more broadly, they just wouldn't be able to remember. We get emails all the time from people just being so delighted that their old websites are still around. I mean, the World Wide Web is kind of magic in making it so that everyone can be a publisher. But Tim Berners-Lee's system of the World Wide Web was kind of too simple. It only comes from one place. And that one place can be changed or deleted at any time. So we need a record.
Starting point is 00:07:08 We need a vibrant library system. And that's what's under threat here. I definitely accept the argument that the first ever web page that I made back in my college days with the little, like, dancing GIFs of Bart Simpson, that doesn't necessarily need preservation. Can you remind me, when did you come up with the idea of the Internet Archive? The Internet Archive started in 1996 in the early days of the web. It was to basically build the library we'd been dreaming of forever and I'd been working on since 1980.
Starting point is 00:07:41 But others had been working on for actually much longer. The Library of Alexandria is a constantly renewed mythological goal of trying to make the published works of humankind available. So for me, it was a pretty obvious step that we needed to do this. And so we created the World Wide Web a little bit in a crafty way. But now we have the Wayback Machine to help fill in some of those problems. Well, Brewster Kahle, stand by for just a moment, because when we come back, we're going to talk more about how the Internet Archive and the Wayback Machine work.
Starting point is 00:08:16 And then, of course, we will dive deep into these court cases that bring into question the existence of this archive itself. This is On Point. Today we are talking about the Internet Archive and its Wayback Machine and the court cases that could potentially threaten the existence of the digital world's most important archive.
Starting point is 00:08:47 And Brewster Kahle joins us. He's the founder and director of the Internet Archive. He's a digital librarian and a computer engineer as well. So, Brewster, how does it work? Like, how are you storing 900 billion web pages? How do you do it? Oh, it's just miracles of current computers. So we own our own computers at the Internet Archive.
Starting point is 00:09:08 It's not in some cloud someplace, which is somebody else's computers. Libraries take preservation very seriously. And there are about 1,300 libraries, including the National Archives and Library of Congress, that basically say, crawl these at these frequencies. We collect over one billion URLs every day. 1 billion. And those go and are stored in their full original form on hard drives, and then they're indexed to be the Wayback Machine. So if you go to the Wayback Machine at archive.org, you can just type in a URL and see past versions and see the web as it was.
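For readers who want to try the lookup described above, here is a minimal sketch, not part of the broadcast, of how an archived page can be addressed. It assumes the Wayback Machine's public URL pattern (`https://web.archive.org/web/<timestamp>/<url>`) and its availability endpoint (`https://archive.org/wayback/available`). The function names and the example address are hypothetical, and the code only builds URLs without contacting any server.

```python
from urllib.parse import urlencode


def wayback_snapshot_url(page_url: str, timestamp: str) -> str:
    """Build a link to the capture of page_url closest to timestamp.

    timestamp is a string of digits in YYYYMMDDhhmmss order; a prefix
    such as "2001" or "20210106" is assumed to be accepted as well.
    """
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def availability_query_url(page_url: str, timestamp: str) -> str:
    """Build a query asking whether a capture exists near timestamp."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"https://archive.org/wayback/available?{query}"


if __name__ == "__main__":
    # Hypothetical example: a site as it looked around January 2001.
    print(wayback_snapshot_url("example.com", "20010101"))
    print(availability_query_url("example.com", "20010101"))
```

Opening the first link in a browser should show the page as it was captured nearest that date. The second should return a small JSON document describing the closest capture, if one exists.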
Starting point is 00:09:51 So if you click on, say, a 2001 political website, you'll go and click around that world as it existed then by pulling it out of the archive. And you could see all of the changes that were made to every URL when they disappeared or whatever. It's used by millions of people a day. That's what I did for this press release that Congresswoman Stefanik released in 2021. And then there's a little sort of, there's like a timeline, right, at the top of the page that shows all the times that that page was scraped by the Wayback Machine. And does it show even the changes to the page every time? Yes, on the upper right, you can click to see if they've changed a word or phrase,
Starting point is 00:10:36 but often pages just completely disappear. Past editions have always been very important. It's the memory hole problem. It's the 1984 nightmare of being able to go back and change recorded history. And libraries, as being third party, nonprofit, public services, have always played a role in making a record and making that publicly available as well. So where do you get the funding for this? Because it seems like a very large undertaking. It is and it isn't. But yes, we get about one-third of our income from libraries paying us to
Starting point is 00:11:22 collect web pages or digitize books and records for them. About one-third from major donors and foundations, and about one-third from end users, and the same kind of NPR kind of beg-a-thon at the end of the year of, please, please, please. And we have over 150,000 people a year that go and say, I want to support access to history. And so it's about a $20 million, $25 million a year organization. Wikipedia is about 10 times that, but both of these are less than the San Francisco Public Library. We're tiny by comparison. So it is possible with these digital technologies
Starting point is 00:12:06 to make copies of these materials, preserve them, and then even put them in other locations for long-term storage. So it's not just websites, though, right? Correct me if I'm wrong, but it's everything that's digitally available through the web?
Starting point is 00:12:23 We can't keep up with everything, say, in YouTube, for instance, but if it's linked to or linked from a tweet, then we try to get it. But also, we try to record television, worldwide television, in cooperation with libraries around the world, and try to make that searchable. So the C-SPAN that you were referring to is also recorded.
Starting point is 00:12:50 We can send you a thumb drive and you can borrow that program. If you want to reuse it for your documentary, then you have to go and license it or something. But it's available as a library, as a record. And it's very important that it's not just from one place, because those are too easily manipulated and they go out of business all the time. The Internet Archive is kind of the place for websites that are long since dead. GeoCities, old people's blogs, the SoundClouds, the Bandcamps, those hold fantastic works, creative works of people that they love to be able to get back.
Starting point is 00:13:32 Yeah. So music, you mentioned, television, books. This is all stuff that intellectually also belongs to people. And as you said at the very beginning of the show, one of the goals of the Internet Archive is to not only preserve this material, but make it accessible to everyone. Isn't there a conflict there, right? Because, I mean, we hear all the time about artists having their stuff listened to or read, but not receiving a single penny of remuneration from that. That's how, well, publishing has always worked. They basically, in the old days back when we were growing up, publishers would make copies, sell them to libraries and individuals. Then these libraries would preserve them because they pay for the works.
Starting point is 00:14:18 The question is, how do we move into this digital era? So Brewster, hang on here for just a second because, as promised, we need to talk about these court cases that have come up regarding what the Internet Archive does. And in order to get sort of the legal view on that, I'm going to bring James Grimmelmann into the conversation. He's a professor of digital and information law at Cornell Tech and Cornell Law School, and he studies how laws regulating software affect freedom, wealth, and power. Professor Grimmelmann, welcome to On Point. Hi, it's great to be here. There's really two cases we need to talk about.
Starting point is 00:14:56 The first one is Hachette versus the Internet Archive. Tell us about that case. So this is a case about the Internet Archive's use of book scans. The Internet Archive, in collaboration with other libraries, and like many organizations, has been digitizing books. They get a physical copy of it. They put it in a book scanner. They take photographs of each of the pages. They recognize what the words are. And now they have a digital record of what used to be in a physical book. So Brewster was talking before about preserving
Starting point is 00:15:32 the web. Those are things that were accessible online at one point. But when you're talking about physical books, they've never been previously available digitally. And so this is an additional way of having archival copies of them that can be preserved. So in addition to digitizing the books for preservation, the Internet Archive also made them available to people for reading, the same way that a library with physical books would. You log in with your account, you check out the book, and then it's available to you to read on your computer until the end of your borrowing period. The idea is that people circulate a copy of this book in the same way that a library would circulate a physical book from its shelves. But Brewster, is that what the Internet Archive is doing with its digital books?
Starting point is 00:16:21 Because the court found the Internet Archive basically in violation of copyright law with its book scanning program. Yes, the Internet Archive, working with other libraries, basically has a physical copy, keeps that aside, and then lends the photographs of these books to one reader at a time. So the libraries actually in the e-book world never get a copy. They pay and pay and pay, but they've never bought a copy. Digital ownership is key here. So mostly the Internet Archive's collections are old 20th century materials. We link them into Wikipedia so that people can go and look at the Wikipedia links,
Starting point is 00:17:05 they just get a snippet. And then if they want to see more, they have to borrow or buy the book. Okay. But the Internet Archive, to be clear, lost this case. So let me provide a little bit more background. I believe it was first filed in June of 2020 in the Southern District of New York. It wasn't just Hachette, the publisher. It was also HarperCollins, Penguin Random House, and Wiley.
Starting point is 00:17:27 Basically what was found by a lower court judge and then affirmed by the Second Circuit was that, I'll quote the ruling here, is it fair use for a nonprofit organization to scan copyright-protected print books in their entirety and distribute those digital copies online in full for free, subject to a one-to-one owned-to-loaned ratio between its print copies and the digital copies it makes available at any given time, all without authorization from the copyright-holding publishers or authors? Well, the court applied the Copyright Act, and they said the answer is no. So Professor Grimmelmann, explain this ruling to me. What, in the eyes of the Second Circuit, because the Internet Archive declined to appeal up to the Supreme Court, what is the violation here of the
Starting point is 00:18:16 Copyright Act? The issue is that the Internet Archive's lending program looks and works a lot like traditional library book lending, but technically there are a bunch of computer implementation details that are different, and the court thought that those details make it fundamentally unlike library lending and not protected. So libraries have always relied on another copyright defense, first sale. Once you buy a copy of a book, it is yours to sell, give away, or lend out as you see fit. So libraries would always buy books, and then first sale would protect their right to lend them out to any of their patrons.
Starting point is 00:19:00 The issue is that when you go to the digital world, the Internet Archive isn't distributing a physical artifact, like a book with paper and ink, to its readers. It's giving them digital access. And the way that computers work when you want to give digital access to a file, it involves making a copy of the bits on that file on a different computer. And so the publishers argued and successfully persuaded the court that this is making a separate copy from the original,
Starting point is 00:19:34 either the book in paper form or the file on the archive's servers, and that that additional copy triggers copyright law and isn't protected by first sale. Okay. Now, we've got to take a quick break here, but when we come back, I want to get your response to that, Brewster, and then we'll talk about how the music industry is also applying legal pressure to the Internet Archive.
Starting point is 00:19:57 That's all in just a moment. This is On Point. Support for this podcast comes from Nature is the Solution, a podcast from the Nature Conservancy. When it comes to the environment, it's easy to focus on doom and gloom, but that's not the whole story, especially when there are so many projects
Starting point is 00:20:27 working towards bringing people and nature together. In this moment, optimism isn't naive. It's necessary. Follow Nature is the Solution wherever you listen to podcasts, and discover stories of impact and possibility. At Radiolab, we love nothing more than nerding out about science, neuroscience, chemistry. But we do also like to get into other kinds of stories, stories about policing or politics, country music, hockey, sex of bugs. Regardless of whether we're looking at science or not science, we bring a rigorous curiosity to get you the answers.
Starting point is 00:21:08 And hopefully, make you see the world anew. Radiolab, Adventures on the Edge of what we think we know. Wherever you get your podcasts. There is something powerful about the sound of the human voice. Beautifully produced audio has the unique power to connect and inspire. Tell your organization's story with a custom podcast from CitySpace Productions, the creative studio from WBUR's Business Partnerships Team. Become a thought leader.
Starting point is 00:21:34 Recruit new talent. Reach new audiences. Whatever your goal, we can help. Discover how the magic is made at WBUR.org slash creative studio. Going back to that statement from the publishers when they said, basically, I'm going to paraphrase here, there's no difference between what the Internet Archive is doing by scanning these books and slapping a book down on a photocopier and pressing copy.
Starting point is 00:22:04 And in every book, it says you cannot do that. You cannot make copies of this book. So, I mean, do you have an argument against that? I think the bigger picture here is that, yes, there was a New York court that sided with the publishers, but other courts side with libraries in Europe. When almost this exact same case came up of lending digitized books from libraries, all of Europe affirmed it both at the local level in Holland and at the European level. China has allowed digitizing and lending for 15 years.
Starting point is 00:22:45 India, also concerned with educating their public and supporting libraries, has also been supportive of educational exemptions. The United States, 100 years ago, led in libraries. And you have to remember that the publishers in general sue libraries over and over again about things like lending and have forever. But the legislatures and the judiciary in the United States 100 years ago said it was important to have libraries and archives. And we made the Carnegie Library System. What countries are going to lead in libraries is really unknown.
Starting point is 00:23:23 But that's the big question. You know, philosophically, I agree with you. I am a giant proponent of making information easily accessible for the good of the general public. But James Grimmelmann, let me turn back to you here. It doesn't seem to me that that is what this case is about. I mean, as you said, it's about the existence of that one digital copy. I mean, were the publishers' interests as narrow as that, or did they actually have some sort of other strategy in play?
Starting point is 00:23:52 Are they fearful that the Internet Archive, instead of doing those, like Brewster said, those dusty musties about World War II history from 70 years ago, that they're going to move into digitizing Harry Potter? Yeah, I think the publishers are concerned about what they see as a principle and a slippery slope. The Internet Archive is not going to put them out of business, but they're afraid that there will be lots of other libraries that have lower standards and take fewer safeguards and work with books that are front list titles, and that if they don't sue everyone who is crossing their radar, that eventually they won't be able to enforce any restrictions because everyone will just download free copies from the Internet of everything. Well, I just want to make a note that we did reach out to representatives and the legal team from the American Association of Publishers, and they did not respond, but we did have that
Starting point is 00:24:47 statement that I read earlier. That case is complete. I just want to remind everyone, the Internet Archive declined to appeal to the Supreme Court, so the Second Circuit's ruling stands. But Professor Grimmelmann, on the music side of things, there's another case, Universal Music Group. And they are taking aim at the Internet Archive's Great 78 Project. What is this case about? This case is about another of the Internet Archive's efforts to digitize and make available another source of old media, in this case, early 78 records.
Starting point is 00:25:23 So 78 RPMs, this is the first major generation of widespread commercial records. and it's an amazing history of the early sounds of recorded music. And so the archive took lots of these old ones, went to great effort to make digital versions of them, and put those on their websites so people can experience them without the risk of working with finding and potentially destroying extremely old and fragile records. And to be clear, these are not in the public domain?
Starting point is 00:25:56 Well, there's a mixture, because some of these would be works that would now be public domain. Some of them were not part of the federal copyright system, but were added to it by the recent Music Modernization Act. It's a very diverse and in some ways legally complicated set of works. Okay. So what's the issue here, though? Is it the same thing, that the Internet Archive is making this digital copy of these records and that the existence of that copy in and of itself is the problem? Or is it the fact that now many more people have access to it? No, it's in some ways they're very similar cases. There are some legal differences between the two due to the different status of music and the copyright system, and there's some technical differences. But it's not fundamentally different in kind. It's an objection by copyright owners that other people are making digital archives and then giving the public access to those archives.
Starting point is 00:26:56 Okay, so we reached out to Universal Music Group. They did not respond. But we also reached out to the Recording Industry Association of America, RIAA, and their chief legal officer, Ken Doroshow, sent us back this statement. And he actually talked about some of the legal differences here. He says, quote, Congress took decisive action to protect pre-1972 recordings in the Music Modernization Act. And then he says, the Internet Archive's quote-unquote mass scale copying, streaming and distribution of the thousands of pre-1972 recordings are blatant violations of those established rights. How do you read that, Professor Grimmelmann?
Starting point is 00:27:35 I think he's saying that the Music Modernization Act singled out music as special for extra protection. I don't know if that's right. There are some ways in which music gets a little bit heightened protection in U.S. copyright law and a lot of ways in which it gets less than other kinds of works. The MMA reduced some of those disparities, but I wouldn't say that it elevated music and old recordings above everything else. Well, Brewster, the Recording Industry Association of America calls your Great 78 Project, quote, yet another mass infringement scheme that has no basis in law. What's your response to that? We're a library. This project is a combination of over 100 different libraries and collections that have participated in building this collection. So one of the
Starting point is 00:28:28 first collections that came in was from the Boston Public Library. This collection of 78 RPM records is actually before vinyl. These are the old shellac recordings. You know, you have to wind it up, with the horn, the dog. They stopped being viable in 1950. And people don't even have the players for these. So to understand what America sounded like, you actually had to either go and find these things and then somewhat destroy them by putting them on these old record players, winding them up and listening to them in a crifty way. And people just weren't doing it. So in general, the idea is you'd go and make this available to researchers, which are about the only ones that care about the old crackly things.
Starting point is 00:29:15 It's not a money issue. Most of these things have only been listened to by researchers about a hundred times. If you were to pay full Spotify rates, it would be about $10. Yet they're suing for $600 million. Why? So, Professor Grimmelmann, that $600 million, is that what constitutes the threat of putting the Internet Archive out of business? Yes, it is.
Starting point is 00:29:43 Copyright has something called statutory damages, where the court is authorized to award up to potentially $150,000, even without proof that the defendant made that much money or the plaintiff lost that much money. It's just meant to be a kind of deterrent. And when you multiply $150,000 by thousands of recordings,
Starting point is 00:30:05 you get up into the high millions, hundreds of millions, very quickly. So let me ask you, Professor Grimmelmann, the same question where I started with Brewster at the top of the hour. Imagine for a moment that given the precedent of the publishers' case, that if this Universal Music Group case goes against the Internet Archive as well, and they are forced to pay hundreds of millions of dollars, which they can't, and they have to shut down. Hopefully there's other alternatives,
Starting point is 00:30:33 but if they had to shut down, what would we lose in your mind? Well, the other alternatives are going to be what the publishers would call pirate sites. They're going to be people who make archives, like, completely illegally. They're going to be people who do it without the professional standards of actually trying to curate and organize these large masses of material. We're going to have a huge morass of stuff out there, which will be polluted and overrun with advertising and malware and deep fakes.
Starting point is 00:31:08 We're going to have a greater disproportion of stuff that's ephemeral, rather than the enduring historical classics. So it's not like they're going to stop piracy. They're just going to make our past more confusing, messier, and harder to access. So Brewster, are you prepared for the possibility that you have to do something with these materials or that, like, do you have a plan for what you might do if the Internet Archive does not have a future? You'll always come away with a book if you talk to a librarian. There's a wonderful book called The Library: A Fragile History.
Starting point is 00:31:45 And it basically says, what happens to libraries? And it starts with the old Akkadian libraries and Library of Alexandria. But there's all these libraries in between. And what happens to libraries is they're destroyed. And they tend to be actively destroyed by the powerful. And it used to be church and government. And now it's corporations and government. And what libraries have generally done is they've tried to work with each other, but a lot is lost.
Starting point is 00:32:13 James Grimmelmann, I'll let you have the last word. I mean, there are other major organizations. I think of the Library of Congress. Is there a way for that institution whose duty is to document the history of this country, do they have a role here to be a container for this information? They could or should. There was an effort to create a Digital Public Library of America a few years ago. There is an absolute need for this to happen. The Internet Archive has stepped up to fill some of our archival needs,
Starting point is 00:32:45 but there's much more than they can do. Well, James Grimmelmann is a professor of digital and information law at Cornell Tech and Cornell Law School. Professor Grimmelmann, thank you so much for joining us. My pleasure. And Brewster Kahle, founder and director of the Internet Archive, Brewster, it has been a great pleasure to speak with you. Thank you so much. Thank you.
Starting point is 00:33:06 This is On Point. That was Meghna Chakrabarti and On Point from WBUR. If you want more of that, you can subscribe to On Point the same way you've subscribed to Endless Thread. See you next week.
