Molly White's Citation Needed - “Wait, not like that”: Free and open access in the age of generative AI

Episode Date: March 14, 2025

The real threat isn’t AI using open knowledge — it’s AI companies killing the projects that make knowledge free. Originally published on March 14, 2025.

Transcript
[00:00:01] I'm Molly White, and you're listening to the audio feed for the Citation Needed newsletter. You can see the text version of the newsletter online at citation-needed.news. "Wait, not like that": free and open access in the age of generative AI. The real threat isn't AI using open knowledge. It's AI companies killing the projects that make knowledge free. This issue was originally published on March 14, 2025. The visions of the Open Access Movement have inspired countless people to contribute their work to the commons,
[00:00:41] a world where, quote, every single human being can freely share in the sum of all knowledge, in the case of Wikimedia, and where, quote, education, culture, and science are equitably shared as a means to benefit humanity, in the case of Creative Commons. But there are scenarios that can introduce doubt for those who contribute to free and open projects like the Wikimedia projects, or who independently release their own works under free licenses. I call these "wait, no, not like that" moments. When a passionate Wikipedian discovers their carefully researched article has been packaged into an e-book and sold on Amazon for someone else's profit: wait, no, not like that. When a developer of an open-source software project sees a multi-billion-dollar
[00:01:26] tech company rely on their work without contributing anything back: wait, no, not like that. When a nature photographer discovers their freely licensed wildlife photo was used in an NFT collection minted on an environmentally destructive blockchain: wait, no, not like that. And perhaps most recently, when a person who publishes their work under a free license discovers that work has been used by tech giants to train extractive, exploitative large language models: wait, no, not like that. These reactions are understandable. When we freely license our work, we do so in the service of those goals: free and open access to knowledge and education. But when trillion-dollar companies exploit that openness while giving nothing back, or when our work enables harmful or exploitative uses, it can feel like we've been naive.
[00:02:20] The natural response is to try to regain control. This is where many creators find themselves today, particularly in response to AI training. But the solutions they're reaching for, more restrictive licenses, paywalls, or not publishing at all, risk destroying the very commons they originally set out to build. The first impulse is often to try to tighten the licensing, maybe by switching to something like the Creative Commons non-commercial, and thus non-free, license. When NFTs enjoyed a moment of popularity in the early 2020s, some artists looked to Creative Commons in hopes that they might declare
[00:02:59] NFTs fundamentally incompatible with their free licenses. They didn't. The same thing happened again with the explosion of generative AI companies training models on CC-licensed works, and some were disappointed to see Creative Commons take the stance that not only do CC licenses not prohibit AI training wholesale, but AI training should be considered non-infringing by default from a copyright perspective. But the trouble with trying to continually narrow the definitions of free is that it is impossible to write a license that will perfectly prohibit each possibility that makes a person go,
[00:03:35] wait, no, not like that, while retaining the benefits of free and open access. If that is truly what a creator wants, then they are likely better served by a traditional all-rights-reserved model, in which any prospective re-user must individually negotiate terms with them. But this undermines the purpose of free, and restricts permitted reuse only to those with the time, means, and bargaining power to negotiate on a case-by-case basis. And particularly with AI, there's also no indication that tightening the license even works. We already know that major AI companies have been training their models on all-rights-reserved works in their ongoing efforts to ingest as much data as possible, and such training may prove to have been permissible in U.S. courts under fair use, and
[00:04:24] it's probably best that it does. There's also been an impulse by creators concerned about AI to dramatically limit how people can access their work. Some artists have decided it's simply not worthwhile to maintain an online gallery of their work when that makes it easily accessible for AI training. Many have implemented restrictive content gates (paywalls, registration walls, "are you a human?" walls, and similar) to try to fend off scrapers. This too closes off the commons, making it more challenging or expensive for those "every single human beings" described in open access manifestos to access the material that was originally intended to be common goods. Often, by trying to wall off those considered to be bad actors, people wall off
[00:05:12] the very people they intended to give access to. People who gate their work behind paywalls likely didn't set out to create works that only the wealthy could access. People who implement registration walls probably didn't intend for their work to only be available to those willing to put up with a risk of incessant email spam after they relinquish their personal information. People who try to stave off bots with CAPTCHAs asking "are you a human?" probably didn't mean to limit their material only to abled people who are willing to abide ever more protracted and irritating riddles. And people using any of these strategies likely didn't want people to struggle to even find their work in the first place after the paywalls and reg walls and anti-bot
[00:05:56] mechanisms thwarted search engine indexers or social media previews. And frankly, if we want to create a world in which every single human being can freely share in the sum of all knowledge, and where education, culture, and science are equitably shared as a means to benefit humanity, we should stop attempting to erect these walls. If a kid learns that carbon dioxide traps heat in Earth's atmosphere, or how to calculate compound interest, thanks to an editor's work on a Wikipedia article, does it really matter if they learned it via ChatGPT, or by asking Siri, or from opening a browser and visiting Wikipedia.org? Instead of worrying about "wait, not like that," I think we need to reframe the conversation
[00:06:40] to "wait, not only like that," or "wait, not in ways that threaten open access itself." The true threat from AI models training on open access material is not that more people may access knowledge thanks to these new modalities. It's that these models may stifle Wikipedia and other free knowledge repositories, benefiting from the labor, money, and care that goes into supporting them while also bleeding them dry. It's that trillion-dollar companies become the sole arbiters of access to knowledge after subsuming the painstaking work of those who make knowledge free to all,
[00:07:17] killing those projects in the process. Irresponsible AI companies are already imposing huge loads on Wikimedia infrastructure, which is costly not only from a pure bandwidth perspective but also because it requires dedicated engineers to maintain and improve systems to handle the massive automated traffic. And AI companies that do not attribute their responses or otherwise provide any pointers back to Wikipedia prevent users from knowing where that material came from, and do not encourage those users to go visit Wikipedia, where they might then sign up as an editor or donate after seeing a request for support.
[00:07:54] This is most AI companies, by the way. Many so-called AI visionaries seem perfectly content to promise that artificial superintelligence is just around the corner, but claim that attribution is somehow a permanently unsolvable problem. And while I rely on Wikipedia as an example here, the same goes for any website containing freely licensed material, where scraping benefits AI companies at often extreme cost to the content hosts. This isn't just about strain on one individual project; it's about the systematic dismantling
[00:08:27] of the infrastructure that makes open knowledge possible. Anyone at an AI company who stops to think for half a second should be able to recognize they have a vampiric relationship with the commons. While they rely on these repositories for their sustenance, their adversarial and disrespectful relationships with creators reduce the incentives for anyone to make their work publicly available going forward, freely licensed or otherwise. They drain resources from maintainers of these common repositories, often without any compensation.
[00:08:59] They reduce the visibility of the original sources, leaving people unaware that they can or should contribute towards maintaining such valuable projects. AI companies should want a thriving open access ecosystem, ensuring that the models they trained on Wikipedia in 2020 can be continually expanded and updated. Even if AI companies don't care about the benefit to the common good, it shouldn't be hard for them to understand that by bleeding these projects dry, they are destroying their own food supply.
[00:09:29] And yet, many AI companies seem to give very little thought to this, seemingly looking only at the months in front of them rather than operating on years-long timescales. Though perhaps anyone who has observed AI companies' activities more generally will be unsurprised to see that they do not act as though they believe their businesses will be sustainable on the order of years. It would be very wise for these companies to immediately begin prioritizing the ongoing health of the commons so that they do not wind up strangling their golden goose. It would also be very wise for the rest of us to not rely on AI companies to suddenly,
[00:10:06] miraculously come to their senses or develop a conscience en masse. Instead, we must ensure that mechanisms are in place to force AI companies to engage with these repositories on their creators' terms. There are ways to do it: models like Wikimedia Enterprise, which welcomes AI companies to use Wikimedia-hosted data but requires them to do so using paid, high-volume pipes, both to ensure they don't clog up the system for everyone else and to make them financially support the extra load they're placing on the project's infrastructure. Creative Commons is experimenting with the idea of preference signals, a non-copyright-based model by which to communicate to AI companies and other entities the terms on which they may or may not reuse CC-licensed work.
[00:10:54] Everyday people need to be given the tools, both legal and technical, to enforce their own preferences around how their works are used. Some might argue that if AI companies are already ignoring copyright and training on all-rights-reserved works, they'll simply ignore these mechanisms too. But there's a crucial difference. Rather than relying on murky copyright claims or threatening to expand copyright in ways that would ultimately harm creators, we could establish clear legal frameworks around consent and compensation that build on existing labor and contract law. Just as unions have successfully negotiated terms of use, ethical engagement, and fair compensation in the past, collective bargaining can help establish enforceable agreements between AI companies, those freely licensing their works,
[00:11:44] and communities maintaining open knowledge repositories. These agreements would cover not just financial compensation for infrastructure costs, but also requirements around attribution, ethical use, and reinvestment in the commons. The future of free and open access isn't about saying "wait, not like that." It's about saying "yes, like that, but under fair terms": with fair compensation for infrastructure costs, with attribution and avenues by which new people can discover and give back to the underlying commons, with deep respect for the communities that make the commons, and the tools that build off of them, possible. Only then can we truly build the world where every single human being can freely share in the sum of all knowledge. A final note.
[00:12:32] As I was writing this piece, I discovered that a South by Southwest panel featuring delegates from the Wikimedia Foundation and Creative Commons, titled "Openness Under Pressure: Navigating the Future of Open Access," discussed some of the same topics. I was sadly scheduled to speak at the same time and so was unable to attend in person. The audio recording is available online, and I would highly recommend giving it a listen if this is a topic that interests you. Thanks for listening to this issue of the Citation Needed newsletter. If you would like to support my work with a free or pay-what-you-want subscription to the Citation Needed newsletter, or if you would like to receive these issues in your email, go to citation-needed.news and sign up. If you enjoyed the podcast version of this episode,
Starting point is 00:13:21 please consider leaving a rating or review in your podcast player of choice.
