Molly White's Citation Needed - “Wait, not like that”: Free and open access in the age of generative AI
Episode Date: March 14, 2025
The real threat isn't AI using open knowledge — it's AI companies killing the projects that make knowledge free. Originally published on March 14, 2025.
Transcript
I'm Molly White, and you're listening to the audio feed for the citation-needed newsletter.
You can see the text version of the newsletter online at citation-needed.news.
Wait, not like that.
Free and open access in the age of generative AI.
The real threat isn't AI using open knowledge.
It's AI companies killing the projects that make knowledge free.
This issue was originally published on March 14, 2025.
The visions of the Open Access Movement have inspired countless people to contribute their work to the commons,
a world where, quote, every single human being can freely share in the sum of all knowledge, in the case of Wikimedia,
and where, quote, education, culture, and science are equitably shared as a means to benefit humanity,
in the case of Creative Commons.
But there are scenarios that can introduce doubt for those who contribute to free and open projects like the Wikimedia
projects or who independently release their own works under free licenses. I call these,
wait, no, not like that, moments. When a passionate Wikipedian discovers their carefully researched
article has been packaged into an e-book and sold on Amazon for someone else's profit,
wait, no, not like that. When a developer of an open-source software project sees a multi-billion
dollar tech company rely on their work without contributing anything back, wait, no, not like that.
When a nature photographer discovers their freely licensed wildlife photo was used in an NFT collection minted on an environmentally destructive blockchain.
Wait, no, not like that.
And perhaps most recently, when a person who publishes their work under a free license discovers that work has been used by tech giants to train extractive, exploitative large language models, wait, no, not like that.
These reactions are understandable.
When we freely license our work, we do so in the service of those goals, free and open access to knowledge and education.
But when trillion-dollar companies exploit that openness while giving nothing back, or when our work enables harmful or exploitative uses,
it can feel like we've been naive.
The natural response is to try to regain control.
This is where many creators find themselves today, particularly in response to AI training.
But the solutions they're reaching for, more restrictive licenses, paywalls, or not publishing at all,
risk destroying the very commons they originally set out to build.
The first impulse is often to try to tighten the licensing,
maybe by switching to something like the Creative Commons non-commercial, and thus non-free, license.
When NFTs enjoyed a moment of popularity in the early 2020s,
some artists looked to Creative Commons in hopes that they might declare NFTs
fundamentally incompatible with their free licenses.
They didn't.
The same thing happened again with the explosion of generative AI companies training models on CC-licensed works,
and some were disappointed to see Creative Commons take the stance that not only do CC licenses
not prohibit AI training wholesale, AI training should be considered non-infringing by default
from a copyright perspective.
But the trouble with trying to continually narrow the definitions of free is that it is
impossible to write a license that will perfectly prohibit each possibility that makes a person go,
wait, no, not like that, while retaining the benefits of free and open access. If that is truly
what a creator wants, then they are likely better served by a traditional all-rights-reserved model,
in which any prospective re-user must individually negotiate terms with them. But this undermines the
purpose of free, and restricts permitted reuse only to those with the time, means, and bargaining
power to negotiate on a case-by-case basis. And particularly with AI, there's also no indication
that tightening the license even works. We already know that major AI companies have been training
their models on all rights reserved works in their ongoing efforts to ingest as much data as
possible, and such training may prove to have been permissible in U.S. courts under fair use, and
it's probably best that it does. There's also been an impulse by creators concerned about AI to
dramatically limit how people can access their work. Some artists have decided it's simply not
worthwhile to maintain an online gallery of their work when that makes it easily accessible for
AI training. Many have implemented restrictive content gates, paywalls, registration walls,
"are you a human" walls, and similar, to try to fend off scrapers. This
too closes off the commons, making it more challenging or expensive for those every single human
beings described in open access manifestos to access the material that was originally intended
to be common goods. Often by trying to wall off those considered to be bad actors, people wall off
the very people they intended to give access to. People who gate their work behind paywalls
likely didn't set out to create works that only the wealthy could access. People who implement
registration walls probably didn't intend for their work to only be available to those willing
to put up with a risk of incessant email spam after they relinquish their personal information.
People who try to stave off bots with CAPTCHAs asking "are you a human" probably didn't mean
to limit their material only to abled people who are willing to abide ever more protracted
and irritating riddles. And people using any of these strategies likely didn't want people to
struggle to even find their work in the first place after the paywalls and reg walls and anti-bot
mechanisms thwarted search engine indexers or social media previews. And frankly, if we want to create a
world in which every single human being can freely share in the sum of all knowledge, and where education,
culture, and science are equitably shared as a means to benefit humanity, we should stop attempting
to erect these walls. If a kid learns that carbon dioxide traps heat in Earth's atmosphere,
or how to calculate compound interest, thanks to an editor's work on a Wikipedia article,
does it really matter if they learned it via ChatGPT, or by asking Siri, or from opening a browser
and visiting Wikipedia.org?
Instead of worrying about wait, not like that, I think we need to reframe the conversation
to wait, not only like that, or wait, not in ways that threaten open access itself.
The true threat from AI models training on open access material is not that more people
may access knowledge thanks to these new modalities.
It's that these models may stifle Wikipedia and other free knowledge repositories,
benefiting from the labor, money, and care that goes into supporting them,
while also bleeding them dry.
It's that trillion-dollar companies become the sole arbiters of access to knowledge
after subsuming the painstaking work of those who make knowledge free to all,
killing those projects in the process.
Irresponsible AI companies are already imposing huge loads on Wikimedia infrastructure,
which is costly not only from a pure bandwidth perspective,
but also because it requires dedicated engineers to maintain and improve systems to handle the massive automated traffic.
And AI companies that do not attribute their responses or otherwise provide any pointers back to Wikipedia,
prevent users from knowing where that material came from,
and do not encourage those users to go visit Wikipedia,
where they might then sign up as an editor or donate after seeing a request for support.
This is most AI companies, by the way.
Many so-called AI visionaries seem perfectly content to promise that artificial superintelligence is just around the corner,
but claim that attribution is somehow a permanently unsolvable problem.
And while I rely on Wikipedia as an example here,
the same goes for any website containing freely licensed material,
where scraping benefits AI companies at often extreme cost to the content hosts.
This isn't just about strain on one individual project, it's about the systematic dismantling
of the infrastructure that makes open knowledge possible.
Anyone at an AI company who stops to think for half a second should be able to recognize they
have a vampiric relationship with the commons.
While they rely on these repositories for their sustenance, their adversarial and disrespectful
relationships with creators reduce the incentives for anyone to make their work
publicly available going forward, freely licensed or otherwise.
They drain resources from maintainers of these common repositories,
often without any compensation.
They reduce the visibility of the original sources,
leaving people unaware that they can or should contribute
towards maintaining such valuable projects.
AI companies should want a thriving open access ecosystem,
ensuring that the models they trained on Wikipedia in 2020
can be continually expanded and updated.
Even if AI companies don't care about the benefit to the common good, it shouldn't be hard
for them to understand that by bleeding these projects dry, they are destroying their own food
supply.
And yet, many AI companies seem to give very little thought to this, seemingly looking only
at the months in front of them, rather than operating on years-long timescales.
Though perhaps anyone who has observed AI companies' activities more generally will be unsurprised
to see that they do not act as though they believe their businesses will be
sustainable on the order of years. It would be very wise for these companies to immediately
begin prioritizing the ongoing health of the commons so that they do not wind up strangling their
golden goose. It would also be very wise for the rest of us to not rely on AI companies to suddenly,
miraculously come to their senses or develop a conscience en masse. Instead, we must ensure that
mechanisms are in place to force AI companies to engage with these repositories on their creators'
terms. There are ways to do it: models like Wikimedia Enterprise, which welcomes AI companies to use
Wikimedia-hosted data but requires them to do so using paid, high-volume pipes, both to ensure they don't
clog up the system for everyone else and to make them financially support the extra load they're
placing on the projects' infrastructure. Creative Commons is experimenting with the idea of preference
signals, a non-copyright-based model by which to communicate to AI companies and other entities
the terms on which they may or may not reuse CC-licensed work.
Everyday people need to be given the tools, both legal and technical, to enforce their own
preferences around how their works are used. Some might argue that if AI companies are already
ignoring copyright and training on all rights-reserved works, they'll simply ignore these mechanisms
too. But there's a crucial difference. Rather than relying on murky copyright claims or threatening to expand
copyright in ways that would ultimately harm creators, we could establish clear legal frameworks around
consent and compensation that build on existing labor and contract law. Just as unions have successfully
negotiated terms of use, ethical engagement, and fair compensation in the past, collective bargaining can
help establish enforceable agreements between AI companies, those freely licensing their works,
and communities maintaining open knowledge repositories. These agreements would cover not just
financial compensation for infrastructure costs, but also requirements around attribution,
ethical use, and reinvestment in the commons. The future of free and open access isn't about
saying, wait, not like that. It's about saying, yes, like that, but under fair terms. With
fair compensation for infrastructure costs, with attribution and avenues by which new people can
discover and give back to the underlying commons, with deep respect for the communities that make
the commons and the tools that build off of them possible. Only then can we truly build the
world where every single human being can freely share in the sum of all knowledge. A final note.
As I was writing this piece, I discovered that a South by Southwest panel featuring delegates
from the Wikimedia Foundation and Creative Commons, titled Openness Under Pressure: Navigating the Future of Open Access,
discussed some of the same topics. I was sadly scheduled to speak at the same time and so was unable to attend in person.
The audio recording is available online and I would highly recommend giving it a listen if this is a topic that interests you.
Thanks for listening to this issue of the Citation Needed newsletter.
If you would like to support my work with a free or pay-what-you-want
subscription to the Citation Needed newsletter, or if you would like to receive these issues in your
email, go to citation-needed.news and sign up. If you enjoyed the podcast version of this episode,
please consider leaving a rating or review in your podcast player of choice.
