Many AIs train on enormous datasets of content. They draw on huge troves of images for art apps and oceans of text for chatbots. These datasets can sprawl across the internet, from Wikipedia and discussion threads to personal websites, paywalled materials, artist pages, and more.
Some creators oppose having their material play a role in such AI training, especially without permission or compensation. There have been protests, open letters, and several lawsuits to date (10, according to Pamela Samuelson on the Lawfare podcast). As one such open letter, from the Authors Guild, puts it:
Generative AI technologies built on large language models owe their existence to our writings. These technologies mimic and regurgitate our language, stories, style, and ideas. Millions of copyrighted books, articles, essays, and poetry provide the “food” for AI systems, endless meals for which there has been no bill. You’re spending billions of dollars to develop AI technology. It is only fair that you compensate us for using our writings, without which AI would be banal and extremely limited.
What would it take for academics to do the same? For some or many of us to turn against AI scraping and take action accordingly?
Consider: college and university faculty, staff, and students generate a lot of content. Research (papers, books, datasets, gray literature) appears in scholarly journals, as ebooks, on governmental websites, and in a range of digital archives and collections. Teaching materials reach the web through learning management systems/virtual learning environments, ebooks, websites personal and institutional, social media, and more. Informal publications created by academics can appear anywhere the internet functions, from YouTube and TikTok to Wikipedia entries, podcasts, Discord servers, game archives, and more. Various AI platforms can hoover this up. So can data-gatherers making material available for purchase.
Why resist AI scraping this stuff? Academics could decide that they didn’t want their materials to play a role in AI training for a variety of reasons. They might see scraping as a copyright violation and want at least the option of granting permission, perhaps for pay. They might view generative AI as reproducing prejudice, as threatening culture, or as helping businesses lay off staff. As one put it on Mastodon,
How to Use Generative AI Tools That Were Built Entirely On Stolen Data To Put Other People Out of a Job and Probably Ruin Entire Industries
But Not Mine
He He
Surely Not Mine
They could also consider scraping a violation of privacy.
Let’s assume for today’s post that some of us working in colleges and universities come to that anti-scraping conclusion. How might things play out?
We could see opposition occur in a technical way. A content server (campus IT, a publisher, some other third party) can block spidering in several ways, including posting a robots.txt file, changing server settings, setting up registration barriers or paywalls, or blocking requests from certain addresses. If the LMS/VLE is safe from being spidered, then we might push more content into those silos. We could also see academics asking third parties, such as publishers, to take this step for their work.
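For instance, here is a minimal robots.txt sketch along these lines. GPTBot and CCBot are the published user-agent tokens for OpenAI’s and Common Crawl’s crawlers (other firms publish their own); note that compliance with robots.txt is voluntary, so it only deters crawlers that choose to honor it:

```
# robots.txt served from the site root, e.g. https://example.edu/robots.txt
# (example.edu is a placeholder domain)

# Ask OpenAI's crawler to skip the whole site
User-agent: GPTBot
Disallow: /

# Ask Common Crawl's crawler (whose corpus feeds many training datasets) to skip it too
User-agent: CCBot
Disallow: /

# Leave all other crawlers, such as ordinary search engines, unaffected
User-agent: *
Disallow:
```

For bots that ignore such requests, the server settings mentioned above would have to do the enforcing, by rejecting requests that bear those user-agent strings or arrive from known crawler address ranges.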
Taking such a step would entail some tricky local politics, of course, pitting academics with competing goals (“I want to get academic research out to the world, openly!”) against each other and sparking turf battles. Individuals could stake out anti- or pro-AI stances as career moves or as expressions of their professional identity. Members of a unit (department, office, sorority) might seek to convince that group to adopt an anti-AI stance. Units, divisions, even entire institutions might consider policies about AIs scraping their content. Conferences might debate adopting similar policies. I can imagine anti-AI memes and simpler images appearing on websites and social media pages.
Beyond asking, academics have other approaches available. A given professor might issue a legal threat against an AI firm, or join or launch a lawsuit against the scraping entities to get them to stop. At the policy level, faculty, staff, or students might lobby in various ways for governments to implement policies protecting their content from AI usage. Professional associations could be good vehicles for this.
Beyond these technical and operational dimensions of AI opposition, I’m interested in what it would take mentally for significant numbers of faculty, staff, and students to adopt such an attitude. To begin with, how many already feel that AIs using their content is a problem? I haven’t seen polling on this yet, but am very curious. Perhaps prominent stories of higher education content appearing in AI outputs in a bad light would encourage the opposition to grow. We might also be swayed by organized anti-AI campaigns which take place on public platforms, through peer networks, or in professional groups.
Those could be directed at academics, or we might feel part of a broader appeal, such as one to creators generally, or to anyone contributing content to the social web. In turn, academics could throw their support behind larger efforts as public intellectuals or community members. After all, as I’ve said, we could see big cultural or political movements around AI.
Now, these aren’t the only attitudes we might take towards AI scraping, of course. Some academic creators might just prefer some acknowledgement when their work is used, without requiring permission or payment. The Creative Commons CC-BY license gives neat shape to this preference. Others could be happy just to see their materials out in the world, playing a role in other people’s education and research. That’s a more public domain orientation. If these stances sound familiar, that’s because they echo decades of arguments about how to present academic content to the web. Depending on how generative AI plays out, we might be due for another round of this old chestnut.
It’s possible that outside entities will make the AI scraping decision for academia. As I’ve said before, policymakers could easily determine that generative AI is a threat and regulate it into a very small thing, or out of existence. A court could rule against Google, Meta, OpenAI, et al., ordering them to stop gathering data, to stop using data, or even, as Professor Samuelson speculates, to destroy their models. Staff, students, and faculty who expressed anti-scraping views would feel vindicated. On the other hand, generative AI could survive and grow, continuing to ingest swarms of digital content and giving rise to further academic debates.
Are you seeing any such arguments in your field or institution? Are we going to see colleges and universities turn against AI on this score, or go along with it?
(thanks to Stephen Downes and other friends on Mastodon for conversation; I’m having a hard time linking to some of their individual posts, though)
I really enjoyed the way you wrote this out with lots of different possibilities. I personally think that AI will be viewed as a needed service. If content can be freely accessed on the internet, then it will be viewed as something that can be incorporated into AI training. The argument will be made that if a person or institution doesn't want it scraped, then they shouldn't make it freely available (paywall required). I believe the courts won't view the output as a direct derivative work and will instead view it as a general evolution and an allowed use.
Wisdom In - Wisdom Out. Priceless.