As artificial intelligence programs have become ubiquitous over the past year, so have lawsuits from authors and other creative professionals who argue that their work has been essential to that ubiquity, and that they deserve to be paid for it: the “large language models” (or LLMs) that power text-generating AI tools are trained on content that has been scraped from the Web without its authors’ consent. Last week, my colleague Yona Roberts Golding wrote about how media outlets, specifically, are weighing legal action against companies that offer AI products, including OpenAI, Meta, and Google. They may have a case: a 2021 analysis of a dataset used by many AI programs showed that half of its top ten sources were news outlets. As Roberts Golding noted, Karla Ortiz, a conceptual artist and one of the plaintiffs in a lawsuit against three AI services, recently told a roundtable hosted by the Federal Trade Commission that the creative economy only works “when the basic tenets of consent, credit, compensation, and transparency are followed.”
As Roberts Golding pointed out, however, AI companies maintain that their datasets are protected by the “fair use” doctrine in copyright law, which allows copyrighted work to be repurposed under certain limited conditions. Matthew Butterick, Ortiz’s lawyer, told Roberts Golding that he is not convinced by this argument; LLMs are “being held out commercially as replacing authors,” he said, noting that AI-generated books have already been sold on Amazon, under real or fake names. Most copyright experts would probably agree that duplicating a book word for word isn’t fair use. But some observers believe that the scraping of books and other content to train LLMs likely is protected by the fair use exception, or at least that it should be. In any case, debates around news content, copyright, and AI echo similar debates around other types of creative content, debates that have run throughout AI’s recent period of rapid development and that draw on much older legal concepts and arguments.
Determining whether the training of LLMs on copyrighted text qualifies as fair use can be difficult even for experts, not just because AI is complicated, but because the concept of fair use is, too. According to a 1990 Supreme Court ruling, the doctrine exists to temper rigid applications of copyright law that might inadvertently “stifle the very creativity which that law is designed to foster.” The US Copyright Office notes that the Copyright Act of 1976 lists certain types of activity, such as criticism, comment, and news reporting, as examples that qualify under the exemption. But judges deciding such cases have to take into account four separate and in some cases competing factors: the purpose of the use and whether it is “transformative,” the nature of the copyrighted work, the amount of the work used, and what effect the use has on the market for the original.
Courts are more likely to find that nonprofit uses are fair, but the Copyright Office notes that this doesn’t mean that all nonprofit uses are fair, nor that all commercial uses are not. And, while the use of short excerpts of original works is more likely to be fair, some courts have found the use of an entire work to be fair if that use is seen as transformative—that is to say, if it has added something meaningfully new or put the work to a purpose different from that originally intended. When it comes to AI, the heart of the issue is the debate over what exactly an LLM does. Does it copy entire books in order to reproduce them? Or does it simply add the words in those books to its database, in order to answer questions and generate new content?
Earlier this year, Matthew Sag, a law professor at Emory University, told a US Senate subcommittee that technically, AI engines do not copy original works but rather “digest” them, in order to learn how human language functions. Rather than thinking of an AI engine as copying a book “like a scribe in a monastery,” Sag said, it makes more sense to think of it as learning from the data, like a student would. Joseph Paul Cohen, director of a file-sharing service called Academic Torrents, told Wired recently that great authors typically read the books that came before theirs. “It seems weird that we would expect an AI author to only have read openly licensed works,” Cohen said.
If we see LLMs as merely adding content to a database in order to generate better results for users, this would seem very similar to how search engines like Google work. And Google has won two important copyright cases that seem relevant to the AI debate. In one, the company was sued by Perfect 10, an adult entertainment site that claimed Google had infringed its copyright by generating thumbnail photos of its content; in 2007, an appeals court ruled that providing images in a search index was “fundamentally different” from simply creating a copy, and that in doing so, Google had provided “a significant benefit to the public.” In the other case, the Authors Guild, a professional organization that represents the interests of writers, sued Google for scanning more than twenty million books and showing short snippets of text when people searched for them. In 2013, a judge in that case ruled that Google’s conduct constituted fair use because it was transformative.
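The distinction at issue, between reproducing a work and indexing it, can be made concrete with a toy example. The sketch below, in Python, is purely illustrative and does not describe how Google’s systems or LLM training actually work; it shows how a search index can store full texts internally, much as Google Books stored full scans, while surfacing only short snippets to users.

```python
from collections import defaultdict

# Toy inverted index: full texts are stored internally (as Google Books
# stored full scans), but queries return only locations and short
# snippets, never whole works. Document names and texts are invented.
documents = {
    "book_a": "the quick brown fox jumps over the lazy dog",
    "book_b": "a lazy afternoon spent reading in the garden",
}

index = defaultdict(list)
for doc_id, text in documents.items():
    for position, word in enumerate(text.split()):
        index[word].append((doc_id, position))

def search(word, snippet_width=2):
    """Return a short snippet around each hit, not the full document."""
    results = []
    for doc_id, position in index.get(word, []):
        words = documents[doc_id].split()
        start = max(0, position - snippet_width)
        snippet = " ".join(words[start : position + snippet_width + 1])
        results.append((doc_id, snippet))
    return results

print(search("lazy"))
# [('book_a', 'over the lazy dog'), ('book_b', 'a lazy afternoon spent')]
```

Whether LLM training is more like building such an index or more like making a copy is, in effect, what the parties described below are arguing about.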
In 2019, the US Patent and Trademark Office requested input on questions around intellectual property protection and AI. OpenAI responded that it believes that the training of AI systems counts as a “highly transformative” use of copyrighted works, because the latter are meant for human consumption whereas the training of AI engines is a “non-expressive” activity aimed at helping software learn the patterns in language. Although the company conceded that its software scans entire works, it argued that the more important question is how much of a work is shown to a user. This argument was also a factor in the Google Books case.
Perhaps unsurprisingly, the Copyright Alliance, a nonprofit group that represents authors and other creative professionals, has taken issue with comparisons between Google’s scanning of books and the training of LLMs: unlike Google’s book search, the group says, AI training makes no provision for acknowledging “factual information about the copyrighted works” or for linking out to where users can find them. Instead, the Alliance argues that most generative AI programs “reproduce” expressive elements from copyrighted works, thereby creating new texts that often act as “market substitutes” for the originals they were trained on. Several of the recent copyright suits against AI services have referred to Books3, a large open-source database that Shawn Presser, an independent AI researcher, created from a variety of online sources, including so-called “shadow libraries” that host links to pirated versions of books and periodicals. Presser has argued, in response, that deleting databases like his risks creating a world in which only billion-dollar tech companies with big legal budgets can afford to create AI models.
According to a recent analysis by Alex Reisner in The Atlantic, the fair-use argument for AI generally rests on two claims: that generative-AI tools do not replicate the books they’ve been trained on but instead produce new works, and that those new works “do not hurt the commercial market for the originals.” Jason Schultz, the director of the Technology Law and Policy Clinic at New York University, told Reisner that there is a strong argument that OpenAI’s work meets both of these criteria. Elsewhere, Sy Damle, a former general counsel at the US Copyright Office, told a House subcommittee earlier this year that he believes the use of copyrighted work for AI training is categorically fair (though another former counsel from the same agency disagreed). And Mike Masnick of Techdirt has argued that the legality of the original material is irrelevant. If a musician were inspired to create new music after hearing pirated songs, Masnick asks, would that mean that the new songs infringe copyright?
As Reisner notes, some observers are concerned that AI indexing will change the incentives of the existing copyright system. If an AI program can scrape copyrighted works and turn out something in a similar style, artists could be less likely to create new works. But some authors seem sanguine about the prospect that their works will be scraped by AI. “Would I forbid the teaching (if that is the word) of my stories to computers?” Stephen King asked recently. “Not even if I could. I might as well be King Canute, forbidding the tide to come in. Or a Luddite trying to stop industrial progress by hammering a steam loom to pieces.”
Going forward, will news organizations try to hammer the loom? Some companies see licensing their content to train AI engines as a potential way forward; the Associated Press, for example, recently announced a deal with OpenAI. But as Roberts Golding notes, others have started protecting their websites from scraping tools, and rumors are circulating that several top media companies could soon bring big AI firms to court. Mehtab Khan, a lawyer and legal scholar at Harvard’s Berkman Klein Center, told Roberts Golding that suing would be a gamble—defeat could set the precedent that AI training constitutes fair use. When it comes to AI and copyright more generally, Khan said, the key is to find “a balance that would allow the public to have access, but would also address some of the anxieties that artists and creative industries have about their words being used without consent, and without compensation.”
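For readers curious what “protecting their websites from scraping tools” looks like in practice, the most common mechanism is a site’s robots.txt file. The snippet below is a hypothetical configuration, not any particular outlet’s; GPTBot (OpenAI’s crawler), Google-Extended (Google’s AI-training control), and CCBot (Common Crawl’s crawler) are real, documented tokens, though compliance with robots.txt is voluntary on the crawler’s part.

```
# Hypothetical robots.txt for a news site opting out of AI-training
# crawlers while remaining open to ordinary search indexing.

# OpenAI's web crawler
User-agent: GPTBot
Disallow: /

# Google's control for AI-training uses (does not affect Google Search)
User-agent: Google-Extended
Disallow: /

# Common Crawl, whose archives feed many training datasets
User-agent: CCBot
Disallow: /

# All other crawlers
User-agent: *
Allow: /
```

Blocking and licensing are not mutually exclusive, of course: a publisher can disallow training crawlers while negotiating deals like the AP’s.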
Other notable stories:
- Yesterday, Wael al-Dahdouh, the Gaza bureau chief for the Arabic-language arm of Al Jazeera, was reporting on air when he learned that at least four members of his family—his wife, son, daughter, and grandson—had been killed, apparently in an Israeli air strike that hit a refugee camp where they were sheltering. Elsewhere, Axios reported that Antony Blinken, the US secretary of state, asked the government of Qatar to “tone down” the rhetoric about the conflict on Al Jazeera, which it funds. (Al Jazeera has insisted that it is editorially independent; Israeli officials are currently pushing to ban its operations inside Israel, as we reported on Tuesday.) And Israel said that it briefly sent tanks into Gaza overnight after Prime Minister Benjamin Netanyahu gave a televised address signaling that a broader incursion was imminent, without offering any specifics.
- A gunman in Maine is believed to have killed at least sixteen people and injured dozens more in shootings at a bowling alley and a bar in the Lewiston area; as of this morning, the gunman was still at large and the local area was on lockdown. The Lewiston Sun Journal and the Portland Press Herald dropped their paywalls on their coverage of the shooting; you can follow their reporting on the evolving situation here. “I’m not anxious to wake up Thursday to a growing list of victims who are inevitably going to be tied to people I know and perhaps include some I do actually know,” Steve Collins, a reporter at the Sun Journal, tweeted overnight. “This isn’t a big community. That helps it come together in a crisis. And we will this time, too, with heavy hearts.”
- After weeks of failing to elect a Speaker following the ouster of Kevin McCarthy, House Republicans finally elevated Mike Johnson—a little-known hard-right congressman from Louisiana who led efforts to overturn Donald Trump’s 2020 election loss—to the post. Johnson’s rise was so sudden that the news media did not have much time to scrutinize his record before he was voted in, but that’s about to change; as Politico notes, “every word Johnson has ever uttered or written is about to come under scrutiny like never before.” Reporters—and opposition researchers—may have a lot of material to mine: Johnson is a former talk-radio host who has also hosted a podcast with his wife.
- A pair of congressmen—Jim McGovern, a Democrat, and Thomas Massie, a Republican—are preparing to write to the Biden administration urging an end to its push to extradite the WikiLeaks founder Julian Assange from the UK to face charges, including under the Espionage Act, related to the publication of leaked US military secrets in the early 2010s. The effort comes amid a visit to the US by Anthony Albanese, the prime minister of Australia (where Assange is from), who has called for a resolution to the case. (We wrote about the wider Australian reaction to the case last month.)
- And in the UK, reporters at the Financial Times found that a new book by Rachel Reeves—the economy spokesperson for the opposition Labour Party, which is well-placed to win elections slated for next year—contains examples of “apparent plagiarism,” including entire passages that appear to have been lifted from sources including The Guardian and Wikipedia. Reeves rejected allegations of plagiarism but did acknowledge “inadvertent mistakes” and pledged to rectify them in future editions.
ICYMI: The potential and peril of AI in the newsroom