A group of prominent intellectual property law professors has weighed in on the high-stakes AI copyright battle between several authors and Meta. In an amicus brief, the scholars argue that using copyrighted content as training data can be considered fair use under U.S. copyright law, if the goal is to create a new and ‘transformative’ tool. This suggests that fair use could potentially apply to Meta’s training process, even if the underlying data was obtained without permission.
This case has a clear piracy angle, as Meta used BitTorrent to download archives of pirated books to use as training material. Notably, the authors argue that Meta not only copied pirated books from Anna’s Archive and Z-Library, but that, through the same BitTorrent process, it also uploaded pirated books to third parties.
This week, a group of IP Law Professors submitted a “friend of the court” or amicus brief, backing Meta’s fair use defense. The professors, including scholars from Harvard, Emory, Boston University, and Santa Clara University, have different views on the impact of AI but are united in their copyright stance.
The brief stresses that Meta’s alleged use of pirated books as training data can be considered fair use. The source of the training data is not determinative, as long as it’s used to create a new and transformative product, they argue.
“The case law, including binding circuit precedent, holds that internal copying, made in the course of creating new knowledge, is a transformative use that is heavily favored by fair use doctrine,” the professors write.
The professors’ argument is centered around the concept of “transformative use.” They note that using books outside their original ‘reading’ purpose to create an AI model transforms the purpose of the use. This internal copying, they argue, falls into a category courts have consistently recognized as fair use, also known as “non-expressive use”.
The amicus brief cites several cases to back up their line of reasoning. This includes the Perfect 10 v. Amazon lawsuit, where the Ninth Circuit found that it was fair use when Google created thumbnails using images copied from unauthorized “pirate” sites, because the resulting image search tool was transformative.
The authors cited conflicting cases, but the professors note that the cases where fair use was denied typically involved infringement for personal consumption, rather than the use of content to create something new.
The brief distinguishes this case from those cited by the plaintiffs, which involved unauthorized copying for direct consumptive use (e.g., downloading for personal enjoyment). In contrast, Meta’s internal copies were allegedly not perceived by humans but used to build a new tool.
“Fair use, like copyright as a whole, ‘is not a privilege reserved for the well behaved’,” the brief notes. “Fair use doctrine should focus on the consequences of a ruling for knowledge and expression. Other considerations should be left for other legal regimes.”
Other countries, including Japan, have reportedly crafted exceptions in their law to allow tech companies to train LLMs on copyrighted material, without permission.
The U.S. has no such exceptions, but the professors urge the court to consider fair use. As the VCR and other innovations showed, copyright shouldn’t stand in the way of new tools and developing technologies.
LLMs don’t ‘remember the entire contents of each book they read’. The data are used to train the LLM’s predictive capabilities for sequences of words (or, more accurately, tokens). In a sense, the model develops a lossy statistical model of its training data, not a literal database. LLMs also use a stochastic sampling process, which means you’ll get different results each time you ask any given question, not a deterministic regurgitation of ‘read texts’. This is why it’s a transformative process, and also why LLMs can hallucinate nonsense.
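To make the idea concrete, here is a minimal toy sketch of the principle (nothing like a real LLM, and the tiny bigram table and function names are invented for illustration): the “model” only stores learned probabilities for what token tends to follow another, and generation samples from those probabilities, so repeated runs can produce different continuations.

```python
import random

# Hypothetical toy "model": probabilities of the next token given the current
# one, as might be estimated from training text. This is a lossy statistical
# summary of the text, not a stored copy of it.
bigram_probs = {
    "the": {"cat": 0.5, "dog": 0.3, "mat": 0.2},
    "cat": {"sat": 0.6, "ran": 0.4},
    "dog": {"ran": 0.7, "sat": 0.3},
}

def sample_next(context: str, rng: random.Random) -> str:
    """Stochastically pick the next token from the learned distribution."""
    dist = bigram_probs[context]
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    # random.choices does weighted sampling: different seeds/runs can
    # yield different tokens, unlike a deterministic database lookup.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Two differently seeded runs may continue "the" differently.
print(sample_next("the", random.Random(1)))
print(sample_next("the", random.Random(7)))
```

Real LLMs do the same thing at vastly larger scale, predicting one token at a time from context with a neural network instead of a lookup table, but the sampling step is why outputs vary between runs.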
This stuff is counter-intuitive. Below is a very good, in-depth explanation that really helped me get a sense of how these things work. Highly recommended if you can spare the 3 hours (!):
https://www.youtube.com/watch?v=7xTGNNLPyMI&list=PLMtPKpcZqZMzfmi6lOtY6dgKXrapOYLlN