July 8, 2025
By Emily Poler
Two recent court decisions are starting to provide some clarity about when AI companies can incorporate copyrighted works into their large language models (LLMs) without licenses from the copyright holders. One is in a suit against Meta; we’ll get to that in a future post.
Today, let’s focus on the suit brought by a group of authors against Anthropic PBC, the company behind Claude, a ChatGPT and Copilot competitor. (For what it’s worth, I’ve found Claude to be the best AI of the three.) Bottom line: “The training use was a fair use,” wrote Judge William Alsup. “The use of the books at issue to train Claude and its precursors was exceedingly transformative.” This ruling is a landmark, as it’s one of the first substantive decisions on how fair use applies to AI — and it’s a big win for AI, right? Well, there’s a catch.
But first, some background. To create Claude (I love how AI companies give their LLMs these friendly, teddy-bear names that mask the fact that they’re machines that can cause real harm), Anthropic collected a library of approximately seven million books. In some cases, Anthropic purchased hard copies and scanned them. But mostly, it just grabbed “free” (aka pirated) digital copies from the Internet. At least three authors whose books were used — Andrea Bartz, Charles Graeber and Kirk Wallace Johnson — were not amused, and in 2024 they filed a class action suit against Anthropic, alleging copyright infringement both for training Claude on their works and for obtaining the materials without paying for them.
As for Anthropic’s training of its LLM on copyrighted materials, the Court found this to be fair use, since that use dramatically differs from the works’ original purpose. As the judge wrote, “the technology at issue was among the most transformative many of us will see in our lifetimes.” This is a big deal.
But what’s also a big deal — and the catch for Anthropic — is that if you’re going to train an AI on copyrighted materials, you have to pay for them. In most cases, Anthropic didn’t. And thus, Judge Alsup is allowing the case to proceed to trial, writing that Anthropic “downloaded for free millions of copyrighted books in digital form from pirate sites on the internet.”
For me, there are a couple of notable takeaways here, some purely legal and some the kind of common sense that I suspect most kindergartners could point out. Let’s talk about the purely legal point first. The Court went to great lengths to distinguish the different ways that Anthropic used the works, which was critical in its fair use analysis.
As part of Anthropic’s process, when it scanned a purchased book it discarded the original copy. The Court found this constituted fair use so long as the hard copy was destroyed and the digitized version was not distributed outside the company. However, Anthropic kept all the books, including the millions of pirated copies, in a general library, even after deciding that some of those books would not be used for training now, or maybe ever. The judge specifically noted that this implied the company’s primary purpose was to amass a vast library without paying for it, regardless of whether it might someday be used for a transformative purpose, and that such a practice directly displaced legitimate demand for the authors’ works.
The opinion is especially interesting to me because of how the Court distinguished the facts of this case from those of other fair use cases. For example, the Court pointed out that in most (if not all) other fair use cases, the defendant obtained the initial copy legally, either by purchasing it or by using a library copy.
This brings us to the other big takeaway, which mixes legal reasoning with morals and common sense: a defendant doesn’t get a free pass on stealing copyrighted materials just because it does something neat with them. In his opinion, the judge consistently ruled that it’s not OK to pirate books. This should have been obvious to Anthropic (and its lawyers), since most children could tell you that doing something cool or interesting with the proceeds of a bank robbery doesn’t make the robbery legal. It’s particularly striking given that Anthropic’s whole marketing schtick is that it’s less evil than other technology companies. In fact, Anthropic’s lawyers seemed to acknowledge as much at oral argument, saying “You can’t just bless yourself by saying I have a research purpose and, therefore, go and take any textbook you want. That would destroy the academic publishing market if that were the case.”
It will be fascinating to see what happens in the trial, slated to start in December. If judgment on the piracy claims goes against Anthropic, U.S. copyright law allows statutory damages of up to $150,000 per infringed work. With more than seven million pirated books in Anthropic’s library, the damages could be huge.
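For the arithmetic-minded, here’s a quick back-of-the-envelope sketch in Python. It assumes, purely hypothetically, that the statutory maximum applied to every one of the roughly seven million works; actual awards, if any, would almost certainly be lower.

```python
# Back-of-the-envelope only: hypothetically applies the statutory
# maximum of $150,000 per work (reserved for willful infringement)
# to all ~7 million pirated books. Real per-work awards would
# almost certainly be far lower.
works = 7_000_000
max_statutory_damages_per_work = 150_000
total = works * max_statutory_damages_per_work
print(f"${total:,}")  # $1,050,000,000,000 -- over a trillion dollars
```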
Also huge, of course, is the precedent set here that training AI on copyrighted works is fair use. It’s a significant decision that many have been waiting for, and it will have enormous repercussions on, well, just about everything going forward.
Stay tuned. More to come soon on the suit against Meta over Llama, its LLM.
May 27, 2025
By Emily Poler
Following up on my last post, in which we discussed, in part, the United States Copyright Office and its developing standards for determining whether works created with artificial intelligence are eligible for copyright, we can now dive deep into the Office’s recent “pre-publication” release of a report detailing its thinking on AI and fair use. A final version of the report is supposed to be published in the near future; however, the Trump Administration’s termination of the Director of the US Copyright Office, which some have linked to the Office’s issuance of this report, makes me wonder whether that time frame might change.
Why is this important? The Copyright Office is the government body where copyright registration applications are filed, and it also advises Congress on matters related to copyright, so its reports are likely to influence the judges currently considering the lawsuits against OpenAI and its ilk.
As a reminder, the fair use doctrine (some background here) allows certain uses of copyrighted material to be considered non-infringing (i.e., fair). Where a use is fair, the entity repurposing the copyrighted materials does not have to pay a license fee to the original creator. Courts have developed a four-factor framework for determining whether a new use furthers the ultimate goal of copyright — the promotion of the “Progress of Science and useful Arts,” U.S. Const., Art. I, Sec. 8 (and yes, that capitalization is in the original). This involves considering things like the degree to which the new work transforms the original and the extent to which the new work can substitute for the original.
Overall, the Copyright Office’s report is quite interesting, replete with good background for anyone wanting to understand generative AI and the copyright issues it raises. Here are some highlights (and my thoughts):
- Different uses of copyrighted materials should be treated differently. (Okay, that’s maybe not so surprising.) For example, in the Copyright Office’s analysis, using copyrighted materials for the initial training of an AI model is different from using them for retrieval augmented generation, where the AI delivers users information drawn from scraped original works. This makes sense to me because, as with many (most?) things, context matters. Moreover, numerous cases (including the Supreme Court’s decision in Andy Warhol Foundation v. Goldsmith) stress the importance of analyzing the specific use at issue. However, the Copyright Office also noted that “compiling a dataset or training alone is rarely the ultimate purpose. Fair use must also be evaluated in the context of the overall use.” Which leads us to the next point…
- The report describes how “training a generative AI foundation model on a large and diverse dataset will often be transformative” because it converts a massive collection of copyrighted materials “into a statistical model that can generate a wide range of outputs across a diverse array of new situations.” On the flip side, the more a model’s outputs resemble the original materials, the less likely those outputs are to be transformative. This could spell trouble for AI platforms that allow users to create outputs replicating the style of a copyrighted work. In those cases (every case?) where an AI platform allows users to generate entirely original works as well as ones that are similar or identical to copyrighted materials, courts will have to figure out what constitutes fair use.
- AI platforms commonly argue that using copyrighted materials to train AI models is “inherently transformative because it is not for expressive purposes,” since the models reduce movies, novels and other works to digital tokens. The Copyright Office isn’t buying this. It says changing “O Romeo, Romeo! Wherefore art thou Romeo?” into a string of numbers does not render it non-expressive, because that digital information can subsequently be used to create expressive content. This makes a lot of sense. Translating Shakespeare into Russian isn’t transformative, so there’s no good reason that converting it into a “language” readable by a machine should be any different. (A toy illustration of this point follows this list.)
- The use of entire copyrighted works for training weighs against a finding of fair use; however, the ingestion of whole works could be fair if a platform implements “guardrails” that prevent a user from obtaining substantial portions of the original work. Again, courts are going to need to examine real-world uses and draw lines between those that are OK and those that are not.
- When an AI platform produces work based on its training on copyrighted materials, even if that output lacks protectable elements of the original (for example, the exact melody or lyrics of a song), output that is stylistically similar to an original work could compete with that original work — and this weighs against a finding of fair use.
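On the tokenization point above, here’s a toy sketch in Python. It is emphatically not any real model’s tokenizer (real LLMs use learned subword vocabularies, not character codes), but it shows why a “string of numbers” loses nothing expressive: the mapping runs both ways.

```python
# Toy stand-in for tokenization: encoding text as numbers is lossless,
# so the numbers still carry the expressive content and can be decoded
# right back. (Character codes here; real tokenizers use subword IDs.)
line = "O Romeo, Romeo! Wherefore art thou Romeo?"
tokens = [ord(ch) for ch in line]           # encode: text -> numbers
print(tokens[:8])                           # [79, 32, 82, 111, 109, 101, 111, 44]
restored = "".join(chr(t) for t in tokens)  # decode: numbers -> text
assert restored == line                     # nothing expressive was lost
```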
While at first blush there’s nothing particularly new or revelatory in the report, it is nonetheless effective at concisely synthesizing the issues raised in the various AI copyright-related lawsuits in the courts at the moment (and to come in the future). As such, it highlights the many areas where courts are going to have to define what does and does not constitute fair use, and the even trickier question of where precisely the lines between them will need to be drawn. Fun times ahead!
May 13, 2025
By Emily Poler
We’re well into the first round of litigation over AI-related copyright infringement, with cases like the one brought by the New York Times against OpenAI (which I first wrote about here) now deep into discovery. Meanwhile, a recent report from the U.S. Copyright Office indicates it has, to date, registered more than 1,000 works created with the assistance of artificial intelligence. Obviously, this is just the beginning. Which leads me to: what’s the next front for disputes involving AI and copyright law?
To me, the clear answer is this: how much human authorship is needed for a work created with AI to be copyrightable, and what implications does that have for defending against copyright infringement claims involving AI-generated material? And how will courts sort out what is protectable (human-created) from what’s not protectable (AI-created)?
First, some background.
Dr. Stephen Thaler is a computer scientist who developed an AI he dubbed the “Creativity Machine” (not the most creative name, if you ask me). According to Thaler, his Machine autonomously generated an artwork titled “A Recent Entrance to Paradise.”
Thaler submitted a copyright registration application to the U.S. Copyright Office for the image, listing himself as the owner and the Machine as the sole author. (He subsequently changed tactics, arguing that the image was a work made for hire under the Copyright Act because he employed the AI that created it.)
The Copyright Office denied the application, saying that only works authored by humans are eligible for copyright protection.
Thaler then filed suit in the U.S. District Court for the District of Columbia against the Copyright Office and its director, Shira Perlmutter. That court sided with the Copyright Office, finding that “human authorship is an essential part of a valid copyright claim.” Most recently, the U.S. Court of Appeals for the D.C. Circuit affirmed the District Court’s finding. The Court of Appeals based its conclusion on a number of provisions in the Copyright Act that reference human attributes — an author’s “nationality or domicile,” surviving spouses and heirs, signature requirements, and the fact that the duration of a copyright is measured with reference to an author’s lifespan — when discussing who is an author. The Court wrote: “Machines do not have property, traditional human lifespans, family members, domiciles, nationalities… or signatures.”
The Court also rejected Thaler’s claims that the artwork was a work for hire, pointing to the requirement in the Copyright Act that all works be created in the first instance by a human being.
This brings me back to where I think we’re going to see copyright litigation. As noted above, the Copyright Office has registered a lot of works created by some combination of human and artificial intelligence. So, what is enough human authorship to make something created in part by AI copyrightable? Where is the line drawn? It’s pretty intriguing. Here’s a crude example: if you prompt an AI with, “create a fantasy landscape with unicorns and dragons,” is the image generated copyrightable? If you give it a detailed list of 47 specific prompts, will the Copyright Office approve? Somewhere in between? How can you calculate the percentage of a creative work attributable to human intervention, and the percentage that is computer processing?
And then there’s the flip side, which I think is even more interesting. If an AI creation isn’t copyrightable, what happens when someone (something?) sues for copyright infringement based on a work that was partially AI generated? Will courts have to ignore the AI-created portion of the work and how do you even figure out what that is? Enterprising defendants (and their counsel) will come up with some interesting arguments, enterprising plaintiffs (and their counsel) will push back, and courts will have to sort it all out.
And that starts to sound, however tentatively, like we’re getting into Terminator territory. So with that, all I can sign off with is, “hasta la vista.”
February 19, 2025
As the lawyers reading this know, media giant Thomson Reuters has a proprietary online research database called Westlaw. In addition to hosting cases and statutes, Westlaw also includes original material written by Westlaw editors. A recent decision involving that original content and its use by Ross Intelligence, a potential Thomson Reuters competitor, to create an AI-powered product may provide a bit of a roadmap on fair use and other issues facing the courts considering cases against OpenAI, Perplexity and other generative AI platforms.
First, some background: while the bulk of Westlaw’s content — statutes, rules, ordinances, cases, administrative codes, etc. — is not subject to copyright protection, Westlaw editors concisely restate the important points of a case in short summaries, each called a Headnote. Westlaw organizes Headnotes into something called the West Key Number System, which makes it much easier to find what you’re looking for.
This case began when Ross asked to license Westlaw’s Headnotes to create its own AI-powered legal research search engine. Not surprisingly, Thomson Reuters didn’t want to help create a competitor and said no.
As a workaround, Ross hired a company called LegalEase, which in turn hired a bunch of lawyers to create training data for Ross’ AI. This training data took the form of a list of questions, each with correct and incorrect answers. While the lawyers answering these questions were told not to simply cut and paste Headnotes, the answers were formulated using Westlaw’s Headnotes and the West Key Number System. LegalEase called these “Bulk Memos.”
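To picture the shape of that training data, here’s a purely hypothetical sketch. The opinion describes questions paired with correct and incorrect answers; this exact structure and the sample question are my own invention, not anything from the record.

```python
# Hypothetical sketch of one Bulk Memo-style training example.
# The real memos' format and content are not public; this only
# illustrates the question/correct/incorrect structure described
# in the opinion.
bulk_memo_example = {
    "question": "What must a complaint plead to survive a Rule 12(b)(6) motion?",
    "correct_answer": (
        "Enough facts to state a claim to relief that is plausible on its face."
    ),
    "incorrect_answers": [
        "Proof of each element by a preponderance of the evidence.",
        "Only a bare recital of the elements of the cause of action.",
    ],
}
```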
Thomson Reuters was none too happy about this and sued Ross for, among other things, copyright infringement, claiming that “Ross built its competing product from Bulk Memos, which in turn were built from Westlaw [H]eadnotes.” In its defense, Ross claimed that Westlaw’s Headnotes were not subject to copyright protection, and that to the extent it infringed on Thomson Reuters’ copyrights, its use constituted fair use.
In 2023 the Court largely denied Thomson Reuters’ motion for summary judgment, ruling that, among other things, the question of whether Headnotes qualify for copyright protection would have to be decided by a jury. The Court, however, subsequently had a change of heart and asked Thomson Reuters and Ross to renew their motions for summary judgment. Earlier this month, the Court ruled on these renewed motions.
Of note, the Court found that at least some Headnotes qualified for copyright protection, as did the West Key Number System. On the Headnotes, the Court found that the effort of “distilling, synthesizing, or explaining” a judicial opinion was sufficiently original to qualify for copyright protection. The Court also found the West Key Number System to be sufficiently original to clear the “minimal threshold for originality” required for copyright protection. The Court further found that the Bulk Memos infringed on some of the Headnotes.
The Court also rejected Ross’ assertion of fair use. Its decision was based largely on the fact that Ross was using Thomson Reuters’ Headnotes to create a competing product. Here, the Court looked not only at Thomson Reuters’ current market but also at potential markets it might develop, finding that, because Thomson Reuters might create its own AI products, Ross’ product could negatively impact those markets, which weighed against fair use.
The Court was not impressed with Ross’ reliance on a line of cases finding copying of computer code at an intermediate step to be fair use. Here, the Court noted that Ross was not copying computer code. Moreover, in those cases, the copying was necessary to access purely functional elements of a computer program and achieve new, transformative purposes. In contrast, Ross used Headnotes to make it easier to develop a competitive product.
Ultimately, these conclusions are most interesting because of what other courts hearing AI infringement cases may take from them. Sure, there are differences (notably, Ross doesn’t seem to be using generative AI), but this case highlights some of the legal and factual issues we’re going to see as other cases move forward. In particular, I think the fact that the Court here found that the process of summarizing or distilling longer cases into Headnotes renders the Headnotes subject to copyright protection may be problematic for companies such as OpenAI, which has tried to claim that it is only ingesting underlying facts from news articles. If creating Headnotes is sufficiently original to qualify for copyright protection, then a reporter’s selection of the facts to include in a news article seems likely to be sufficiently original as well.
Stay tuned. There is much, much more to come.
February 11, 2025
And this week, it’s DeepSeek. Every few days it seems there’s something new dominating tech headlines, and since right now it’s the low-cost, low-energy Chinese AI roiling world governments and markets, I thought I’d use this week’s post to take a look at some portions of DeepSeek’s Terms of Use (ToU). Of course, keep in mind nothing I write here is legal advice and, as I’ve covered at greater length previously, there’s a whole lot of uncertainty about the rules governing the creation of large language and diffusion models, as well as their outputs. But that doesn’t mean there’s not a lot to chew on already.
With that disclaimer out of the way, I’m going to start with something that’s rather mundane, but where litigators’ minds tend to go right off the bat: forum selection. For the non-attorneys out there, that’s where a lawsuit against DeepSeek would have to be brought. What do DeepSeek’s ToU say? “In the event of a dispute arising from the signing, performance, or interpretation of these Terms, the Parties shall make efforts to resolve it amicably through negotiation. If negotiation fails, either Party has the right to file a lawsuit with a court having jurisdiction over the location of the registered office of Hangzhou DeepSeek Artificial Intelligence Co., Ltd.”
In other words, if you want to sue DeepSeek, you have to do so in China. This is not atypical — technology companies generally include favorable forum selection clauses in their ToU — but from an American perspective, this will make it hard or impossible for most US-based DeepSeek users to sue the company in the event of a dispute.
More disturbing is section 4.2 of DeepSeek’s ToU: “Subject to applicable law and our Terms, you have the following rights regarding the Inputs and Outputs of the Services: (1) You retain any rights, title, and interests—if any—in the Inputs you submit; (2) We assign any rights, title, and interests—if any—in the Outputs of the Services to you.” Sounds benign, right?
Nope. What it really means is that if DeepSeek decides a user has violated its ToU (or Chinese law), it could unilaterally decide that the user has given up rights to their materials and/or the right to use output from DeepSeek. This means DeepSeek could use this provision to claim ownership of the material users put into DeepSeek, or could sue a user who includes DeepSeek-generated output in any of their own commercial activities. People and organizations will have to make their own calls about whether this is an acceptable risk but, on top of the fact that any user who thinks their rights have been improperly rescinded would have to seek legal recourse in a Chinese court, this seems, um, bad.
I should also mention that the privacy and national security concerns involved in using DeepSeek are well above my pay grade — but I’d love to hear your thoughts on them. I’m particularly curious what privacy attorneys think about the provisions around the platform’s use by minors (“DeepSeek fully understands the importance of protecting minors and will take corresponding protective measures in accordance with legal requirements and industry mainstream practices”); and reports that a DeepSeek database containing sensitive information was publicly accessible. Neither the vague language on the protection of minors nor DeepSeek’s failure to protect its information inspires confidence. But I’m not a privacy lawyer so maybe I’m missing something.
Lastly, one especially amusing thing has come out of the DeepSeek splash: OpenAI (creator of ChatGPT) has publicly accused DeepSeek of using its output to train DeepSeek’s AI, complaining that this violates OpenAI’s terms of service. Ha! OpenAI, of course, is currently embroiled in several copyright infringement lawsuits (which I’ve covered here) brought by the New York Times and others over OpenAI’s use of their content to train its algorithms (and, presumably, compete with them). Oh, the irony.