May 27, 2025
By Emily Poler
Following up on my last post, in which we discussed (in part) the United States Copyright Office and its developing standards for determining whether works created with artificial intelligence are eligible for copyright, we can now dive into the Office’s recent “pre-publication” release of a report detailing its thinking on AI and fair use. A final version of the report is supposed to be published in the near future; however, the Trump Administration’s termination of the Director of the US Copyright Office, which some link to the Office’s issuance of this report, makes me wonder whether that time frame might change.
Why is this important? The Copyright Office is the government body that processes copyright registration applications and advises Congress on matters related to copyright, so its reports are likely to influence the judges currently considering the lawsuits against the tech companies behind OpenAI and its ilk.
As a reminder, the fair use doctrine (some background here) allows use of copyrighted material under certain circumstances to be considered non-infringing (i.e., fair). Where a use is fair, the entity repurposing the copyrighted materials does not have to pay a license fee to the original creator. Courts have developed a four-part framework for determining if a new use furthers the ultimate goal of copyright — the promotion of the “Progress of Science and useful Arts,” U.S. Const., Art. I, § 8 (and yes, that capitalization is in the original). This involves considering things like the degree to which the new work transforms the original and the extent to which the new work can substitute for the original work.
Overall, the Copyright Office’s report is quite interesting, replete with good background for anyone wanting to understand generative AI and the issues it raises for copyright. Here are some highlights (and my thoughts):
- Different uses of copyrighted materials should be treated differently. (Okay, that’s maybe not so surprising.) For example, in the Copyright Office’s analysis, using copyrighted materials for the initial training of an AI model is different from using copyrighted materials for retrieval-augmented generation, where the AI delivers information to users that is drawn from scraped original works. This makes sense to me because, as with many (most?) things, context matters. Moreover, numerous cases (including the Supreme Court’s decision in Andy Warhol Foundation v. Goldsmith) stress the importance of analyzing the specific use at issue. However, the Copyright Office also noted that “compiling a dataset or training alone is rarely the ultimate purpose. Fair use must also be evaluated in the context of the overall use.” Which leads us to the next point…
- The report describes how “training a generative AI foundation model on a large and diverse dataset will often be transformative” because it converts a massive collection of copyrighted materials “into a statistical model that can generate a wide range of outputs across a diverse array of new situations.” On the flip side, the more a model’s outputs resemble the original materials, the less likely those outputs are to be transformative. This could spell trouble for AI platforms that allow users to create outputs replicating the style of a copyrighted work. In those cases (every case?) where an AI platform allows users to generate entirely original works as well as ones that are similar or identical to copyrighted materials, courts will have to figure out what constitutes fair use.
- AI platforms commonly argue that using copyrighted materials to train AI models is “inherently transformative because it is not for expressive purposes,” since the models reduce movies, novels and other works to digital tokens. The Copyright Office isn’t buying this. It says changing “O Romeo, Romeo! Wherefore art thou Romeo?” into a string of numbers does not render it non-expressive, because that digital information can subsequently be used to create expressive content. This makes a lot of sense. Translating Shakespeare into Russian isn’t transformative, so there’s no good reason that converting it into a “language” readable by a machine should be any different (see the sketch after this list).
- The use of entire copyrighted works for training weighs against a finding of fair use; however, the ingestion of whole works could be fair if a platform implements “guardrails” that prevent a user from obtaining substantial portions of the original work. Again, courts are going to need to examine real-world uses and draw lines between those that are permissible and those that are not.
- Even when an AI platform’s output lacks protectable elements of the originals it trained on (for example, the exact melody or lyrics of a song), output that is stylistically similar to an original work could compete with that work, and this weighs against a finding of fair use.
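On the tokenization point, a quick round trip makes the Office’s logic concrete. Below is a minimal Python sketch; it is a toy illustration, not any real model’s tokenizer (production systems use learned subword vocabularies), but the principle is the same:

```python
# Toy illustration: text becomes a sequence of integers and back again
# without losing a single character of the original expression.
text = "O Romeo, Romeo! Wherefore art thou Romeo?"

tokens = list(text.encode("utf-8"))        # each byte becomes an integer "token"
print(tokens[:6])                          # [79, 32, 82, 111, 109, 101]

recovered = bytes(tokens).decode("utf-8")  # the mapping is fully reversible
assert recovered == text                   # Shakespeare survives the round trip
```

If the numbers can be decoded straight back into Shakespeare, the expressive content never left the dataset, which is exactly the Copyright Office’s point.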
While at first blush there’s nothing particularly new or revelatory in the report, it is nonetheless effective at concisely synthesizing the issues raised in the various AI copyright-related lawsuits currently in the courts (and those to come). As such, it highlights the many areas where courts are going to have to define what does and does not constitute fair use, and the even trickier question of where precisely the lines between them will need to be drawn. Fun times ahead!
May 13, 2025
By Emily Poler
We’re well into the first round of litigation over copyright infringement, with cases like the one brought by the New York Times against OpenAI (which I first wrote about here) now well into discovery. Meanwhile, a recent report from the U.S. Copyright Office indicates it has, to date, registered more than 1,000 works created with the assistance of artificial intelligence. Obviously, this is just the beginning. Which leads me to ask: what’s the next front for disputes involving AI and copyright law?
To me, the clear answer is this: How much human authorship is needed for a work created with AI to be copyrightable, and what implications does that have for defending against copyright infringement claims involving such works? And how will courts sort out what is protectable (human-created) from what’s not protectable (AI-created)?
First, some background.
Dr. Stephen Thaler is a computer scientist who developed an AI he dubbed the “Creativity Machine” (not the most creative name, if you ask me). According to Thaler, his Machine autonomously generated an artwork titled “A Recent Entrance to Paradise.”
Thaler submitted an application to the U.S. Copyright Office to register the image, listing himself as the owner and the Machine as the sole author. (He subsequently changed tactics, claiming that the image was created under the works made for hire provision of the Copyright Act because he employed the AI that created it.)
The Copyright Office denied the application, saying that only works authored by humans are eligible for copyright protection.
Thaler then filed suit in the U.S. District Court for the District of Columbia against the Copyright Office and its director, Shira Perlmutter. That court sided with the Copyright Office, finding that “human authorship is an essential part of a valid copyright claim.” Most recently, the Court of Appeals for the District of Columbia affirmed the District Court’s finding. The Court of Appeals based its conclusion on a number of provisions in the Copyright Act that reference human attributes — an author’s “nationality or domicile,” surviving spouses and heirs, signature requirements, and the fact that the duration of a copyright is measured with reference to an author’s lifespan — when discussing who is an author. The Court wrote: “Machines do not have property, traditional human lifespans, family members, domiciles, nationalities… or signatures.”
The Court also rejected Thaler’s claims that the artwork was a work for hire, pointing to the requirement in the Copyright Act that all works be created in the first instance by a human being.
This brings me back to where I think we’re going to see copyright litigation. As noted above, the Copyright Office has registered a lot of works created by some combination of human and artificial intelligence. So, what is enough human authorship to make something created in part by AI copyrightable? Where is the line drawn? It’s pretty intriguing. Here’s a crude example: if you prompt an AI with, “create a fantasy landscape with unicorns and dragons,” is the image generated copyrightable? If you give it a detailed list of 47 specific prompts, will the Copyright Office approve? Somewhere in between? How can you calculate the percentage of a creative work attributable to human intervention, and the percentage that is computer processing?
And then there’s the flip side, which I think is even more interesting. If an AI creation isn’t copyrightable, what happens when someone (something?) sues for copyright infringement based on a work that was partially AI generated? Will courts have to ignore the AI-created portion of the work and how do you even figure out what that is? Enterprising defendants (and their counsel) will come up with some interesting arguments, enterprising plaintiffs (and their counsel) will push back, and courts will have to sort it all out.
And that starts to sound, however tentatively, like we’re getting into Terminator territory. So with that, all I can sign off with is, “hasta la vista.”
February 19, 2025
As the lawyers reading this know, media giant Thomson Reuters has a proprietary online research database called Westlaw. In addition to hosting cases and statutes, Westlaw also includes original material written by Westlaw editors. A recent decision involving that original content and its use by Ross Intelligence, a potential Thomson Reuters competitor, to create an AI-powered product may provide a bit of a roadmap on fair use and other issues facing the courts considering cases against OpenAI, Perplexity and other generative AI platforms.
First, some background: while the bulk of Westlaw’s content (statutes, rules, ordinances, cases, administrative codes, etc.) is not subject to copyright protection, Westlaw’s editors concisely restate the important points of a case in short summaries. Each is called a Headnote. Westlaw organizes Headnotes into something called the West Key Number System, which makes it much easier to find what you’re looking for.
This case began when Ross asked to license Westlaw’s Headnotes to create its own, AI-powered legal research search engine. Not surprisingly, Thomson Reuters didn’t want to help create a competitor and said no.
As a workaround, Ross hired a company called LegalEase, which in turn hired a bunch of lawyers to create training data for Ross’ AI. This training data took the form of a list of questions, each with correct and incorrect answers. While the lawyers answering these questions were told not to simply cut and paste Headnotes, the answers were formulated using Westlaw’s Headnotes and the West Key Number System. LegalEase called these “Bulk Memos.”
Thomson Reuters was none too happy about this and sued Ross for, among other things, copyright infringement, claiming that “Ross built its competing product from Bulk Memos, which in turn were built from Westlaw [H]eadnotes.” In its defense, Ross claimed that Westlaw’s Headnotes were not subject to copyright protection, and that to the extent it infringed on Thomson Reuters’ copyrights, its use constituted fair use.
In 2023 the Court largely denied Thomson Reuters’ motion for summary judgment, ruling that, among other things, the question of whether Headnotes qualify for copyright protection would have to be decided by a jury. The Court, however, subsequently had a change of heart and asked Thomson Reuters and Ross to renew their motions for summary judgment. Earlier this month, the Court ruled on these renewed motions.
Of note, the Court found that at least some Headnotes qualified for copyright protection. Specifically, it found that the effort of “distilling, synthesizing, or explaining” a judicial opinion was sufficiently original to qualify, and that the West Key Number System likewise cleared the “minimal threshold for originality” required for copyright protection. The Court further found that the Bulk Memos infringed on some of the Headnotes.
The Court also rejected Ross’ assertion of fair use. Its decision was based largely on the fact that Ross was using Thomson Reuters’ Headnotes to create a competing product. Here, the Court looked not only at Thomson Reuters’ current market but also at potential markets it might develop, finding that, because Thomson Reuters might create its own AI products, Ross’ product could negatively impact those markets, which weighed against fair use.
The Court was not impressed with Ross’ reliance on a line of cases finding copying of computer code at an intermediate step to be fair use. Here, the Court noted that Ross was not copying computer code. Moreover, in those cases, the copying was necessary to access purely functional elements of a computer program and achieve new, transformative purposes. In contrast, Ross used Headnotes to make it easier to develop a competitive product.
Ultimately, these conclusions are most interesting because of what other courts hearing AI infringement cases may take from them. Sure, there are differences (notably, Ross doesn’t seem to be using generative AI), but this case highlights some of the legal and factual issues we’re going to see as other cases move forward. In particular, the Court’s finding that the process of summarizing or distilling longer cases into Headnotes renders those Headnotes subject to copyright protection may be problematic for companies such as OpenAI, which has tried to claim that it only ingests the underlying facts from news articles. If creating Headnotes is sufficiently original to qualify for copyright protection, then a reporter’s selection of the facts to include in a news article seems likely to be sufficiently original as well.
Stay tuned. There is much, much more to come.
December 17, 2024
Of the many lawsuits media giants have filed against AI companies for copyright infringement, the one filed by Dow Jones & Co. (publisher of the Wall Street Journal) and NYP Holdings Inc. (publisher of the New York Post) against Perplexity AI adds a new wrinkle.
Perplexity is a natural-language search engine that generates answers to user questions by scraping information from sources across the web, synthesizing the data and presenting it in an easily-digestible chatbot interface. Its makers call it an “answer engine” because it’s meant to function like a mix of Wikipedia and ChatGPT. The plaintiffs, however, call it a thief that is violating Internet norms to take their content without compensation.
To me, this represents a particularly stark example of the problems with how AI platforms are operating vis-a-vis copyrighted materials, and one well worth analyzing.
According to its website, Perplexity pulls information “from the Internet the moment you ask a question, so information is always up-to-date.” Its AI seems to work by combining a large language model (LLM) with retrieval-augmented generation (RAG — oh, the acronyms!). As this is a blog about the law, not computer science, I won’t get too deep into this, but in essence Perplexity uses AI to refine a user’s question, searches the web for up-to-date information, and then synthesizes the results into a seemingly clear, concise and authoritative answer. Perplexity’s business model appears to be that people will gather information through Perplexity (paying for upgraded “Pro” access) instead of doing a traditional web search that returns links the user then follows to the primary sources of the information (which is one way those media sources generate subscriptions and ad views).
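For the technically curious, here is a schematic sketch of that pipeline. Everything in it is a hypothetical stand-in written for illustration, not Perplexity’s actual code:

```python
# A schematic sketch of retrieval-augmented generation (RAG).
# Both helper functions are stubs standing in for real components.

def retrieve(query: str) -> list[str]:
    # A real system would run a live web search here and pull fresh pages.
    return ["Stand-in source text relevant to: " + query]

def call_llm(prompt: str) -> str:
    # A real system would send the prompt to a hosted language model here.
    return "A synthesized, authoritative-sounding answer."

def answer(question: str) -> str:
    sources = retrieve(question)          # step 1: gather up-to-date material
    context = "\n\n".join(sources)
    prompt = (
        "Using only the sources below, answer the question.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)               # step 2: synthesize a response

print(answer("Who owns the copyright in an AI-generated image?"))
```

The legal significance of the retrieval step is that the model is not just drawing on its training data; it is fetching and repackaging current copyrighted content at the moment of the query.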
Part of this requires Perplexity to scrape the websites of news outlets and other sources. Web scraping is an automated method for quickly extracting large amounts of data from websites: bots analyze the HTML content of web pages, locate and extract the desired data, and aggregate it into a structured format (like a spreadsheet or database) specified by the user. The data acquired this way can then be repurposed however the gathering party sees fit. Is this copyright infringement? Probably: making unauthorized copies of protected material is the core of an infringement claim, unless a defense such as fair use applies.
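In code, a bare-bones version of that process might look something like this (the URL and the choice of headline tags are hypothetical; the requests and BeautifulSoup libraries do the fetching and parsing):

```python
# A bare-bones scraping sketch: fetch a page, parse its HTML,
# and aggregate the extracted data into a structured format.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/news")   # hypothetical news page
soup = BeautifulSoup(resp.text, "html.parser")    # parse the HTML into a tree

# Locate the desired data and collect it into rows fit for a spreadsheet.
rows = [{"headline": h.get_text(strip=True)} for h in soup.find_all("h2")]
print(rows)
```

Note that the fetch itself copies the page in full; whatever the bot ultimately keeps, a reproduction has already been made.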
To make matters worse, at least according to Dow Jones and NYP Holdings, Perplexity seems to have ignored the Robots Exclusion Protocol. This is a widely adopted standard, implemented through a robots.txt file posted on a website, that tells automated bots which parts of the site they may not crawl, including, as relevant here, pages containing copyrighted materials. However, despite the fact that these media outlets deploy this protocol, Perplexity spits out verbatim copies of some of the Plaintiffs’ articles and other materials.
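A robots.txt file is just plain text (for example, “User-agent: *” followed by “Disallow: /articles/”). For illustration, here is what honoring it looks like with Python’s standard library; the site and the bot’s name are made up:

```python
# A sketch of a well-behaved crawler checking robots.txt before fetching.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://news-site.example/robots.txt")
rp.read()  # download and parse the site's crawl rules

url = "https://news-site.example/articles/some-story"
if rp.can_fetch("ExampleBot", url):
    print("Allowed to crawl:", url)
else:
    print("robots.txt says do not fetch:", url)  # a compliant bot stops here
```

The protocol is voluntary, which is why the complaint frames ignoring it as a violation of Internet norms rather than of any technical barrier.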
Of course, Perplexity has a defense, of sorts. Its CEO accuses the Plaintiffs and other media companies of being incredibly shortsighted and of wishing for a world in which AI didn’t exist. Perplexity says that media companies should work with, not against, AI companies to develop shared platforms. It’s not entirely clear what financial incentives Perplexity has offered, or will offer, to these and other content creators.
Moreover, it seems like Perplexity is the one that is incredibly shortsighted. The whole premise of copyright law is that if people are economically rewarded, they will create new, useful and insightful (or at least entertaining) materials. If Perplexity had its way, these creators would either not be paid at all or would have to accept whatever Perplexity deigns to offer. Presumably, this would not end well for the content creators, and eventually there would be no more reliable, up-to-date information to scrape. Moreover, Perplexity’s self-righteous claim that media companies just want to go back to the Stone Age (i.e., the 20th century) seems premised on a desire for a world in which the law allows anyone who wants copyrighted material to simply take it without paying. And that’s not how the world works — at least for now.
September 9, 2024
Last week, the U.S. Court of Appeals for the Second Circuit affirmed a federal judge’s March 2023 holding that the Internet Archive’s practice of digitizing library books and making them freely available to readers on a strict one-to-one ratio was not fair use. For reasons I’ll get into below, the outcome is pretty unsurprising. It’s also worth looking at because it likely previews some of the arguments we’ll hear in the case between the New York Times and OpenAI (creator of ChatGPT) and Microsoft, if (or when) that case makes it to the Second Circuit. (Quick summary of my post on the subject: The New York Times Company filed suit late in December against Microsoft and several OpenAI affiliates, alleging that by using New York Times content to train its algorithms, the defendants infringed on the media giant’s copyrights, among other things.)
First, some background. The Internet Archive is a not-for-profit organization “building a digital library of Internet sites and other cultural artifacts in digital form” whose “mission is to provide Universal Access to All Knowledge.” To achieve this rather lofty goal, the Archive created its Open Library by scanning printed books in its possession or in the possession of a partner library and lending out one digital copy of a physical book at a time, in a system it dubs Controlled Digital Lending.
Enter COVID-19. During the height of the pandemic, when everyone was stuck at home without much to do, the Archive launched the National Emergency Library. This did away with Controlled Digital Lending and allowed almost unlimited access to each digitized book in its collection.
Not surprisingly, book publishers, who sell electronic copies of books to both individuals and libraries, were not thrilled. Four big-time publishers — Hachette, Penguin Random House, Wiley, and HarperCollins — sued the Internet Archive for copyright infringement, targeting both its National Emergency Library and Open Library as “willful digital piracy on an industrial scale.”
The Internet Archive responded that these projects constituted fair use and, therefore, did not infringe on the publishers’ copyrights. To back this up, the Archive claimed it was using technology “to make lending more convenient and efficient” because its work allowed users to do things that were not possible with physical books, such as permitting “authors writing online articles [to] link directly to” a digital book in the Archive’s library. The Archive also insisted its library was not supplanting the market for the publishers’ products.
The District Court rejected these arguments, holding that no case or legal principle supported the Archive’s defense that “lawfully acquiring a copyrighted print book entitles the recipient to make an unauthorized copy and distribute it in place of the print book, so long as it does not simultaneously lend the print book.” The judge also deemed the concept of Controlled Digital Lending “an invented paradigm that is well outside of copyright law.”
In affirming the District Court’s ruling, the Second Circuit applied the four-part test for fair use, which looks at: (1) the purpose and character of the use; (2) the nature of the copyrighted work; (3) the portion of the copyrighted work used (as compared to the entirety of the copyrighted work); and (4) the impact of the allegedly fair use on the potential market for or value of the copyrighted work.
The first factor — the purpose and character of the use — is broken down into two subsidiary questions: Does the new work transform the original, and is it of a commercial nature or is it for educational purposes? Neither the District Court nor the Court of Appeals bought the Internet Archive’s claim that its Open Library was transformative. The Court of Appeals held that the digital books provided by the Internet Archive “serve the same exact purpose as the original; making the authors’ works available to read.” (The Court of Appeals did find that, as a not-for-profit entity, the Internet Archive’s use of the books was not commercial.)
On the second factor, which generally carries little weight, the Court of Appeals also found in favor of the publishers. Of greater significance is factor three, which looks at how much of the copyrighted work is at issue. Copying a sentence or a paragraph of a book-length work is more likely to be fair use than copying the entire book, which, of course, is exactly what the Internet Archive was doing. Again, another win for the publishers.
And arguments on factor four — the impact on the market for the publishers’ products — didn’t work out any better for the Internet Archive. Notably, the Court of Appeals found that the Internet Archive was copying the publishers’ books for the exact same purpose as the original works offered by the publisher, thus naturally impacting their market and value.
So what are the takeaways here as we look ahead to the case between the New York Times and OpenAI/Microsoft?
On the one hand, OpenAI/Microsoft have copied entire articles from the Times (and from the numerous other plaintiffs suing them), which will hurt OpenAI/Microsoft’s claims of fair use. Likewise, those fair use arguments won’t get very far if the Times can show that ChatGPT’s output is negatively impacting the market for its work or functioning as a substitute for its journalism.
On the other hand, if OpenAI/Microsoft can show that ChatGPT’s output transformed the Times’ content, it may be able to prevail on fair use.
In any event, the case between OpenAI and Microsoft and The New York Times is likely to involve a lot more ambiguity than the Internet Archive matter did, with the potential to produce new interpretations of copyright law with massive consequences for media and technology companies worldwide.