
Copyright, Fair Use and AI: A Kinda Official Report

By Emily Poler
Following up on my last post, which in part discussed the United States Copyright Office and its developing standards for determining whether works created with artificial intelligence are eligible for copyright, we can now dive into the recent “pre-publication” release of a report detailing the Office’s thinking on AI and fair use. A final version of the report is supposed to be published in the near future; however, the Trump Administration’s termination of the Director of the US Copyright Office, which some have linked to the Office’s issuance of this report, makes me wonder whether that time frame might change.

Why is this important? The Copyright Office is the government body where someone who wants to register a copyright files an application, and it also advises Congress on matters related to copyright, so its reports are likely to influence the judges currently considering the lawsuits against OpenAI and its ilk.

As a reminder, the fair use doctrine (some background here) allows use of copyrighted material under certain circumstances to be considered non-infringing (i.e., fair). Where a use is fair, the entity repurposing the copyrighted materials does not have to pay a license fee to the original creator. Courts have developed a four-factor framework for determining if a new use furthers the ultimate goal of copyright — the promotion of the “Progress of Science and useful Arts,” U.S. Constit., Art. I, Section 8 (and yes, that capitalization is in the original). This involves considering things like the degree to which the new work transforms the original and the extent to which the new work can substitute for the original work.

Overall, the Copyright Office’s report is quite interesting, replete with good background for anyone wanting to understand generative AI and the issues surrounding AI and copyright. Here are some highlights (and my thoughts):

  • Different uses of copyrighted materials should be treated differently. (Okay, that’s maybe not so surprising). For example, in the Copyright Office’s analysis, using copyrighted materials for initial training of an AI model is different from using copyrighted materials for retrieval augmented generation, where AI delivers information to users drawn from scraped original works. This makes sense to me because, as with many (most?) things, context matters. Moreover, numerous cases (including the Supreme Court’s decision in Andy Warhol Foundation v. Goldsmith) stress the importance of analyzing the specific use at issue. However, the Copyright Office also noted that “compiling a dataset or training alone is rarely the ultimate purpose. Fair use must also be evaluated in the context of the overall use.” Which leads us to the next point…
  • The report describes how “training a generative AI foundation model on a large and diverse dataset will often be transformative” because it converts a massive collection of copyrighted materials “into a statistical model that can generate a wide range of outputs across a diverse array of new situations.” On the flip side, the more similar a model’s outputs are to the original materials, the less likely those outputs are to be transformative. This could represent trouble for AI platforms that allow users to create outputs that replicate the style of a copyrighted work. In those cases (every case?) where an AI platform allows users to generate entirely original works as well as ones that are similar or identical to copyrighted materials, courts will have to figure out what constitutes fair use.
  • There is a common argument advanced by AI platforms that using copyrighted materials to train AI models is “inherently transformative because it is not for expressive purposes” since the models reduce movies, novels and other works to digital tokens. The Copyright Office isn’t buying this. It says changing “O Romeo, Romeo! Wherefore art thou Romeo?” into a string of numbers does not render it non-expressive because that digital information can subsequently be used to create expressive content. This makes a lot of sense. Translating Shakespeare into Russian isn’t transformative, so there’s no good reason that converting it into a “language” readable by a machine should be any different.
  • The use of entire copyrighted works for training weighs against a finding of fair use; however, the ingestion of whole works could be fair if a platform implements “guardrails” that prevent a user from obtaining substantial portions of the original work. Again, courts are going to need to examine real world uses and draw lines between those that are ok and those that are not.
  • When an AI platform produces work based on its training on copyrighted materials, even if that output lacks protectable elements of the original (for example, the exact melody or lyrics of a song), output that is stylistically similar to an original work could compete with that original work — and this weighs against a finding of fair use.
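On the tokenization point a couple of bullets up, a toy sketch makes the Copyright Office’s logic concrete. This is a simplified, word-level illustration (real models use subword tokenizers, and everything here is invented for demonstration): the quotation becomes a string of numbers, and the numbers map straight back to the expressive text, which is why the encoding is not inherently non-expressive.

```python
# Toy word-level "tokenizer" (real models use subword schemes such as
# BPE, but the round-trip idea is the same).

line = "O Romeo, Romeo! Wherefore art thou Romeo?"

# Build a vocabulary mapping each distinct word to an integer ID.
words = line.split()
vocab = {}
for w in words:
    vocab.setdefault(w, len(vocab))

# Encode: the quotation becomes a string of numbers...
token_ids = [vocab[w] for w in words]
print(token_ids)  # [0, 1, 2, 3, 4, 5, 6]

# ...and decode: the numbers map straight back to the expressive text.
id_to_word = {i: w for w, i in vocab.items()}
decoded = " ".join(id_to_word[i] for i in token_ids)
assert decoded == line
```

The lossless round-trip is the point: if the numbers can reproduce Shakespeare, calling them non-expressive is a hard sell.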

While at first blush there’s nothing particularly new or revelatory in the report, it is nonetheless effective at concisely synthesizing the issues raised in the various AI copyright-related lawsuits in the courts at the moment (and to come in the future). As such, it highlights the many areas where courts are going to have to define what does and does not constitute fair use, and the even trickier question of where precisely the lines between them will need to be drawn. Fun times ahead!

How Much Human Required: The Copyright Edition

By Emily Poler
We’re well into the first round of litigation over copyright infringement, with cases like the one brought by the New York Times against OpenAI (which I first wrote about here) now well into discovery. Meanwhile, a recent report from the U.S. Copyright Office indicates it has, to date, registered more than 1,000 works created with the assistance of artificial intelligence. Obviously, this is just the beginning. Which leads me to ask: what’s the next front for disputes involving AI and copyright law?

To me, the clear answer is this: How much human authorship is needed for a work created with AI to be copyrightable, and what implications does that have for defending against infringement claims based on partially AI-generated works? And how will courts sort out what is protectable (human created) from what’s not protectable (AI created)?

First, some background. 

Dr. Stephen Thaler is a computer scientist who developed an AI he dubbed the “Creativity Machine” (not the most creative name, if you ask me). According to Thaler, his Machine autonomously generated this artwork titled “A Recent Entrance to Paradise.” 

Thaler submitted a copyright registration to the U.S. Copyright Office for the image, listing himself as the owner and the Machine as the sole author. (He subsequently changed tactics in an attempt to claim that the artwork was created under the works made for hire provision of the Copyright Act, claiming that the image was a work for hire because he employed the AI that created the artwork.)

The Copyright Office denied the application, saying that only works authored by humans are eligible for copyright protection. 

Thaler then filed suit in the U.S. District Court for the District of Columbia against the Copyright Office and its director, Shira Perlmutter. That court sided with the Copyright Office, finding that “human authorship is an essential part of a valid copyright claim.” Most recently, the Court of Appeals for the District of Columbia affirmed the District Court’s finding. The Court of Appeals based its conclusion on a number of provisions in the Copyright Act that reference human attributes — an author’s “nationality or domicile,” surviving spouses and heirs, signature requirements, and the fact that the duration of a copyright is measured with reference to an author’s lifespan — when discussing who is an author. The Court wrote: “Machines do not have property, traditional human lifespans, family members, domiciles, nationalities… or signatures.” 

The Court also rejected Thaler’s claims that the artwork was a work for hire, pointing to the requirement in the Copyright Act that all works be created in the first instance by a human being. 

This brings me back to where I think we’re going to see copyright litigation. As noted above, the Copyright Office has registered a lot of works created by some combination of human and artificial intelligence. So, what is enough human authorship to make something created in part by AI copyrightable? Where is the line drawn? It’s pretty intriguing. Here’s a crude example: if you prompt an AI with, “create a fantasy landscape with unicorns and dragons,” is the image generated copyrightable? If you give it a detailed list of 47 specific prompts, will the Copyright Office approve? Somewhere in between? How can you calculate the percentage of a creative work attributable to human intervention, and the percentage that is computer processing?

And then there’s the flip side, which I think is even more interesting. If an AI creation isn’t copyrightable, what happens when someone (something?) sues for copyright infringement based on a work that was partially AI generated? Will courts have to ignore the AI-created portion of the work and how do you even figure out what that is? Enterprising defendants (and their counsel) will come up with some interesting arguments, enterprising plaintiffs (and their counsel) will push back, and courts will have to sort it all out.

And that starts to sound, however tentatively, like we’re getting into Terminator territory. So with that, all I can sign off with is, “hasta la vista.”

A Hint of How AI Infringement Suits Will Go?

As the lawyers reading this know, media giant Thomson Reuters has a proprietary online research database called Westlaw. In addition to hosting cases and statutes, Westlaw also includes original material written by Westlaw editors. A recent decision involving that original content and its use by Ross Intelligence, a potential Thomson Reuters competitor, to create an AI-powered product may provide a bit of a roadmap on fair use and other issues facing the courts considering cases against OpenAI, Perplexity and other generative AI platforms.

First, some background: while the bulk of Westlaw’s content (statutes, rules, ordinances, cases, administrative codes, etc.) is not subject to copyright protection, Westlaw editors concisely restate the important points of a case in short summaries. Each is called a Headnote. Westlaw organizes Headnotes into something called the West Key Number System, which makes it much easier to find what you’re looking for.

This case began when Ross asked to license Westlaw’s Headnotes to create its own, AI-powered legal research search engine. Not surprisingly, Thomson Reuters didn’t want to help create a competitor and said no. 

As a workaround, Ross hired a company called LegalEase, which in turn hired a bunch of lawyers to create training data for Ross’ AI. This training data took the form of a list of questions, each with correct and incorrect answers. While the lawyers answering these questions were told not to simply cut and paste Headnotes, the answers were formulated using Westlaw’s Headnotes and the West Key Number System. LegalEase called these “Bulk Memos.” 

Thomson Reuters was none too happy about this and sued Ross for, among other things, copyright infringement, claiming that “Ross built its competing product from Bulk Memos, which in turn were built from Westlaw [H]eadnotes.” In its defense, Ross claimed that Westlaw’s Headnotes were not subject to copyright protection, and that to the extent it infringed on Thomson Reuters’ copyrights, its use constituted fair use. 

In 2023 the Court largely denied Thomson Reuters’ motion for summary judgment, ruling that, among other things, the question of whether Headnotes qualify for copyright protection would have to be decided by a jury. The Court, however, subsequently had a change of heart and asked Thomson Reuters and Ross to renew their motions for summary judgment. Earlier this month, the Court ruled on these renewed motions. 

Of note, the Court found that at least some Headnotes qualified for copyright protection, as did the West Key Number System. On the Headnotes, the Court found that the effort of “distilling, synthesizing, or explaining” a judicial opinion was sufficiently original to qualify for copyright protection. The Court also found the West Key Number System to be sufficiently original to clear the “minimal threshold for originality” required for copyright protection. The Court further found that the Bulk Memos infringed on some of the Headnotes.

The Court also rejected Ross’ assertion of fair use. Its decision was based largely on the fact that Ross was using Thomson Reuters’ Headnotes to create a competing product. Here, the Court looked not only at Thomson Reuters’ current market but also at potential markets it might develop, finding that, because Thomson Reuters might create its own AI products, the Ross product could negatively impact Thomson Reuters’ market, which weighed against fair use.

The Court was not impressed with Ross’ reliance on a line of cases finding copying of computer code at an intermediate step to be fair use. Here, the Court noted that Ross was not copying computer code. Moreover, in those cases, the copying was necessary to access purely functional elements of a computer program and achieve new, transformative purposes. In contrast, Ross used Headnotes to make it easier to develop a competitive product. 

Ultimately, these conclusions are most interesting because of what other courts hearing AI infringement cases may take from them. Sure, there are differences (notably, Ross doesn’t seem to be using generative AI), but this case highlights some of the legal and factual issues we’re going to see as other cases move forward. In particular, I think the fact that the Court here found that the process of summarizing or distilling longer cases into Headnotes renders the Headnotes subject to copyright protection may be problematic for companies such as OpenAI, which has tried to claim that it is only ingesting underlying facts from news articles. If creating Headnotes is sufficiently original to qualify for copyright protection, then it seems likely that a reporter selecting the facts to include in a news article is also sufficiently original. 

Stay tuned. There is much, much more to come.

Get Ready: DeepSeek is Here

And this week, it’s DeepSeek. Every few days it seems there’s something new dominating tech headlines, and since right now it’s the low-cost, low-energy Chinese AI roiling world governments and markets, I thought I’d use this week’s post to take a look at some portions of DeepSeek’s Terms of Use (ToU). Of course, keep in mind nothing I write here is legal advice and, as I’ve covered at greater length previously, there’s a whole lot of uncertainty about the rules governing the creation of large language and diffusion models, as well as their outputs. But that doesn’t mean there’s not a lot to chew on already.

With that disclaimer out of the way, I’m going to start with something that’s rather mundane, but where litigators’ minds tend to go right off the bat: forum selection. For the non-attorneys out there, that’s where a lawsuit against DeepSeek would have to be brought. What do DeepSeek’s ToU say? “In the event of a dispute arising from the signing, performance, or interpretation of these Terms, the Parties shall make efforts to resolve it amicably through negotiation. If negotiation fails, either Party has the right to file a lawsuit with a court having jurisdiction over the location of the registered office of Hangzhou DeepSeek Artificial Intelligence Co., Ltd.”

In other words, if you want to sue DeepSeek, you have to do so in China. This is not atypical — technology companies generally include favorable forum selection clauses in their ToU — but from an American perspective, this will make it hard or impossible for most US-based DeepSeek users to sue the company in the event of a dispute. 

More disturbing is section 4.2 of DeepSeek’s ToU: “Subject to applicable law and our Terms, you have the following rights regarding the Inputs and Outputs of the Services: (1) You retain any rights, title, and interests—if any—in the Inputs you submit; (2) We assign any rights, title, and interests—if any—in the Outputs of the Services to you.” Sounds benign, right?

Nope. What it really means is if DeepSeek decides a user has violated its ToU (or Chinese law), it could unilaterally decide that the user has given up rights to its materials and/or the rights to use output from DeepSeek. This means DeepSeek could use this provision to claim ownership over the material users put into DeepSeek, or could sue a user who includes output generated by DeepSeek in any of their own commercial activities. People and organizations will have to make their own calls about whether this is an acceptable risk but, on top of the fact that any user who thinks their rights have been improperly rescinded would have to seek legal recourse in a Chinese court, this seems, um, bad.  

I should also mention that the privacy and national security concerns involved in using DeepSeek are well above my pay grade — but I’d love to hear your thoughts on them. I’m particularly curious what privacy attorneys think about the provisions around the platform’s use by minors (“DeepSeek fully understands the importance of protecting minors and will take corresponding protective measures in accordance with legal requirements and industry mainstream practices”); and reports that a DeepSeek database containing sensitive information was publicly accessible. Neither the vague language on the protection of minors nor DeepSeek’s failure to protect its information inspires confidence. But I’m not a privacy lawyer so maybe I’m missing something.  

Lastly, one especially amusing thing has come from the DeepSeek splash: OpenAI (creators of ChatGPT) has publicly accused DeepSeek of using its output to train DeepSeek’s AI, complaining that it is a violation of OpenAI’s terms of service. Ha! OpenAI, of course, is currently embroiled in several copyright infringement lawsuits (which I’ve covered here) with the New York Times and others over OpenAI’s use of their content to train its algorithms (and presumably compete with them). Oh, the irony.

Perplexity and the Perplexing Legalities of Data Scraping

Of the many lawsuits media giants have filed against AI companies for copyright infringement, the one filed by Dow Jones & Co. (publisher of the Wall Street Journal) and NYP Holdings Inc. (publisher of the New York Post) against Perplexity AI adds a new wrinkle. 

Perplexity is a natural-language search engine that generates answers to user questions by scraping information from sources across the web, synthesizing the data and presenting it in an easily digestible chatbot interface. Its makers call it an “answer engine” because it’s meant to function like a mix of Wikipedia and ChatGPT. The plaintiffs, however, call it a thief that is violating Internet norms to take their content without compensation.

To me, this represents a particularly stark example of the problems with how AI platforms are operating vis-a-vis copyrighted materials, and one well worth analyzing.

According to its website, Perplexity pulls information “from the Internet the moment you ask a question, so information is always up-to-date.” Its AI seems to work by combining a large language model (LLM) with retrieval-augmented generation (RAG — oh, the acronyms!). As this is a blog about the law, not computer science, I won’t get too deep into this but Perplexity uses AI to improve a user’s question and then searches the web for up-to-date info, which it synthesizes into a seemingly clear, concise and authoritative answer. Perplexity’s business model appears to be that people will gather information through Perplexity (paying for upgraded “Pro” access) instead of doing a traditional web search that returns links the user then follows to the primary sources of the information (which is one way those media sources generate subscriptions and ad views).
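For the curious, the retrieval-augmented part of that flow can be sketched in a few lines. This is a toy illustration only: the corpus, the keyword-overlap scoring, and the prompt format are all invented, and a real system would use a vector index and an actual language model rather than this stub.

```python
# Toy sketch of the RAG flow: retrieve the passages most relevant to a
# question, then stuff them into a prompt for a language model.

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank stored passages by crude keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble the augmented prompt that would be sent to the model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

corpus = [
    "The court granted summary judgment on the headnotes claim.",
    "Soybean futures fell sharply on Tuesday.",
    "Fair use is an affirmative defense to copyright infringement.",
]
question = "Is fair use a defense to copyright infringement?"
passages = retrieve(question, corpus)
print(build_prompt(question, passages))
```

The copyright rub is visible even in the toy: the retrieved passages are someone else’s text, delivered to the model (and ultimately the user) more or less intact.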

Part of this requires Perplexity to scrape the websites of news outlets and other sources. Web scraping is an automated method to quickly extract large amounts of data from websites, using bots that analyze the HTML content of web pages, locate and extract the desired data, and then aggregate it into a structured format (like a spreadsheet or database) specified by the user. The data acquired this way can then be repurposed as the party doing the gathering sees fit. Is this copyright infringement? Probably: copying copyrighted material without permission is the core of infringement, unless a defense such as fair use applies.
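That extract-and-aggregate pattern can be sketched with nothing but Python’s standard library. The page markup and the choice to harvest headline tags are hypothetical, and real scrapers fetch live pages over HTTP with more robust tooling, but the shape is the same: parse the HTML, pull out the target data, collect it into structured rows.

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Collects the text of every <h2> element on a page."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.rows = []  # the "structured format" the bot aggregates into

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.rows.append({"headline": data.strip()})

# A made-up page standing in for a news site's HTML.
page = (
    "<html><body>"
    "<h2>Markets Rally</h2><p>Story text.</p>"
    "<h2>Court Rules on AI</h2>"
    "</body></html>"
)
scraper = HeadlineScraper()
scraper.feed(page)
print(scraper.rows)  # [{'headline': 'Markets Rally'}, {'headline': 'Court Rules on AI'}]
```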

To make matters worse, at least according to Dow Jones and NYP Holdings, Perplexity seems to have ignored the Robots Exclusion Protocol. This is a widely observed standard that lets website owners instruct bots not to crawl, and thus not to copy, specified content, including copyrighted materials. However, despite the fact that these media outlets deploy this protocol, Perplexity spits out verbatim copies of some of the Plaintiffs’ articles and other materials.
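For the technically inclined: the Robots Exclusion Protocol is just a plain-text robots.txt file that crawlers are expected, but not technically forced, to honor. Python’s standard library can evaluate one; the rules below are invented for illustration.

```python
import urllib.robotparser

# A made-up robots.txt asking all bots to stay out of the articles section.
rules = [
    "User-agent: *",
    "Disallow: /articles/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved crawler checks before fetching; nothing compels it to.
print(rp.can_fetch("AnyBot", "https://example.com/articles/scoop"))  # False
print(rp.can_fetch("AnyBot", "https://example.com/about"))           # True
```

Which is exactly the plaintiffs’ complaint: the protocol only works if the bot on the other end chooses to ask.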

Of course, Perplexity has a defense, of sorts. Its CEO accuses the Plaintiffs and other media companies of being incredibly short sighted, and wishing for a world in which AI didn’t exist. Perplexity says that media companies should work with, not against, AI companies to develop shared platforms. It’s not entirely clear what financial incentives Perplexity has or will offer to these and other content creators. 

Moreover, it seems like Perplexity is the one that is incredibly shortsighted. The whole premise of copyright law is that if people are economically rewarded they will create new, useful and insightful (or at least, entertaining) materials. If Perplexity had its way, these creators would either not be paid at all or have to accept whatever it is that Perplexity deigns to offer. Presumably, this would not end well for the content creators, and there would be no more reliable, up-to-date information to scrape. And Perplexity’s self-righteous claim that media companies just want to go back to the Stone Age (i.e., the 20th century) seems premised on a desire for a world in which the law allows anyone who wants copyrighted material to just take it without paying for it. And that’s not how the world works — at least for now.