AI & Copyright

The US Copyright Office has issued its latest opinion on AI and copyright:

https://natlawreview.com/article/copyright-offices-latest-guidance-ai-and-copyrightability

The U.S. Copyright Office’s January 2025 report on AI and copyrightability reaffirms the longstanding principle that copyright protection is reserved for works of human authorship. Outputs created entirely by generative artificial intelligence (AI), with no human creative input, are not eligible for copyright protection. The Office offers a framework for assessing human authorship for works involving AI, outlining three scenarios: (1) using AI as an assistive tool rather than a replacement for human creativity, (2) incorporating human-created elements into AI-generated output, and (3) creatively arranging or modifying AI-generated elements.

The Office’s approach to the use of models seems fairly reasonable to me.

I’m not so enthusiastic about the de facto policy for ingestion of copyrighted material for training models, which is being treated as fair use on the strength of earlier court rulings.

https://www.arl.org/blog/training-generative-ai-models-on-copyrighted-works-is-fair-use/

On the question of whether ingesting copyrighted works to train LLMs is fair use, LCA points to the history of courts applying the US Copyright Act to AI. For instance, under the precedent established in Authors Guild v. HathiTrust and upheld in Authors Guild v. Google, the US Court of Appeals for the Second Circuit held that mass digitization of a large volume of in-copyright books in order to distill and reveal new information about the books was a fair use. While these cases did not concern generative AI, they did involve machine learning. The courts now hearing the pending challenges to ingestion for training generative AI models are perfectly capable of applying these precedents to the cases before them.

I get that there are benefits to inclusive data for LLMs:

Why are scholars and librarians so invested in protecting the precedent that training AI LLMs on copyright-protected works is a transformative fair use? Rachael G. Samberg, Timothy Vollmer, and Samantha Teremi (of UC Berkeley Library) recently wrote that maintaining the continued treatment of training AI models as fair use is “essential to protecting research,” including non-generative, nonprofit educational research methodologies like text and data mining (TDM). …

What bothers me is that allegedly “generative” AI is only accidentally so. I think a better term in many cases might be “regurgitative.” An LLM is really just a big function with a zillion parameters, trained to minimize prediction error on sentence tokens. It may learn some underlying, even unobserved, patterns in the training corpus, but for any unique feature it may essentially be compressing information rather than transforming it in some way. That’s still useful – after all, there are only so many ways to write a Python script to suck tab-delimited text into a dataframe – but it doesn’t seem like such a model deserves much IP protection.
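Here’s a toy sketch of the “regurgitative” point. It is emphatically not how a real LLM works: it’s a bigram counter that always picks the most frequent next word, and the corpus, the function names (train_bigram, generate) and the max_tokens cutoff are all made up for illustration. Still, it shows the asymmetry I’m worried about. Where the training text overlaps, the model blends sources; where a sentence is unique, the only thing it can do is play it back.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def generate(counts, start, max_tokens=20):
    """Greedy decoding: always emit the most frequent next token."""
    output = [start]
    while len(output) < max_tokens and output[-1] in counts:
        next_token = counts[output[-1]].most_common(1)[0][0]
        output.append(next_token)
    return " ".join(output)


# Hypothetical three-sentence "training corpus": two overlapping
# sentences and one that shares no words with anything else.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "zebras quietly rearrange my bookshelf overnight",
]
model = train_bigram(corpus)

# Unique material comes back verbatim: the model has simply stored it.
unique = generate(model, "zebras")
print(unique == corpus[2])

# Overlapping material gets blended into something not in the corpus.
common = generate(model, "the", max_tokens=6)
print(common in corpus)
```

A model with billions of parameters is vastly more capable than this, but the mechanism by which a one-of-a-kind passage survives training looks to me more like storage than transformation.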

Perhaps the solution is laissez faire – DeepSeek “steals” the corpus the AI corps “transformed” from everyone else, commencing a race to the bottom in which the key tech winds up being cheap and hard to monopolize. That doesn’t seem like a very satisfying policy outcome though.

Get a lawyer

That’s really the only advice I can give on models and copyrights.

Nevertheless, here are some examples of contract language that may be illuminating. Bear in mind that I AM NOT A LAWYER AND THIS IS NOT LEGAL ADVICE. I provide no warranty of any kind and assume no liability for your use or misuse of these examples. There are lots of deadly details, regional differences, and variations in opinion about good contract terms. Also, these terms have been slightly adapted to conceal their origins, which may have unintended consequences. Get an IP lawyer to review your plans before proceeding.

Models and copyrights

Or, Friends don’t let friends work for hire.

Image Copyright 2004 Lawrence Liang, Piet Zwart Institute, licensed under a Creative Commons License

Photographers and other media workers hate work for hire, because it’s often a bad economic tradeoff, giving up future income potential for work that’s underpaid in the first place. But at least when you give up rights to a photo, that’s the end of it. You can take future photos without worrying about past ones.

For models and software, that’s not the case, and therefore work for hire makes modelers a danger to themselves and to future clients. The problem is that models draw on a constrained space of possible formulations of a concept, and tend to incorporate a lot of prior art. Most of the author’s prior art is probably, in turn, things learned from other modelers. But when a modeler reuses a bit of structure – say, a particular representation of a supply chain or a consumer choice decision – under a work-for-hire agreement, title to those equations becomes clouded, because the work-for-hire client owns the new work, and it’s hard to distinguish new from old.

The next time you reuse components that have been used for work-for-hire, the previous client can sue for infringement, threatening both you and future clients. It doesn’t matter if the claim is legitimate; the lawsuit could be debilitating, even if you could ultimately win. Clients are often much bigger, with deeper legal pockets, than freelance modelers. You also can’t rely on a friendly working relationship, because bad things can happen in spite of good intentions: a hostile party might acquire copyright through a bankruptcy, for example.

The only viable approach, in the long run, is to retain copyright to your own stuff, and grant clients all the license they need to use, reproduce, produce derivatives, or whatever. You can relicense a snippet of code as often as you want, so no client is ever threatened by another client’s rights or your past agreements.

Things are a little tougher when you want to collaborate with multiple parties. One apparent option, joint ownership of copyright to the model, is conceptually nice but actually not such a hot idea. First, there’s legal doctrine to the effect that individual owners have a responsibility not to devalue joint property, which is a problem if one owner subsequently wants to license or give away the model. Second, in some countries, joint owners have special responsibilities, so it’s hard to write a joint ownership contract that works worldwide.

Again, a viable approach is cross-licensing, where creators retain ownership of their own contributions, and license contributions to their partners. That’s essentially the approach we’ve taken within the C-ROADS team.

One thing to avoid at all costs is agreements that require equation-level tracking of ownership. It’s fairly easy to identify individual contributions to software code, because people tend to work in containers, contributing classes, functions or libraries that are naturally modular. Models, by contrast, tend to be fairly flat and tightly interconnected, so contributions can be widely scattered and difficult to attribute.

Part of the reason this is such a big problem is that we now have too much copyright protection, and it lasts way too long. That makes it hard for copyright agreements to acknowledge that we see far because we stand on the shoulders of giants, and it distorts the balance of incentives intended by the framers of the Constitution.

In the academic world, model copyright issues have historically been ignored for the most part. That’s good, because copyright is a hindrance to progress (as long as there are other incentives to create knowledge). That’s also bad, because it means that there are a lot of models out there that have not been placed in the public domain, but which are treated as if they were. If people start asserting their copyrights to those, things could get messy in the future.

A solution to all of this could be open source or free software. Copyleft licenses like the GPL and permissive licenses like Apache facilitate collaboration and reuse of models. That would enable the field to move faster as a whole through open extension of prior work. C-ROADS and C-LEARN and component models are going out under an open license, and I hope to do more such experiments in the future.

Update: I’ve posted some examples.