Will Copyright Doom Generative AI?
New large language models (LLMs) seem to appear every day, yet the controversy surrounding the legality of training them on copyrighted material continues to rage. Big Media is both threatened by generative artificial intelligence (AI) – chatbots that write novels and news copy, generators that create artwork or music to order in the style of any artist whose work is accessible on the internet – and determined to grab a share of the wealth it generates. The list of pending lawsuits against the titans of tech is long, and their resolution is distant. And the techies are striking back: Meta, Google, OpenAI and others have asked the Trump administration to declare that it’s legally permissible to use copyrighted material to train AI models.
Before we consider whether copyright owners can prevent use of their content to train generative AI, let’s first ask whether they should be able to do so. The question is easier, or at least clearer, if we are prepared to attribute agency to a computer and judge its activities as if they were undertaken by humans. The standard objection is that machines don’t think or create like humans – they just do what we tell them to do. And until very recently, it was easy to see computers as sophisticated tools subservient to human agency, regurgitating pre-loaded content and crunching numbers. Today, though, we converse with chatbots the way we would with a research or coding assistant, and we direct image generators the way art directors guide human illustrators and graphic designers.
Much as it discomfits us, generative AI learns and, at some level, “thinks.” Trained on a significant slice of human knowledge, ChatGPT aced the “Turing test” – the famous measure of a machine’s ability to exhibit human-like intelligent behavior – the day it was released. Since then, chatbots have passed the bar and medical licensing exams, solved long-standing math conundrums, and written more empathetic responses to patient questions than their doctors. They even outperform humans on tests of creativity, and it is precisely to encourage creativity that copyright laws exist.
It is because humans develop and benefit from generative AI that we must ask if there is any legal basis for treating AI differently under copyright law. Humans read books and newspapers to learn, to become more informed, and to become better writers. Humans bring sketchbooks to museums and record their impressions of works they see, improving artistic skills and broadening stylistic repertoires.
And humankind benefits by enriching the cognitive and creative capacities of generative AI: the better trained these systems are, the more they boost our own capabilities, as long as we don’t forget how to think for ourselves. We expect doctors to keep up with the medical literature and lawyers to read the latest cases, so if we value the assistance AI provides, we should want to see it exposed to the broadest possible swath of human understanding.
Treating AI training as we would human learning and skill development tips the social balance, in my view, in favor of supporting AI training. But law and policy are different things – will courts side with copyright owners or with AI developers? The answer turns on the concept of “fair use,” which is how copyright law mediates between the rights of content owners and users. Fair use allows content to be exploited without permission for purposes like criticism, commentary, news reporting, teaching, scholarship, or research. Deciding which activities qualify as fair use is complicated by the territorial nature of copyright law: each country’s law applies only within its own borders, and different countries interpret and enforce it differently. Treaties such as the Berne Convention aim to harmonize copyright laws across countries but don’t eliminate the differences. Israel and Japan, for example, have more permissive notions of fair use than the U.S., which in turn is more permissive than Europe.
In most countries the jury is still out, so to speak, on whether the use of copyrighted material for AI training qualifies as fair use. The issue is contested in the U.S. (and the courts, not an executive order, will have the last word), ambiguous in China, and may be permissible only for non-commercial purposes in Europe. Israel’s Ministry of Justice has come down on the side of AI developers. Japan, one of the most AI-friendly countries, gives explicit legal sanction to use of copyrighted materials for “information analysis,” which includes training AI models.
An AI model trained in a permissive country can probably be used in any other country, even if the training would have violated copyright law in that country. Permissive countries, therefore, may grow popular as AI “safe harbors” if U.S. or Chinese courts decide against the tech titans.
A separate question, but one that also goes to the broader relevance of copyright law to generative AI, is whether what’s produced by a legitimately trained AI system might infringe someone’s copyright – e.g., if the literal content of works used during training leaks into the output. In its lawsuit, The New York Times cited instances of verbatim copying of its content by ChatGPT. Depending on how much was copied, those specific instances could represent copyright infringement regardless of whether the culprit is human or machine. (OpenAI, the proprietor of ChatGPT, insists such cases are rare and suggests its chatbot may have been tricked into copying.)
Artists have a tougher case to make because style has never been protectable by copyright. Today, anyone is free to hire an artist to create a work in the style of another artist. That may be disreputable but, as long as no specific work by the other artist is copied, it isn’t a copyright violation.
Returning to the original question concerning the legality of training LLMs, there seems little risk that copyright will halt or hobble the steady march of generative AI systems into every aspect of our lives. Even the inconvenience of different national copyright laws and policies can be sorted out by the market: large cloud-based hosting platforms, for example, might offer AI customers the option of training their systems on servers physically located in safe-harbor countries. Big Tech, of course, can afford to train AI models wherever it wants.
That alone makes it difficult to see how the training wars end well for Big Media. But the underlying economics are even harsher. Professionally written and edited content is more valuable for training LLMs, but these systems are so data-hungry that even the complete archives of any single content source will represent a tiny fraction of the training corpus; indeed, just avoiding various kinds of bias requires balancing professionally produced copy with informal or user-generated content (e.g., blogs, social media posts) so the LLM will recognize conversational styles, colloquialisms, and cultural trends. How much, really, are The New York Times archives worth to OpenAI?
The economics of LLMs may also prove harsh. It’s been estimated that generative AI can perform tasks associated with more than 80% of jobs in the U.S. But a recent report issued by the U.S. Bureau of Labor Statistics predicts that employment in professional, scientific and technical services will rise 10% over the next eight years notwithstanding the impact of generative AI. Right now, the vast majority of users pay little or nothing to use LLMs. The spectacular emergence of DeepSeek not only gives users a new choice; it also foreshadows many further new choices to come. With uncertain business demand and no shortage of LLM supply, how much will Big Tech ultimately profit from generative AI?
Should Big Media plaintiffs in the training lawsuits perceive business realities and legal uncertainties as working against them, they’ll settle for short money and lawyers will fret that the copyright picture remains murky. It won’t matter. The business of generative AI will easily flow around legal obstacles, and the overall economic picture may remain murkier than the legal one for some time.
(A different version of this post appeared in AI in Plain English.)