
It is no secret that training large language models requires vast amounts of data for the models to learn language patterns. Yet most of the content created in the past 60 years remains under the copyright of its original authors or their employers. Consequently, many AI companies have obtained training material from unauthorized sources. While the resulting lawsuits have been underway for some time, we can now see the first results.
With each loss, the survival of AI and the companies behind it becomes more questionable. Users thus have to decide whether they can trust companies that ignore copyright, or whether we should strive to build a more open and transparent future.
AI’s Copyright Problem
In 2024, Microsoft’s AI chief made a bold statement, asserting that any content on the Internet is fair game for AI. Even before that claim was issued, most AI companies had downloaded terabytes of data to use for training their models. Consequently, Microsoft, GitHub, and OpenAI are facing multiple lawsuits related to AI training data. The accusations range from breach of contract to copyright violations, underscoring the intricate nature of the issue. The plaintiffs range from individual open-source developers to the New York Times.
The issue is especially galling because companies like Microsoft ruthlessly enforce copyright law around their own products. After all, most of Microsoft’s products, like Windows and Office, are also available on the Internet in some form, protected by lengthy and complex terms and conditions. Yet using them without paying will get you in hot water, as DeepSeek might soon experience.
Yet the copyright problem goes beyond the legal issue; it touches the very fabric of our society. Our social contract has always defined stealing as illegal. Even supposedly “free” data comes with stipulations: you agree to watch an ad to view a video, see a popup in a news feed, or share your changes as open source when using free software. AI does none of this, and it gives creators nothing in return for the content it consumes. Consequently, AI erodes the livelihoods of researchers, professionals, and journalists to boost Big Tech’s bottom lines.
Creators’ Response to Copyright Violations
Legal action was the initial step against AI developers. Unfortunately, Big Tech is fighting back viciously, sometimes with questionable claims. Arguing that AI is just “learning” like humans negates the complex processes within us and puts us on the same level as a profit-generating product.
Some creators have also reduced the content they publish openly. From CNN’s new paywall to more frequent captchas and e-mail signup requests, we are seeing changes in how content is presented and monetized on the Internet.
To AI-proof their sales pipelines, many companies outside the news business have moved away from public thought leadership, migrating whitepapers and blog posts into their corporate newsletters. While that might protect sales and marketing, it reduces the amount of freely available material and the collaboration that builds on free information flows.
The Latest Copyright Case Developments
The latest decisions on AI copyright might change the debate again. Notably, the U.S. District Court for the District of Delaware in Thomson Reuters v. Ross Intelligence found that using copyrighted materials to train an AI system constitutes direct copyright infringement, rejecting the defendant’s fair-use defense. Consequently, AI developers will find it harder to claim the fair-use exception when training AI, particularly when the AI system competes directly with the original work.
Similarly, in Intercept Media Inc. v. OpenAI Inc., the court allowed copyright claims to proceed against OpenAI. Intercept Media alleges that OpenAI removed copyright management information from news articles used to train its models.
On the other side of the copyright issue, the U.S. Copyright Office keeps refining which claims AI companies and users can make to protect AI-generated works. The agency has reinforced the principle that only works created by humans are eligible for copyright protection. Purely AI-generated works, lacking meaningful human input, are not copyrightable. Only if the human contribution is substantial and demonstrable can a work be protected by copyright.
These decisions do not end the struggle between AI companies and copyright holders. Yet they represent a significant shift in the balance of power between the two groups, and we will likely see more nuanced verdicts in the future.
A Path Forward
The recent AI copyright verdicts show that AI developers are building on a precarious foundation. One adverse judgment and the whole house of cards might collapse. This risk represents a significant liability for companies building on the APIs of OpenAI, Meta, Google, and others.
Going forward, we need a knowledge repository that is free of copyrighted material and free to use for genuinely open-source AI. Such a repository would allow us to build software and tools that carry no risk of copyright litigation while remaining verifiable and reproducible. It would give us an AI that serves creators and the wider public instead of sucking the lifeblood out of everyone.