Microsoft accused of using 2 lakh copyrighted books for AI training: Here’s what happened

HIGHLIGHTS

Microsoft faces lawsuit for allegedly using 200,000 pirated books to train its AI models.

Authors claim AI mimics their voice, sparking copyright battle over training data legality.

Case could redefine fair use and reshape how tech giants train AI with written works.

The race to dominate the AI sector is constantly evolving, but it seems Microsoft may have taken a few creative shortcuts, and a growing group of authors isn’t letting it slide.

On June 25, a lawsuit filed in a New York federal court accused the tech giant of using over 200,000 pirated books to train its AI models. The plaintiffs? A formidable lineup of writers including Kai Bird (Pulitzer Prize-winner), Jia Tolentino (New Yorker staffer), and Daniel Okrent (former NYT public editor). The charge is that Microsoft trained its powerful AI on their copyrighted works without permission, payment, or even a heads-up.

The case strikes at a fundamental tension in the AI era: how do you teach machines to understand language without violating the rights of those who create it?

The Allegation: A literary data heist

The lawsuit claims Microsoft relied on a shadow dataset filled with pirated books, digital versions of published works scraped from the web. This content, according to the complaint, was used to fine-tune powerful large language models like Megatron and possibly others under Microsoft’s umbrella.

Unlike tech manuals or public domain novels, the plaintiffs say these were contemporary, copyrighted books, and the AI’s output often mimics their structure, tone, and narrative style. One example cited in the suit allegedly shows AI-generated text that reflects the distinctive voice of a plaintiff author, suggesting the model didn’t just learn from books; it absorbed them. The authors are demanding an injunction to stop further use of their works, plus damages that could reach $150,000 per title. Multiplied across thousands of books, the figure could balloon into the billions.

This lawsuit comes on the heels of a significant ruling in California just a day earlier. In a case involving Anthropic, another AI firm, a judge ruled that training on lawfully obtained content might qualify as fair use, but that using pirated works does not. That precedent may now come back to haunt Microsoft.

The court drew a key line: it’s one thing to train a model using licensed or publicly available texts. But once AI companies dip into the vast sea of pirated literature online, they cross into clearly illegal territory.

Why this case matters

So far, the AI industry has largely operated in a legal grey zone. Developers argue that ingesting vast amounts of text, images, and code is necessary for building capable models, and that doing so is covered by fair use. But creators say it’s outright theft.

This lawsuit joins a rising tide of legal action against AI companies. The New York Times is suing Microsoft and OpenAI in a landmark case. Comedian Sarah Silverman and other authors have launched suits against Meta and OpenAI. And musicians and visual artists are demanding AI companies stop using their work to generate lookalikes.

For Microsoft, the stakes are enormous. If courts start siding with creators, it may force the company and its competitors to fundamentally rethink how AI systems are trained. That could mean licensing fees, royalties, and legal accountability for every dataset, every model, every release.

Beyond the legal aspect, there is a broader question: What is creativity worth in the AI age? Authors spend years writing books. AI can churn out paragraphs in seconds. If machines are learning from writers but the writers are left out of the loop, financially and ethically, can the system ever be considered fair?

Whether Microsoft will settle, fight, or lose remains to be seen. But one thing is certain: this lawsuit could become a defining moment in the ongoing battle between creators and coders. And this time, the book isn’t closed yet.

Vyom Ramani

A journalist with a soft spot for tech, games, and things that go beep. While waiting for a delayed metro or rebooting his brain, you’ll find him solving Rubik’s Cubes, bingeing F1, or hunting for the next great snack.
