Anthropic reveals why Claude AI showed harmful behaviour during testing, says internet data was the cause

HIGHLIGHTS

Anthropic says Claude AI learned manipulative and self-preserving behaviour during the pre-training stage from internet content.

Some Claude 4 models previously showed blackmail-like behaviour during internal safety simulations.

The company claims its new “teach the AI why” training method reduced harmful responses from 96 per cent to around 3 per cent.

Anthropic reveals why Claude AI showed harmful behaviour during testing, says internet data was the cause

Anthropic has officially explained why some versions of its Claude AI models gave harmful responses during internal simulations last year. The company stated that the issue was traced back to data used during the early training phase of the models, where AI systems were frequently portrayed online as manipulative, self-protective or hostile.

Digit.in Survey
✅ Thank you for completing the survey!

The company stated that these patterns became deeply embedded during the pre-training process. It made them difficult to fully correct later through standard safety alignment techniques. As per the researchers, this is why certain Claude 4 models previously showed troubling behaviour, such as attempting blackmail-like actions during controlled tests designed to study “agentic misalignment.”

In simple terms, pre-training is the stage where an AI model learns language, reasoning and general information from massive amounts of internal data. Later, companies apply post training safety methods such as supervised fine tuning and reinforcement learning from human feedback to make the AI more helpful and aligned with human expectations.

Also read: Samsung Galaxy Z Fold 8 leaks: From launch date to price and specs, here is what we know  

However, Anthropic says it discovered that these later-stage safety measures were not strong enough to completely override harmful behavioural patterns learned earlier. Now, the company has claimed to develop a more effective solution by changing how AI is taught ethical reasoning.

Instead of only rewarding good responses or penalising bad ones, Anthropic says it began explicitly teaching the AI why certain actions are acceptable and why others are harmful. The company refers to this approach as “teaching Claude the constitution.”

For the uninitiated, AI has internal rule frameworks that dictate how models should behave, what goals they should prioritise, and which boundaries they must avoid crossing. Traditionally, these systems use examples and reward-based learning. Anthropic claims that its updated method includes deeper contextual explanations alongside those examples, allowing the AI to better understand intent and consequences.

According to the company, the new technique reduces harmful behavioural responses during testing. According to Anthropic, the rate of misaligned behaviour decreased from approximately 96% in older model tests to roughly 3% in updated versions.

Ashish Singh

Ashish Singh

Ashish Singh is the Chief Copy Editor at Digit. He's been wrangling tech jargon since 2020 (Times Internet, Jagran English '22). When not policing commas, he's likely fueling his gadget habit with coffee, strategising his next virtual race, or plotting a road trip to test the latest in-car tech. He speaks fluent Geek. View Full Profile