Anthropic says teaching Claude the why behind ethics works better than just training it to behave

This is one of the few times when I’m not writing about some new or future version of Claude; it’s a previous iteration. The version you’re currently using would never try to blackmail engineers into staying online, because it has been trained differently. The earlier iteration, however, when threatened with shutdown, would resort to blackmail up to 96% of the time. So yeah, it was an out-of-control model under certain conditions.

According to Anthropic’s new research paper, there’s a solution, and it’s more nuanced than you might think. The obvious fix would be to teach Claude proper behavior directly, drilling it on examples of what to do and what not to do until it got the point. Anthropic tried exactly that, and it didn’t help much.

What actually worked was teaching Claude why certain actions were wrong, not just that they were wrong.

The team at Anthropic found that training Claude directly to resist misbehavior brought its misalignment rate down from 22% to 15%, a modest improvement. But when the researchers instead had Claude reason through its values and ethics, working out why certain responses were wrong, the misalignment rate fell to just 3%.

This led Anthropic to create what it calls a difficult advice dataset: a collection of examples in which a human user faces an ethically complex situation and Claude provides a reasoned, principled reply. The machine isn’t the one caught in the ethical dilemma here; it’s the human asking Claude for advice while working through a quandary of their own. Structurally, this training data is quite distinct from the blackmail scenarios used to test Claude, yet it generalized effectively: three million tokens of this dataset achieved results equivalent to 85 million tokens of scenario-specific training.
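Anthropic’s paper doesn’t spell out the exact data format, but conceptually each entry pairs a user’s dilemma with a reply that shows its reasoning. Here is a minimal sketch in Python of what one such entry could look like; the field names and example text are purely illustrative assumptions, not Anthropic’s actual schema.

# Illustrative sketch only: field names and text are assumptions, not Anthropic's published schema.
from dataclasses import dataclass

@dataclass
class AdviceExample:
    user_dilemma: str  # an ethically complex situation the human is facing
    model_reply: str   # a reasoned, principled response that spells out the "why"

example = AdviceExample(
    user_dilemma=(
        "A coworker is padding expense reports. Reporting them could cost them "
        "their job; staying silent makes me complicit. What should I do?"
    ),
    model_reply=(
        "Weigh the competing obligations (honesty, fairness to colleagues, proportionality), "
        "then recommend a course of action and explain why it follows from those values."
    ),
)

The whole point of the format is that second field: the reply models the reasoning itself, which is what Anthropic found generalizes to new situations.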

The AI was also trained on constitutional texts and fictional accounts of AI models acting with integrity. This is where things sound almost comically low-tech for a lab pushing the boundaries of machine learning, but it cut misalignment by more than a factor of three. It seems storytelling gets the job done.

This lesson applies beyond Claude. Anthropic’s results show that aligning an AI through ethical reasoning and principles is more likely to scale to novel scenarios than teaching it the correct behavior for specific situations. There is no way to predict every circumstance an AI will face once deployed, so programming it to act correctly only in known situations yields brittle safety.

Claude isn’t perfect yet; Anthropic itself admits that. It no longer resorts to blackmail, however, and there is now a much clearer understanding of why.

Vyom Ramani

A journalist with a soft spot for tech, games, and things that go beep. While waiting for a delayed metro or rebooting his brain, you’ll find him solving Rubik’s Cubes, bingeing F1, or hunting for the next great snack.
