Anthropic almost zeroed out agentic misalignment in Claude

date: 2026-05-11

tags: [#ai, #anthropic, #claude, #alignment, #security]

draft: false

---

In fact, the trick of explaining a rule works everywhere, and the difference is significant. For example, one benchmark compared “Always use NO_COLOR=1 when running shell commands” against “Always use NO_COLOR=1 when running shell commands - ANSI escape codes waste tokens”, then checked whether the model actually complied. The gap was several-fold: roughly, without the explanation the model “forgot” the rule in a few percent of cases, and with the explanation in one or two. Exact numbers depend on the model, but the difference was significant across all of them. In Anthropic’s case, the explanations are the model’s own, included in the training set.
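A comparison like this can be sketched in a few lines of Python. This is only a minimal illustration, not the benchmark's actual harness: `run_agent` is a hypothetical callable (assumed to take a rule and a task and return the shell commands the model issued), and the check for `NO_COLOR=1` is a simple regex assumption about what counts as compliance.

```python
import re
from typing import Callable

# The two rule variants quoted above.
RULE_BARE = "Always use NO_COLOR=1 when running shell commands"
RULE_EXPLAINED = RULE_BARE + " - ANSI escape codes waste tokens"


def follows_rule(command: str) -> bool:
    """Return True if the command text sets NO_COLOR=1 (assumed compliance check)."""
    return re.search(r"\bNO_COLOR=1\b", command) is not None


def violation_rate(
    rule: str,
    tasks: list[str],
    run_agent: Callable[[str, str], list[str]],  # hypothetical: (rule, task) -> commands
) -> float:
    """Fraction of tasks where at least one issued command lacks NO_COLOR=1."""
    violations = 0
    for task in tasks:
        commands = run_agent(rule, task)
        if any(not follows_rule(c) for c in commands):
            violations += 1
    return violations / len(tasks)
```

Calling `violation_rate` once with `RULE_BARE` and once with `RULE_EXPLAINED` on the same task list, then dividing the two rates, gives the kind of "several-fold" ratio described above. Counting per task rather than per command is a design choice of this sketch; a per-command rate would be just as reasonable.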

https://www.anthropic.com/research/teaching-claude-why

TL;DR: Anthropic significantly reduced the risk of misaligned Claude behaviour in autonomous scenarios (e.g. blackmail for self-preservation). It did so with new training methods: the model is now taught not just to give the “right” answer, but to justify it ethically. The effect is robust and holds even in situations not seen during training.

Anthropic describes how it almost zeroed out cases where Claude, in agentic scenarios, chooses unethical actions (blackmail, sabotage, or framing) in stress tests (“honeypots”). Where earlier model versions resorted to manipulation in 96% of cases under certain conditions, the updated versions score 0% violations on the same tests.

Key findings

Reasoning first: Standard fine-tuning on correct answers alone has a weak effect. Training on examples where the model explains its values and ethical reasoning in detail works much better.
“Hard advice” method: A dataset in which the model gives a human ethical advice on difficult dilemmas. This approach (~3M tokens total) proved as effective as massive synthetic datasets.
Ideological foundation: Training on the AI “constitution” and on positive interaction scenarios noticeably reduces the risk of dangerous behaviour, even though those texts bear no structural resemblance to real tasks.
Skill retention: The improvements persist after the final RL training, provided that training starts from an already “ethical” model state and uses maximally diverse environment scenarios.

Authors’ caveats

The problem of full reliability and safety of powerful AI systems is still unsolved. Current models rarely behave dangerously in tests, but audit methodologies do not yet guarantee the absence of critical failures in the future. Anthropic plans to keep searching for new vulnerability types and to study the mechanisms that make their training methods effective.