anthropic almost zeroed out agentic misalignment in claude

Mon, 11 May 2026 00:00:00 +0000

Actually, the trick of explaining the rule works everywhere. And the difference is significant. For example, in a benchmark comparing “Always use NO_COLOR=1 when running shell commands” against “Always use NO_COLOR=1 when running shell commands — ANSI escape codes waste tokens”, with follow-up verification, the difference was several-fold. Roughly, without the explanation the model “forgot” in a few percent of cases, with the explanation — in one or two. Exact numbers depend on the model, but the difference was significant across all of them. In this case, such explanations are the model’s own explanations in the training set.

Alignment on korchasa@*ops

anthropic almost zeroed out agentic misalignment in claude