<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Alignment on korchasa@*ops</title><link>https://korchasa.dev/tags/alignment/</link><description>Recent content in Alignment on korchasa@*ops</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 11 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://korchasa.dev/tags/alignment/index.xml" rel="self" type="application/rss+xml"/><item><title>anthropic almost zeroed out agentic misalignment in claude</title><link>https://korchasa.dev/posts/2026_05_11_anthropic_agentic_misalignment_reduction/</link><pubDate>Mon, 11 May 2026 00:00:00 +0000</pubDate><guid>https://korchasa.dev/posts/2026_05_11_anthropic_agentic_misalignment_reduction/</guid><description>&lt;p&gt;Actually, the trick of explaining the rule works everywhere, and the difference is significant. For example, in a benchmark comparing &amp;ldquo;Always use &lt;code&gt;NO_COLOR=1&lt;/code&gt; when running shell commands&amp;rdquo; against &amp;ldquo;Always use &lt;code&gt;NO_COLOR=1&lt;/code&gt; when running shell commands — ANSI escape codes waste tokens&amp;rdquo;, with follow-up verification, the difference was several-fold. Roughly, without the explanation the model &amp;ldquo;forgot&amp;rdquo; the rule in a few percent of cases; with the explanation, in only one or two. Exact numbers depend on the model, but the difference was significant across all of them. In this case, the explanations in question are the model&amp;rsquo;s own, included in the training set.&lt;/p&gt;</description></item></channel></rss>