<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Benchmarks on korchasa@*ops</title><link>https://korchasa.dev/tags/benchmarks/</link><description>Recent content in Benchmarks on korchasa@*ops</description><generator>Hugo</generator><language>en</language><lastBuildDate>Fri, 12 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://korchasa.dev/tags/benchmarks/index.xml" rel="self" type="application/rss+xml"/><item><title>Yet another batch of meaningless autonomous agent benchmarks :)</title><link>https://korchasa.dev/posts/2026_06_12_agents_comparison_benchmarks/</link><pubDate>Fri, 12 Jun 2026 00:00:00 +0000</pubDate><guid>https://korchasa.dev/posts/2026_06_12_agents_comparison_benchmarks/</guid><description>&lt;p&gt;&lt;a href="https://github.com/korchasa/flowai-experiments/tree/main/agents-comparison" rel="nofollow noopener noreferrer external"&gt;https://github.com/korchasa/flowai-experiments/tree/main/agents-comparison&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://korchasa.dev/posts/2026_06_12_agents_comparison_benchmarks/image.png" alt="Illustration: autonomous agent benchmarks"&gt;&lt;/p&gt;
&lt;p&gt;I burned 40% of my limit running benchmarks that resemble my real tasks on opus, fable, and gpt-5.5. All of it was fully autonomous agent work: generating an app from scratch, a project audit, and three implementation tasks of varying difficulty.&lt;/p&gt;
&lt;p&gt;What can be said with at least some confidence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fable beats opus-4.8 and gpt-5.5 on result quality. My working hypothesis: fable medium = opus xhigh.&lt;/li&gt;
&lt;li&gt;opus xhigh is unexpectedly expensive because its reasoning runs overly long. Sometimes it costs more than fable.&lt;/li&gt;
&lt;li&gt;Looks are still a pain. Everything comes out in the same dark neon style.&lt;/li&gt;
&lt;li&gt;Proper testing will take 1-2 weekly limits on claude x20.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hypotheses:&lt;/p&gt;</description></item></channel></rss>