Yet another batch of meaningless autonomous agent benchmarks :)

Fri, 12 Jun 2026 00:00:00 +0000

https://github.com/korchasa/flowai-experiments/tree/main/agents-comparison

Burned 40% of my limit running benchmarks close to my real tasks on opus/fable/gpt-5.5. All of it was fully autonomous agent work: generating an app from scratch, a project audit, and three implementation tasks of varying difficulty.

What can be said with at least some confidence:

fable beats opus-4.8 and gpt-5.5 on result quality. My working hypothesis: fable medium = opus xhigh.
opus xhigh is unexpectedly expensive because of overly long reasoning, sometimes even more expensive than fable.
Visual design is still a pain: every generated UI comes out in the same dark neon style.
Proper testing would take 1-2 weekly limits on claude x20.

Hypotheses:

Benchmarks on korchasa@*ops
