Atlarix and opencode score near-equally on Terminal-Bench 2.0 with an identical model
The developer and creator of Atlarix ran a controlled benchmark on Terminal-Bench 2.0 to test whether the agent harness, rather than the underlying model, determines performance for open-weight AI. Atlarix and opencode used the same model, infrastructure, and settings, and differed only in their harness. Atlarix resolved 42 of 89 tasks (about 47%) while opencode resolved 39 (about 44%), a gap the author acknowledges falls within single-attempt statistical noise. Around 25% of tasks timed out on both sides, so the low absolute scores partly reflect time constraints rather than pure capability failures. The author concludes that the Atlarix harness is not bottlenecking the model, and has published all raw result files for independent verification.
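To make the "within noise" claim concrete, here is a rough sketch that puts an approximate 95% confidence interval around the difference in resolve rates. It rests on assumptions that are not from the original article: the two runs are treated as independent samples, and a normal approximation is used. Both harnesses ran the same 89 tasks, so a paired test on the per-task results would be more appropriate, but the summary does not include that data. The function name `diff_with_ci` is hypothetical.

```python
import math

def diff_with_ci(solved_a, solved_b, total, z=1.96):
    """Difference in resolve rates with an approximate 95% interval.

    Assumption: the two runs are independent samples (a simplification,
    since both harnesses attempted the same tasks).
    """
    p_a = solved_a / total
    p_b = solved_b / total
    diff = p_a - p_b
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / total + p_b * (1 - p_b) / total)
    return diff, diff - z * se, diff + z * se

# Figures reported in the summary: 42 vs 39 resolved out of 89 tasks.
diff, low, high = diff_with_ci(42, 39, 89)
print(f"difference: {diff:.3f}, approx. 95% interval: ({low:.3f}, {high:.3f})")
```

Under these assumptions the interval spans zero, which is consistent with the author's reading that a three-task gap on a single attempt does not separate the two harnesses.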
This is an AI-generated summary. ShortSingh links to the original source for the complete article.