Local LLM qwen3-coder:30b Scores 22.8 vs Claude's 89.4 in Real Agent Benchmark
A developer benchmarked qwen3-coder:30b against Claude by replaying 27 real historical tasks through Jarvis, a personal AI agent built on LangGraph with roughly 90 tools covering email, calendar, files, and code. Claude averaged a quality score of 89.4 out of 100 while qwen3-coder:30b averaged just 22.8, underperforming across all seven task categories. The local model was approximately 5,150 times cheaper per task, costing $0.00015 in GPU electricity versus $0.763 in API fees for Claude. qwen3-coder:30b also showed reliability issues, leaking malformed tool-call tags in 26% of responses and selecting the correct tools only 14.8% of the time. The author notes a potential self-preference bias since a Claude model was used as the judge, but argues it does not account for the 66-point quality gap or the high malformed-output rate.
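The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded figures quoted in this summary; the variable names are illustrative and not taken from the original article. With these rounded inputs the cost ratio comes out near 5,087 rather than 5,150, so the reported ratio was presumably computed from unrounded per-task costs (an assumption, since the summary does not say). The percentages are also consistent with whole-task counts out of 27: about 4 tasks with correct tool selection and about 7 responses with leaked tags.

```python
# Sanity check of the figures quoted in the summary.
# All inputs are the rounded values from the text; names are illustrative.

TASKS = 27

claude_score = 89.4          # average quality score out of 100
local_score = 22.8           # qwen3-coder:30b average quality score

claude_cost_usd = 0.763      # API fees per task
local_cost_usd = 0.00015     # GPU electricity per task

malformed_rate = 0.26        # share of responses leaking tool-call tags
correct_tool_rate = 0.148    # share of tasks with correct tool selection

# Quality gap in points between the two models.
quality_gap = claude_score - local_score

# How many times cheaper the local model is, using the rounded costs.
cost_ratio = claude_cost_usd / local_cost_usd

# Convert the percentages back into approximate task counts.
malformed_tasks = round(malformed_rate * TASKS)
correct_tool_tasks = round(correct_tool_rate * TASKS)

print(f"Quality gap: {quality_gap:.1f} points")
print(f"Cost ratio (rounded inputs): {cost_ratio:.0f}x")
print(f"Responses with malformed tags: about {malformed_tasks} of {TASKS}")
print(f"Tasks with correct tools: about {correct_tool_tasks} of {TASKS}")
```
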
This is an AI-generated summary. ShortSingh links to the original source for the complete article.