
Local LLM qwen3-coder:30b Scores 22.8 vs Claude's 89.4 in Real Agent Benchmark

A developer benchmarked qwen3-coder:30b against Claude by replaying 27 real historical tasks through Jarvis, a personal AI agent built on LangGraph with roughly 90 tools covering email, calendar, files, and code. Claude averaged a quality score of 89.4 out of 100 while qwen3-coder:30b averaged just 22.8, underperforming across all seven task categories. The local model was approximately 5,150 times cheaper per task, costing $0.00015 in GPU electricity versus $0.763 in API fees for Claude. qwen3-coder:30b also showed reliability issues, leaking malformed tool-call tags in 26% of responses and selecting the correct tools only 14.8% of the time. The author notes a potential self-preference bias since a Claude model was used as the judge, but argues it does not account for the 66-point quality gap or the high malformed-output rate.

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.

Related stories

Programming · DEV Community

Best UI Kits for Chrome Extensions in 2026: Why Web App Rankings Don't Apply

A developer at ExtensionBooster tested and ranked the top UI kits specifically for Chrome extensions, finding that popular choices like MUI, Chakra, and Mantine are optimised for standard web apps and often fail in extension environments. Chrome's Manifest V3 enforces strict Content Security Policies that block runtime CSS-in-JS libraries such as Emotion and styled-components from rendering correctly. Content scripts injected into third-party pages also face CSS bleed issues, making Shadow DOM compatibility a critical factor when choosing a UI kit. Bundle size is another key concern, as large component libraries can slow down a popup's first paint for end users. The author rebuilt the same extension popup, options page, and content-script overlay across multiple kits to produce a ranking tailored to these four extension-specific constraints.

Programming · DEV Community

Cart Timer Killing Live Payments? The Fix Is Simpler Than You Think

A recurring bug across e-commerce platforms causes customers to lose their orders when a cart reservation timer expires mid-payment, even though their bank transaction succeeds. The timer exists to manage inventory contention between undecided customers, but that logic becomes irrelevant once a buyer enters the payment flow. Engineers who treat a zero timer as an unconditional release trigger are following system rules correctly yet ignoring the real-world outcome for the paying customer. Developers argue that a successful payment authorization should always override an expired hold, since rejecting an already-approved charge forces a refund cycle that damages customer trust without protecting any other buyer. The one true exception is genuine oversell, where another customer completed the purchase first; this should be handled as an inventory failure with an instant refund, not framed as the customer's error.
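
The rule described above can be sketched as a small decision function. This is only an illustration of the stated priority order (authorized payment first, expired timer second), not code from the original article; `CartState`, `Outcome`, and every field name are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    CONFIRM_ORDER = "confirm_order"      # honor the paid order
    REFUND_OVERSOLD = "refund_oversold"  # inventory failure, refund at once
    RELEASE_HOLD = "release_hold"        # timer expired, nobody paid
    KEEP_HOLD = "keep_hold"              # timer still running


@dataclass
class CartState:
    # All fields are hypothetical names chosen for this sketch.
    hold_expired: bool        # the reservation timer reached zero
    payment_authorized: bool  # the bank approved the charge
    units_unsold: int         # stock not yet sold to another buyer
    units_requested: int      # quantity in this cart


def resolve_cart(state: CartState) -> Outcome:
    # An authorized payment is checked first, so an expired hold
    # can never cancel an order that has already been paid for.
    if state.payment_authorized:
        if state.units_unsold >= state.units_requested:
            return Outcome.CONFIRM_ORDER
        # Genuine oversell: someone else completed the purchase first.
        return Outcome.REFUND_OVERSOLD
    # Only unpaid carts are subject to the timer.
    if state.hold_expired:
        return Outcome.RELEASE_HOLD
    return Outcome.KEEP_HOLD
```

Checking `payment_authorized` before `hold_expired` is the whole point of the sketch: the timer only ever decides the fate of carts that have not been paid for.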

Programming · DEV Community

Docker Healthchecks Confirm Process Response, Not Application Health

Docker's HEALTHCHECK instruction periodically runs a command inside a container and marks it healthy or unhealthy based solely on the exit code returned. The mechanism does not read response bodies, parse JSON, or verify whether dependencies like databases or queues are functioning correctly. Most common implementations simply confirm that a server process is accepting TCP connections, which can mask deeper failures such as exhausted connection pools or expired API tokens. When teams equate a green 'healthy' status with full application health, they tend to overlook logs, metrics, and other diagnostic signals. The real risk lies not in what the healthcheck measures, but in the broader assumptions developers and operators build around its limited output.

Programming · DEV Community

Docker healthchecks only check process response, not real app health

Docker's HEALTHCHECK instruction runs a periodic command inside a container and marks it healthy or unhealthy based solely on the exit code returned. The mechanism does not read response bodies, interpret JSON, or verify whether dependencies like databases or queues are actually functioning. A common pattern — a curl call to an endpoint that returns a static 200 OK — only confirms the process is accepting TCP connections, not that business logic is working. This gap becomes costly when teams treat a green 'healthy' status as a full signal of application health, reducing attention to logs and richer metrics. Developers bear full responsibility for implementing endpoints that genuinely probe critical dependencies if they want the healthcheck to reflect real application state.
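
Because Docker looks only at the exit code, any reading of the response body has to happen inside the command itself. The sketch below shows one possible way to do that in Python. It assumes a hypothetical health endpoint that returns a JSON object mapping each dependency to a status string, such as `{"database": "ok", "queue": "ok"}`; that report shape and the function names are illustrative and do not come from the original article.

```python
import json
import urllib.request

# Exit codes Docker's HEALTHCHECK interprets: 0 = healthy, 1 = unhealthy.
HEALTHY, UNHEALTHY = 0, 1


def exit_code_for(body: bytes) -> int:
    """Map a JSON health report to a HEALTHCHECK exit code.

    Assumed (hypothetical) report shape: {"database": "ok", "queue": "ok"}.
    """
    try:
        report = json.loads(body)
    except ValueError:
        # Not valid JSON (or not decodable text): treat as unhealthy.
        return UNHEALTHY
    if not isinstance(report, dict) or not report:
        # An empty or non-object report proves nothing about dependencies.
        return UNHEALTHY
    all_ok = all(status == "ok" for status in report.values())
    return HEALTHY if all_ok else UNHEALTHY


def probe(url: str, timeout: float = 2.0) -> int:
    """Fetch the health report and turn it into an exit code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return exit_code_for(resp.read())
    except Exception:
        # Connection refused, timeout, HTTP error status, and so on.
        return UNHEALTHY
```

A script used as the healthcheck command would end with something like `sys.exit(probe("http://localhost:8080/health"))`, where the URL is again a placeholder. The check is only as good as the endpoint behind it: if the endpoint reports `"ok"` without actually querying the database or queue, the exit code is still a static signal.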