Engineer runs 10-day experiment coding entirely on tiny local AI models


A software developer spent ten days testing whether small local AI models, specifically a 2-billion-parameter Gemma model running on a Jetson Orin Nano, could replace cloud-based coding assistants like Claude Code. Roughly 60% of early failures turned out to come from the harness discarding correct code because its indentation was broken, not from the model being incapable. Fixing that single parsing issue raised the benchmark score from 64 to 76 out of 100. The developer also found that small models perform far better on bounded, slot-filling tasks than on open-ended planning, and that self-review loops, in which the model judges its own output, degraded performance at this scale. The findings suggest that thin tooling around small models, rather than the models themselves, is often the primary bottleneck in agentic coding tasks.
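As a rough illustration of the indentation problem, a harness can try to repair leading whitespace before rejecting a reply. The sketch below is not the developer's harness: `recover_code` is a hypothetical helper, and it assumes the model is emitting Python and that the damage is limited to a uniform extra indent or stray tabs.

```python
import ast
import textwrap


def recover_code(raw: str):
    """Return parseable Python source from a model reply, or None.

    Hypothetical helper: tries the text as-is, then with the common
    leading indent removed, then with tabs expanded before dedenting.
    """
    candidates = [
        raw,
        textwrap.dedent(raw),
        textwrap.dedent(raw.expandtabs(4)),
    ]
    for candidate in candidates:
        candidate = candidate.strip("\n")
        try:
            ast.parse(candidate)  # IndentationError is a SyntaxError subclass
        except SyntaxError:
            continue
        return candidate
    return None  # nothing parsed; the caller decides whether to retry
```

A reply such as `"    def add(a, b):\n        return a + b\n"` fails a strict parse but is recovered here, so the harness would score the model on the code rather than on its whitespace.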

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.


Minecraft's anvil 'Too Expensive' error is a tree optimization problem in disguise

A software developer discovered that Minecraft's anvil system punishes players not for what enchantments they add, but for the order in which they combine items. Two core rules drive the cost: enchantments carry a base level price, and each time an item passes through an anvil it accumulates an exponentially growing prior-work penalty. This means that combining four enchanted books onto a tool one at a time, the natural approach, can hit the game's hard cost cap, while the same books merged in a balanced, tree-like sequence often succeed. The structure mirrors classic computer-science problems such as optimal-merge and Huffman coding, where the goal is to arrange pairwise operations so costly steps occur as early and as shallowly as possible. The author argues the problem is complex enough that it warrants a dedicated tool rather than mental calculation.
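The effect of merge order can be sketched with the prior-work penalty alone. The code below assumes the commonly documented rule that an item used n times in an anvil carries a penalty of 2^n - 1 levels, and that a merged item inherits the larger use count of its two inputs plus one. It ignores the enchantments' base prices, and the item names and plan encoding are illustrative only.

```python
def prior_work_penalty(uses: int) -> int:
    # Assumed formula: 0, 1, 3, 7, 15, ... levels after 0, 1, 2, 3, 4 uses.
    return 2 ** uses - 1


def total_penalty(plan):
    """Return (summed prior-work penalty, anvil uses on the result).

    A plan is either a string (a fresh item) or a pair (left, right)
    meaning "combine these two in an anvil".
    """
    if isinstance(plan, str):
        return 0, 0
    left_cost, left_uses = total_penalty(plan[0])
    right_cost, right_uses = total_penalty(plan[1])
    step = prior_work_penalty(left_uses) + prior_work_penalty(right_uses)
    return left_cost + right_cost + step, max(left_uses, right_uses) + 1


# Four books added to the tool one after another.
one_at_a_time = (((("tool", "b1"), "b2"), "b3"), "b4")
# The same five items merged as a shallower tree.
balanced = (("tool", "b1"), ("b2", ("b3", "b4")))

print(total_penalty(one_at_a_time))  # (11, 4)
print(total_penalty(balanced))       # (5, 3)
```

Under these assumptions the chained plan pays 11 levels of penalty and the tree pays 5, before any enchantment costs are added on top.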

Read more at DEV Community

Programming · DEV Community

Developer finds his own AI UI ruleset was generating the flaws it aimed to prevent

A developer who maintains StyleSeed, an open-source design ruleset for AI coding tools like Claude Code and Cursor, spent months cataloguing visual patterns that make AI-generated interfaces look generic. While testing his own rules, he discovered that one instruction was actively telling agents to use varied colors in status lists — the exact flaw the project was meant to eliminate. Running StyleSeed's quality gate on its own landing page returned a score of 58 out of 100, exposing multiple violations including icon chips, excess accent colors, and small text. After fixing the contradictory rules and rebuilding the landing page, the developer concluded that design coherence guidelines must be applied universally, including to the tools and pages promoting them. He also warns that as first-generation AI UI tells get patched, agents are converging on a new set of recognizable patterns, such as ghost index numbers and oversized KPI cards, which risk becoming the next generation of tells.

Read more at DEV Community

Programming · DEV Community

Feed Validation Errors Back to LLMs to Fix Structured Output Failures

When an LLM returns structured output that fails schema validation, a plain retry offers little improvement because it repeats the same prompt under similar conditions. A more effective approach passes the validation error and the model's previous flawed response back into the next prompt, so that the model edits its answer rather than regenerating it. This technique was applied in a RAG platform, where a retry loop formats the validation exception into plain-language instructions and supplies the prior output as the object to correct. The method typically preserves already-correct fields while fixing only the specific field that failed, improving success rates significantly. Key caveats include capping retry attempts to control costs and ensuring a fallback for cases where the original response is too malformed to serialize back into the prompt.
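A minimal sketch of such a loop follows. It is not the platform's code: `call_model` stands in for whatever client sends a prompt and returns text, the two-field schema in `validate` is invented for illustration, and the correction wording is only one possible phrasing.

```python
import json


def validate(payload: dict) -> None:
    # Hypothetical schema: "title" is a string, "score" is an int from 0 to 100.
    if not isinstance(payload.get("title"), str):
        raise ValueError("field 'title' must be a string")
    score = payload.get("score")
    if not isinstance(score, int) or not 0 <= score <= 100:
        raise ValueError("field 'score' must be an integer between 0 and 100")


def generate_with_feedback(call_model, prompt: str, max_attempts: int = 3) -> dict:
    """Retry with the previous reply and its validation error in the prompt."""
    next_prompt = prompt
    for _ in range(max_attempts):  # capped to keep cost bounded
        reply = call_model(next_prompt)
        try:
            payload = json.loads(reply)
            if not isinstance(payload, dict):
                raise ValueError("top-level value must be a JSON object")
            validate(payload)
            return payload
        except ValueError as error:  # json.JSONDecodeError is a ValueError
            # Ask for an edit of the previous reply, not a fresh generation.
            next_prompt = (
                f"{prompt}\n\nYour previous reply was:\n{reply}\n\n"
                f"It failed validation: {error}\n"
                "Return the same JSON with only that problem fixed."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

Here the raw reply text is pasted back into the prompt, so an unparseable response can still be shown to the model. A pipeline that re-serializes a parsed object instead would need a separate fallback for that case.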

Read more at DEV Community

Programming · DEV Community

Developer Builds Unity APK Parser to Decode Three Conflicting Build Size Metrics

A developer working on a Unity build-size experiment discovered that Unity's build pipeline reports three different size figures for the same Android APK: the actual compressed file size (17.10 MiB), the BuildReport summary total (143.54 MiB), and the sum of packed asset entries (5.60 MiB). Each metric measures a different aspect of the build output, which initially caused confusion about which number reflects what users actually download. To resolve this, the developer wrote a custom BuildReport parser using Unity's IPostprocessBuildWithReport interface, generating a repeatable text report after every build. The tool ranks packed assets by size, groups duplicate source entries, and records all three metrics side by side to prevent mix-ups. The full source code has been published on GitHub as part of an ongoing Build Analyzer series.

Read more at DEV Community