Developer Builds AI-Maintained Failure Log to Close the ML Eval Feedback Loop


A developer working on an MLX-based classifier that maps work sessions to Jira tickets found that running evaluations was easy, but tracking and diagnosing recurring failures was not. After accumulating 62 failures across three eval runs with no reliable way to spot patterns, they designed a structured solution using a Claude Code skill invoked manually after each evaluation. The workflow writes failure data to a machine-maintained file called FEEDBACK.json, storing runs, individual observations, and named failure classes that persist across multiple eval cycles. To keep context usage manageable, the skill queries only targeted slices of the file using jq rather than loading it entirely. The approach aims to turn evaluation results into an actionable engineering tool rather than a static scoreboard.
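
As a rough illustration of the idea, the sketch below builds a small in-memory stand-in for FEEDBACK.json and pulls out only one slice of it. The summary names the three top-level concepts (runs, observations, failure classes) but not the schema, so every field name, run id, and failure-class name here is a hypothetical assumption, and the sample records are made up. The original workflow uses jq for slicing; the Python functions only mimic what a targeted filter would return.

```python
import json

# Hypothetical layout for FEEDBACK.json. Field names and sample records are
# illustrative assumptions, not the schema from the original article.
feedback = {
    "runs": [
        {"id": "run-1", "failures": 2},
        {"id": "run-2", "failures": 1},
    ],
    "failure_classes": [
        {"name": "wrong-ticket-same-epic", "status": "open"},
        {"name": "empty-session", "status": "open"},
    ],
    "observations": [
        {"run": "run-1", "failure_class": "wrong-ticket-same-epic"},
        {"run": "run-1", "failure_class": "empty-session"},
        {"run": "run-2", "failure_class": "wrong-ticket-same-epic"},
    ],
}


def observations_for(data, failure_class):
    """Return only the observations that belong to one failure class.

    Comparable to a jq filter such as:
      jq '[.observations[] | select(.failure_class == "empty-session")]' FEEDBACK.json
    """
    return [o for o in data["observations"] if o["failure_class"] == failure_class]


def class_counts(data):
    """Count observations per failure class across all runs."""
    counts = {}
    for obs in data["observations"]:
        name = obs["failure_class"]
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    # Print one targeted slice instead of the whole file.
    print(json.dumps(observations_for(feedback, "wrong-ticket-same-epic"), indent=2))
    print(class_counts(feedback))
```

Reading one slice at a time keeps the amount of text handed to the model small, which is the context-saving point the summary describes.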

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.


Developer tool fimod aims to replace repetitive shell scripts in CI pipelines

A developer has published fimod, a lightweight command-line tool designed to handle small data transformation tasks in CI pipelines without resorting to ad-hoc Python snippets or complex shell scripting. The tool supports reading and writing JSON, YAML, and CSV formats, and accepts Python-like expressions to extract, reshape, or validate structured data. Among its built-in features are direct HTTPS URL fetching, regex helpers with named capture groups, dot-path access for nested fields, and SHA-256 hashing for data anonymization. The developer positions fimod not as a replacement for established tools like jq or yq, but as a reusable, portable utility for routine data-shaping tasks shared across repositories. The project is open source and available on GitHub under the handle pytgaen.
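
To make two of the listed features concrete, here is a minimal sketch of dot-path access and SHA-256 anonymization written in plain Python with the standard library. It does not use fimod and does not reflect fimod's actual syntax or API; the helper names `get_path` and `anonymize` and the sample record are hypothetical.

```python
import hashlib


def get_path(data, path, default=None):
    """Walk nested dicts using a dot-separated path such as 'user.email'."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            # Any missing segment yields the default instead of raising.
            return default
    return current


def anonymize(value):
    """Replace a string with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Made-up sample record for illustration only.
record = {"user": {"email": "dev@example.com", "team": "platform"}}

masked = {
    "team": get_path(record, "user.team"),
    "email_hash": anonymize(get_path(record, "user.email")),
}

if __name__ == "__main__":
    print(masked)
```

An unsalted hash of a guessable value such as an email address can be reversed by hashing candidate values, so a sketch like this is better suited to pseudonymizing CI fixtures than to protecting sensitive data.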

Read more at DEV Community

BuyWhere MCP Server Lets Shoppers Compare Prices Across 9 Countries in One Query

A developer tool called the BuyWhere MCP server enables real-time cross-border product price comparisons across nine countries and 11 million products through a single function call. The tool exposes a unified search interface that returns merchant names, prices, currencies, and product URLs from multiple marketplaces simultaneously. A task that previously required 15–20 minutes of manual browsing across platforms like Shopee, Lazada, and Amazon can now be completed in under three seconds. The server is compatible with popular AI development environments including Claude Desktop, Cursor, and VS Code, and can also be called via Python scripts. It eliminates the need to build separate API clients for each marketplace, making it particularly useful for developers integrating price comparison into AI agents or shopping tools.

Read more at DEV Community

Coolify Offers Self-Hosted PaaS Alternative to Vercel and Heroku at a Fraction of the Cost

Coolify is an open-source, self-hosted platform-as-a-service that lets developers deploy apps via Git push to their own servers, with automatic TLS certificates handled through Traefik and Let's Encrypt. It supports over 280 one-click services, including databases and analytics, and works with build tools like Nixpacks, Dockerfile, and Docker Compose. The platform reached a stable release with v4.0.0 in April 2026, and v4.1 added Railpack support and audit logging the following month. A comparable Next.js app with Postgres and Redis that costs roughly $1,200 per year on Heroku can run on a Coolify-managed VPS for around €145 annually. Unlike managed SaaS platforms, however, Coolify shifts operational responsibilities, including scaling, security patching, and uptime, entirely onto the user.

Read more at DEV Community

Anthropic Promotes New Claude Model With Stronger Coding and Agentic Capabilities

Anthropic staff members Thariq Shihipar and Cat Wu joined a roundtable hosted by Datasette founder Simon Willison to showcase the company's latest Claude model, referred to as Fable. The team highlighted significant improvements in agentic coding, claiming the model handles 50% more pull requests and can write its own test scripts to verify its output. Fable can operate autonomously for longer periods, delegate tasks to subagents, process visual data from images and graphs, and be monitored or controlled remotely via a separate device. A live demonstration saw the model independently configure a Microsoft Teams account, add contacts, draft a welcome message, and contact IT administrators — all within 30 minutes. Anthropic staff also shared personal use cases, including building a 2D fighting game and developing a mountain-climbing route planner, illustrating the model's versatility beyond professional tasks.

Read more at DEV Community
