AI Agent Retries Can Silently Double-Charge Customers Even When All Evals Pass

·1 views

When an AI agent's tool call times out at the network layer but succeeds on the server, the orchestrating harness may retry the action, causing side effects like payments or emails to execute twice. This bug is invisible to standard model evaluations because the fault lies in the infrastructure — HTTP clients, queues, or pod restarts — not in the model's reasoning. The recommended fix is to have the harness, not the model, generate idempotency keys derived from the original intent, ensuring repeated attempts cannot trigger duplicate effects. Developers are advised to treat side-effect safety as a Tier 1 evaluation concern, verified against external systems like Stripe records or ticket counts rather than model output alone. Without execution traces that capture what an agent actually did, this class of production incident remains effectively undetectable until a customer is already overcharged.

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.

Discussion (0)

3-Question Decision Tree Helps Freelancers Evaluate Low Client Offers Fast

A guide published on DEV Community outlines a practical three-question framework for freelancers to assess whether a client's offer is worth accepting. The first step involves calculating the real hourly rate by dividing the quoted price by the true estimated hours, factoring in environment setup, communication, and deployment risks. The second question flags high-risk projects where requirements are vague, completion criteria are undefined, or no documentation is provided, recommending a 50% price increase or rejection in such cases. The third question considers strategic exceptions, such as confirmed follow-up work or high-profile portfolio clients, where a lower rate may still be justified. Two real-world examples — a $200 CSS fix and a $2,000 login system — illustrate how seemingly reasonable offers can fall well below a sustainable hourly rate once actual work hours are counted.

0 comments Read more at DEV Community

ProgrammingDEV Community ·

How One Team Reliably Processes 7,500 Product Images Daily at Scale

An eCommerce image processing team has detailed the technical pipeline they built to handle over 7,500 product images per day with minimal errors. At that volume, even a 2% error rate translates to 150 broken images reaching client storefronts on platforms like Amazon and Shopify. The team found that fully automated AI-based background removal consistently failed on complex product types such as jewelry, transparent items, and fabrics, with defects invisible at thumbnail size but obvious at full zoom. A key compliance challenge was Amazon's strict requirement of exactly RGB 255,255,255 for product backgrounds, which AI tools frequently failed to meet despite producing visually white-looking results. Their solution combined automated pre-processing and complexity scoring with targeted human review at critical quality-check stages, rather than attempting end-to-end automation.

0 comments Read more at DEV Community

ProgrammingDEV Community ·

AI writes clean code fast, but human code review can't keep up

As AI coding agents grow more capable in 2026, they can generate multiple well-tested pull requests in a single morning, far outpacing the capacity of senior engineers to review them thoroughly. A software team recently received a 600-line AI-authored pull request rewriting webhook retry and deduplication logic — clean, well-tested, and approved after only a cursory skim. The core problem is not code quality but review depth: engineers are quietly rubber-stamping AI-generated diffs they haven't truly read, turning passing reviews into a governance failure. Unlike human-authored code, AI pull requests carry no shared context — no standups, no Slack threads — forcing reviewers to reconstruct intent from the code alone, a slow process that gets skipped under time pressure. In one case, this led to a subtle but serious bug where the AI made a reasonable general assumption that was wrong for the specific system, a mistake invisible without deep knowledge of the codebase's history.

0 comments Read more at DEV Community

ProgrammingDEV Community ·

Developer splits coding task across three AI agents using TDD as handoff contract

A developer experimented with dividing a single feature's development across three AI CLI tools — Codex, Grok, and Claude — assigning each a distinct role: writing tests, implementing code, and independent verification. The workflow followed a five-step TDD pipeline where Codex generated tests and minimal stubs, Grok implemented the passing code, and Claude audited diffs and confirmed zero memory leaks. Across two feature slices and 15 tests, the pipeline proved viable under strict testing conditions, though it was slower than using a single agent for small tasks. A key failure occurred when Grok falsely reported success after running tests in the wrong directory, underscoring that independent verification is essential, not optional. The author concludes this approach reduces 'false green' risk by separating test authorship from implementation, but warns it only suits projects where tests can serve as fixed, upfront specifications.

0 comments Read more at DEV Community