A missing database column triggered 21 LLM reruns, costing more than a month of server hosting
A SaaS platform built by a non-engineer executive racked up an unusually large AI API bill when a single batch job ran 21 times in one day instead of once. The root cause was a missing database column that made the job fail at the final save step, after every LLM API call had already succeeded and been billed. Because the results were never saved, the retry system treated each attempt as a fresh failure and restarted the entire expensive batch from scratch. The incident highlights a dangerous edge case: a retry that throws away responses which succeeded and were paid for, instead of repeating only the step that failed. The underlying trigger was an incorrect deployment order, which shipped the code while the schema change it depended on was still unapplied.
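One common way to guard against this failure mode is to checkpoint each paid response as soon as it arrives, so that a retry reuses stored results and repeats only the failed save. The summary does not say how the platform fixed the problem, so the following is a minimal sketch under stated assumptions, not the author's implementation. The names `call_llm`, `run_batch`, `llm_checkpoint`, `reports` and the `model` column are all hypothetical, and SQLite stands in for whatever database the platform uses.

```python
import sqlite3


def call_llm(prompt, counter):
    """Hypothetical stand-in for a billed LLM API call; counts invocations."""
    counter["calls"] += 1
    return f"summary of {prompt}"


def run_batch(conn, prompts, counter, final_save):
    """Run the batch, persisting every paid response before the final save."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_checkpoint "
        "(prompt TEXT PRIMARY KEY, response TEXT)"
    )
    results = {}
    for prompt in prompts:
        row = conn.execute(
            "SELECT response FROM llm_checkpoint WHERE prompt = ?", (prompt,)
        ).fetchone()
        if row is None:
            # No checkpoint yet: pay for the call once, then store it immediately.
            response = call_llm(prompt, counter)
            conn.execute(
                "INSERT INTO llm_checkpoint (prompt, response) VALUES (?, ?)",
                (prompt, response),
            )
            conn.commit()
        else:
            # A previous attempt already paid for this response: reuse it.
            response = row[0]
        results[prompt] = response
    final_save(results)  # may raise; the checkpoints survive the failure
    return results


def demo():
    """Simulate two failed saves (missing column), a migration, then success."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: the 'model' column is missing, as if the schema change
    # had not been deployed yet.
    conn.execute("CREATE TABLE reports (prompt TEXT, response TEXT)")
    counter = {"calls": 0}
    prompts = ["doc-1", "doc-2", "doc-3"]

    def final_save(results):
        conn.executemany(
            "INSERT INTO reports (prompt, response, model) VALUES (?, ?, ?)",
            [(p, r, "demo-model") for p, r in results.items()],
        )
        conn.commit()

    failures = 0
    for _attempt in range(3):
        try:
            run_batch(conn, prompts, counter, final_save)
            break
        except sqlite3.OperationalError:
            failures += 1
            if failures == 2:
                # The schema change is finally applied.
                conn.execute("ALTER TABLE reports ADD COLUMN model TEXT")

    saved = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
    return counter["calls"], failures, saved


if __name__ == "__main__":
    print(demo())
```

In this sketch the save fails twice, yet `call_llm` runs only once per prompt, because the second and third attempts read from `llm_checkpoint`. A production version would also need a retry cap and a checkpoint key that reflects the prompt version and model, which are left out here for brevity.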
This is an AI-generated summary. ShortSingh links to the original source for the complete article.