Agent Success Rates Are Inflated Because Timed-Out and Hung Runs Go Uncounted
A common flaw in AI agent monitoring makes success rates look higher than they are: timed-out, aborted, and hung jobs are left out of the denominator. Most dashboards compute the rate as successful runs divided by only those runs that returned a clear pass or fail, silently discarding every run that never finished. This is the survivorship bias that statistician Abraham Wald identified during World War II, when he warned that damage patterns on returning bombers said nothing about the planes that never made it back. A failed run is, by comparison, the honest outcome: it is logged, counted, and already pulling the rate down as it should. The straightforward fix is to count every run that started, not just those that finished. On synthetic test data, that change drops an apparent 90% success rate to a true 72%.
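The difference between the two denominators can be shown with a minimal Python sketch. The status labels, the `success_rates` helper, and the run counts here are all hypothetical: the counts were chosen only to reproduce the 90% and 72% figures quoted above and are not the article's actual dataset.

```python
from collections import Counter

# Hypothetical status labels; a real system will have its own.
FINISHED = {"success", "failure"}


def success_rates(statuses):
    """Return (apparent_rate, all_started_rate) for a list of run statuses.

    apparent_rate divides successes by runs that returned pass or fail.
    all_started_rate divides successes by every run that started.
    Both are 0.0 when their denominator is empty.
    """
    counts = Counter(statuses)
    wins = counts["success"]
    finished = sum(counts[s] for s in FINISHED)
    started = sum(counts.values())
    apparent = wins / finished if finished else 0.0
    all_started = wins / started if started else 0.0
    return apparent, all_started


# Illustrative data (an assumption): 100 started runs, 20 of which never finished.
runs = (
    ["success"] * 72
    + ["failure"] * 8
    + ["timeout"] * 12
    + ["aborted"] * 3
    + ["running"] * 5
)

apparent, true_rate = success_rates(runs)
print(f"apparent: {apparent:.0%}, all started: {true_rate:.0%}")
# prints "apparent: 90%, all started: 72%"
```

This sketch treats any run that is still in the "running" state as not successful. A real dashboard would probably want to apply that only to runs older than some timeout threshold, so that healthy in-flight runs are not counted against the rate.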
This is an AI-generated summary. ShortSingh links to the original source for the complete article.