Top AI Models Fail 2 in 3 Tax Returns, Experts Warn Against Financial Decision Role
A benchmark called TaxCalcBench tested leading AI models on 51 real 2024 US tax returns with verified IRS answers, finding that even the best performer, Gemini 2.5 Pro, answered correctly only 32% of the time under strict scoring. Claude Opus 4 and Sonnet 4 scored 27% and 23% respectively, with errors spanning wrong tax tables, arithmetic mistakes, and inconsistent outputs across repeated queries. The benchmark's authors concluded that deterministic tax calculation engines remain essential for tasks requiring consistent, auditable results. Analysts argue the core problem is structural: LLMs are probabilistic systems being used as if they were deterministic, making them unsuitable as final decision-makers on money-related rules like pricing, discounts, or tax eligibility. The proposed fix is a division of labour, where the LLM translates natural-language rules into a structured specification, while a separate deterministic engine handles the actual execution and final ruling.
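The proposed division of labour can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark authors' implementation: the `DiscountRule` schema, the `parse_rule` and `apply_rule` helpers, and the sample discount rule are all hypothetical. The LLM's only job is to emit the small structured spec; validation and arithmetic are plain deterministic code using exact decimal math.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


@dataclass(frozen=True)
class DiscountRule:
    """Hypothetical structured spec an LLM might emit from a natural-language rule."""
    min_order_total: Decimal  # order total at which the discount starts to apply
    percent_off: Decimal      # discount percentage, 0 to 100


def parse_rule(spec: dict) -> DiscountRule:
    """Validate the LLM's output before it reaches the engine."""
    rule = DiscountRule(
        min_order_total=Decimal(str(spec["min_order_total"])),
        percent_off=Decimal(str(spec["percent_off"])),
    )
    if rule.min_order_total < 0 or not (Decimal("0") <= rule.percent_off <= Decimal("100")):
        raise ValueError("rule is out of the allowed range")
    return rule


def apply_rule(rule: DiscountRule, order_total: Decimal) -> Decimal:
    """Deterministic execution: the same rule and input always give the same result."""
    if order_total < rule.min_order_total:
        return order_total.quantize(CENT, rounding=ROUND_HALF_UP)
    discount = order_total * rule.percent_off / Decimal("100")
    return (order_total - discount).quantize(CENT, rounding=ROUND_HALF_UP)


# Assumed LLM output for a rule such as "10% off orders of 100.00 or more".
rule = parse_rule({"min_order_total": "100.00", "percent_off": "10"})
print(apply_rule(rule, Decimal("250.00")))  # 225.00
```

Because the final figure comes from `apply_rule` rather than from the model, repeated runs cannot disagree, and the stored spec gives an auditor something concrete to inspect.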
This is an AI-generated summary. ShortSingh links to the original source for the complete article.
