Top AI Models Fail 2 in 3 Tax Returns, Experts Warn Against Financial Decision Role

A benchmark called TaxCalcBench tested leading AI models on 51 real 2024 US tax returns with verified IRS answers, finding that even the best performer, Gemini 2.5 Pro, answered correctly only 32% of the time under strict scoring. Claude Opus 4 and Sonnet 4 scored 27% and 23% respectively, with errors spanning wrong tax tables, arithmetic mistakes, and inconsistent outputs across repeated queries. The benchmark's authors concluded that deterministic tax calculation engines remain essential for tasks requiring consistent, auditable results. Analysts argue the core problem is structural: LLMs are probabilistic systems being used as if they were deterministic, making them unsuitable as final decision-makers on money-related rules like pricing, discounts, or tax eligibility. The proposed fix is a division of labour, where the LLM translates natural-language rules into a structured specification, while a separate deterministic engine handles the actual execution and final ruling.
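The proposed division of labour could be sketched roughly as follows. This is a minimal illustration, not the benchmark's code or the original article's implementation: the `Spec` and `Bracket` types, the JSON field names, and the two-bracket rate table are all hypothetical. The JSON stands in for what an LLM might emit after reading a natural-language rule; the Go code validates it and then computes the result with integer arithmetic, so the same input always yields the same output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Bracket is one row of a hypothetical progressive rate table.
// UpToCents is the bracket's upper bound in cents; 0 means "no upper bound".
type Bracket struct {
	UpToCents int64 `json:"up_to_cents"`
	RateBps   int64 `json:"rate_bps"` // basis points: 1000 = 10%
}

// Spec is the structured specification an LLM would be asked to produce.
type Spec struct {
	Name     string    `json:"name"`
	Brackets []Bracket `json:"brackets"`
}

// ParseSpec decodes and validates a spec; malformed specs are rejected
// before any money calculation happens.
func ParseSpec(raw []byte) (Spec, error) {
	var s Spec
	if err := json.Unmarshal(raw, &s); err != nil {
		return Spec{}, fmt.Errorf("invalid spec JSON: %w", err)
	}
	if len(s.Brackets) == 0 {
		return Spec{}, fmt.Errorf("spec %q has no brackets", s.Name)
	}
	var prev int64
	for i, b := range s.Brackets {
		last := i == len(s.Brackets)-1
		if b.RateBps < 0 || b.RateBps > 10000 {
			return Spec{}, fmt.Errorf("bracket %d: rate out of range", i)
		}
		if last != (b.UpToCents == 0) {
			return Spec{}, fmt.Errorf("bracket %d: exactly the last bracket must be unbounded", i)
		}
		if !last {
			if b.UpToCents <= prev {
				return Spec{}, fmt.Errorf("bracket %d: bounds must be ascending", i)
			}
			prev = b.UpToCents
		}
	}
	return s, nil
}

// TaxCents applies the brackets deterministically. Each bracket's share is
// truncated to whole cents by integer division.
func TaxCents(s Spec, incomeCents int64) int64 {
	var tax, lower int64
	for _, b := range s.Brackets {
		upper := b.UpToCents
		if upper == 0 || incomeCents < upper {
			upper = incomeCents
		}
		if upper > lower {
			tax += (upper - lower) * b.RateBps / 10000
		}
		if b.UpToCents == 0 || incomeCents <= b.UpToCents {
			break
		}
		lower = b.UpToCents
	}
	return tax
}

func main() {
	// Made-up rates for illustration only; not real tax tables.
	raw := []byte(`{"name":"demo","brackets":[
		{"up_to_cents":1000000,"rate_bps":1000},
		{"up_to_cents":0,"rate_bps":2000}]}`)
	spec, err := ParseSpec(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(TaxCents(spec, 1500000)) // prints 200000
}
```

In this arrangement the model's output is only ever data. A rejected spec can be logged and sent back for correction, and an accepted one can be stored and replayed for audit.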

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.

Waag Moves Bluesky Data to European-Hosted Eurosky Instance

Dutch public technology institute Waag has migrated its Bluesky social media data to Eurosky, a European-hosted alternative instance of the platform. The move reflects growing concerns among European organizations about data sovereignty and reliance on US-based digital infrastructure. Eurosky operates on the AT Protocol, the same open standard underlying Bluesky, allowing interoperability while keeping data within European jurisdiction. Waag published an article explaining its reasoning, citing alignment with its values around open, publicly governed technology. The decision mirrors a broader trend of European institutions seeking greater control over their digital presence and user data.

Read more at Hacker News

Programming · DEV Community

Oracle PeopleSoft Vulnerabilities Exploited in Attack on Nissan and 100+ Firms

A coordinated cyberattack exploiting vulnerabilities in Oracle PeopleSoft has compromised more than 100 organizations, including Nissan, exposing sensitive employee data. Attackers leveraged known flaws in PeopleSoft's Java deserialization handlers and HTTP endpoints to achieve remote code execution on application servers. Once inside, threat actors were able to harvest authentication tokens, LDAP credentials, password hashes, and OAuth secrets stored within the platform. Because PeopleSoft systems typically integrate with enterprise identity infrastructure such as Active Directory and cloud HR platforms, the breach creates pathways for lateral movement across connected networks. The campaign highlights the elevated risk posed by centralized identity management systems that hold privileged access to broader enterprise environments.

Read more at DEV Community

Programming · DEV Community

Developer builds AI agent to automate AWS-to-GKE app migration with human oversight

A software developer built an AI-powered tool, packaged as a 'skill' for the Antigravity CLI (agy), to automate the refactoring of cloud-dependent codebases from AWS to Google Kubernetes Engine (GKE). The tool addresses common migration pain points such as hardcoded AWS credentials, use of proprietary SDKs like boto3, and reliance on local disk storage, which does not survive on ephemeral Kubernetes pods. It works by scanning the codebase for cloud dependencies, spawning parallel subagents to refactor code and infrastructure, and validating the changes on a local Kubernetes cluster before deployment. A mandatory human-in-the-loop (HITL) approval gate prevents unsupervised changes from reaching production environments. Unlike scripted find-and-replace, the approach relies on an LLM agent that can interpret semantic context and adapt to the current state of the codebase.
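To make the scan-then-approve flow concrete, here is a deliberately simple sketch. It does not reproduce the developer's skill or any Antigravity CLI API: `Finding`, `Scan`, and `Approve` are hypothetical names, and the scan is plain pattern matching rather than an LLM agent. It only illustrates the two ideas from the summary: flagging boto3 imports and hardcoded AWS access key IDs, and refusing to proceed without an explicit human "yes".

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// Finding records one cloud dependency spotted in a source file.
type Finding struct {
	Line int
	Kind string
}

// AWS access key IDs for long-term credentials start with "AKIA"
// followed by 16 upper-case alphanumeric characters.
var accessKeyID = regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)

// Scan walks Python source line by line and reports boto3 imports
// and strings that look like hardcoded AWS access key IDs.
func Scan(src string) []Finding {
	var out []Finding
	sc := bufio.NewScanner(strings.NewReader(src))
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "import boto3"),
			strings.HasPrefix(line, "from boto3"):
			out = append(out, Finding{Line: n, Kind: "aws-sdk"})
		case accessKeyID.MatchString(line):
			out = append(out, Finding{Line: n, Kind: "hardcoded-credential"})
		}
	}
	return out
}

// Approve is the human-in-the-loop gate: only an explicit "yes"
// from the reviewer lets the migration continue.
func Approve(answer string) bool {
	return strings.EqualFold(strings.TrimSpace(answer), "yes")
}

func main() {
	// AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
	src := "import boto3\nKEY = \"AKIAIOSFODNN7EXAMPLE\"\nprint(KEY)\n"
	for _, f := range Scan(src) {
		fmt.Printf("line %d: %s\n", f.Line, f.Kind)
	}
	if !Approve("no") {
		fmt.Println("changes held for review")
	}
}
```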

Read more at DEV Community

Programming · DEV Community

Go's Built-In pprof Tool Lets Developers Profile Live Services in Minutes

Go includes a built-in profiling tool called pprof that requires no third-party software or agents to operate. Developers can enable it by importing the net/http/pprof package, which registers HTTP endpoints exposing CPU, memory, goroutine, and mutex data. A 30-second CPU sample can be collected using the go tool pprof command, and results can be visualized as flame graphs through a built-in web UI. Flame graphs help identify bottlenecks such as excessive memory allocations, JSON serialization overhead, or lock contention by showing which functions consume the most CPU time. For security, pprof endpoints should only be bound to localhost or a private interface, never exposed on a public port.
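A minimal sketch of the setup described above, assuming a dedicated debug listener on port 6060 (the port and the `newDebugServer` helper are illustrative choices, not taken from the original article). The blank import registers the `/debug/pprof/` handlers on `http.DefaultServeMux`, and the server binds to the loopback address only.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// newDebugServer returns a server that exposes the default mux,
// and therefore the pprof handlers, on the given address.
func newDebugServer(addr string) *http.Server {
	return &http.Server{Addr: addr, Handler: http.DefaultServeMux}
}

func main() {
	// Loopback only: the profiling endpoints are not reachable from other hosts.
	srv := newDebugServer("127.0.0.1:6060")
	log.Printf("pprof listening on http://%s/debug/pprof/", srv.Addr)
	log.Fatal(srv.ListenAndServe())
}
```

With the service running, a 30-second CPU profile can be collected and opened in the web UI, which includes a flame graph view:

```shell
go tool pprof -http=localhost:8080 "http://127.0.0.1:6060/debug/pprof/profile?seconds=30"
```

Note that the mutex profile stays empty unless the program enables sampling with `runtime.SetMutexProfileFraction`.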

Read more at DEV Community