CTO cuts LLM costs from $4,800 to $120/month by switching models, no code rewritten

A CTO at an unnamed company reduced the company's monthly AI inference bill roughly 40-fold, from $4,800 to under $200, without any code changes or customer-facing disruption. The company had been using OpenAI's GPT-4o for summarization, a customer support copilot, and internal tools, paying $2.50 per million input tokens and $10 per million output tokens. After evaluating several alternative models, the CTO found that DeepSeek V4 Flash offered comparable quality at just $0.18 per million input tokens and $0.25 per million output tokens. A blind A/B test on 500 production prompts confirmed that DeepSeek V4 Flash performed within statistical noise of GPT-4o on summarization tasks. The CTO noted that the migration required no new routing logic or fallback code, and that meaningful cost savings were also available within OpenAI's own model lineup via GPT-4o-mini.
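
The per-token prices quoted above can be turned into a rough bill estimate. The sketch below is illustrative only: the dictionary keys are labels for this example rather than API model identifiers, and the monthly token volumes are hypothetical, since the summary does not state the company's actual usage.

```python
# Per-million-token prices (USD) as quoted in the summary above.
# The keys are plain labels for this sketch, not API model identifiers.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "deepseek-v4-flash": {"input": 0.18, "output": 0.25},
}


def monthly_cost(model, input_tokens, output_tokens):
    """Return the monthly bill in USD for the given token volumes."""
    price = PRICES[model]
    return (
        (input_tokens / 1_000_000) * price["input"]
        + (output_tokens / 1_000_000) * price["output"]
    )


def price_ratio(kind):
    """How many times cheaper the alternative is for 'input' or 'output' tokens."""
    return PRICES["gpt-4o"][kind] / PRICES["deepseek-v4-flash"][kind]


# Hypothetical monthly usage (an assumption, not a figure from the article).
INPUT_TOKENS = 400_000_000
OUTPUT_TOKENS = 380_000_000

for model in PRICES:
    print(f"{model}: ${monthly_cost(model, INPUT_TOKENS, OUTPUT_TOKENS):,.2f}")
print(f"input price ratio:  {price_ratio('input'):.1f}x")
print(f"output price ratio: {price_ratio('output'):.1f}x")
```

Because the two price ratios differ, the overall saving depends on the mix of input and output tokens, which is why an estimate like this needs real usage figures before it can be relied on.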

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.

Persism 2.4 Released: Lightweight Auto-Configuration Java ORM Library

Persism 2.4 is a lightweight Java ORM library that minimizes setup complexity through auto-discovery and auto-configuration. It follows a convention-over-configuration approach, reducing the boilerplate developers need to write, and ships as a JAR file of just 100 KB.

Read more at DEV Community

External and Internal Attention Are One Shared Operation, Not Two Separate Faculties

Researchers describe external and internal attention as a single selective operation applied in two directions: outward toward sensory inputs and inward toward self-generated mental contents such as memories and rules. Psychologist Michael Posner's 1980 spotlight model and Desimone and Duncan's 1995 biased competition theory together explain how the brain privileges certain perceptual content over others. Chun, Golomb, and Turk-Browne (2011) extended this framework to internal attention, showing the same selection mechanism governs internally generated content. The key claim is not merely that the two modes resemble each other, but that they draw on a single shared pool of cognitive resources with a common capacity limit. A speculative extension suggests attention can also be directed at the system's own ongoing processing, though the authors caution this remains an open question separate from the core argument.

Read more at DEV Community

How Symfony 7 Uses DTOs and MapRequestPayload to Secure API Requests

A technical guide published on DEV Community outlines a modern approach to validating API requests in Symfony 7, arguing that the request itself is the first line of defense against untrusted data. The article criticizes common production practices that rely on manual json_decode() calls and scattered conditional checks, calling them fragile and easy to bypass. It proposes using Data Transfer Objects (DTOs) combined with Symfony's built-in Validator constraints to define and enforce the expected shape of incoming payloads. The #[MapRequestPayload] attribute is highlighted as a clean way to automatically parse and validate request data directly in controller method signatures. The guide also addresses a frequently misunderstood point about where XSS protection belongs within this validation workflow.
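
The underlying idea, declaring the expected payload shape once and rejecting anything that does not fit, can be sketched in a language-neutral way. The Python below is an analogy only, not Symfony code: `CreateUserRequest`, its fields and limits, and `map_request_payload` are hypothetical names invented for this illustration, and the real `#[MapRequestPayload]` attribute and Validator constraints are PHP features documented by Symfony.

```python
import json
from dataclasses import dataclass, fields


class ValidationError(ValueError):
    """Raised when a payload does not match the DTO's declared shape."""


@dataclass(frozen=True)
class CreateUserRequest:
    """Hypothetical DTO: the single place where the expected shape is declared."""

    email: str
    display_name: str
    age: int

    def __post_init__(self):
        errors = []
        if not isinstance(self.email, str) or "@" not in self.email:
            errors.append("email: must contain '@'")
        if not isinstance(self.display_name, str) or not 1 <= len(self.display_name) <= 50:
            errors.append("display_name: must be 1-50 characters")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(self.age, bool) or not isinstance(self.age, int) or not 13 <= self.age <= 120:
            errors.append("age: must be an integer between 13 and 120")
        if errors:
            raise ValidationError("; ".join(errors))


def map_request_payload(raw_body):
    """Parse a JSON body and return a validated DTO, or raise ValidationError."""
    try:
        data = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise ValidationError(f"malformed JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValidationError("payload must be a JSON object")

    allowed = {f.name for f in fields(CreateUserRequest)}
    unknown = set(data) - allowed
    missing = allowed - set(data)
    if unknown or missing:
        raise ValidationError(
            f"unknown fields: {sorted(unknown)}; missing fields: {sorted(missing)}"
        )
    return CreateUserRequest(**data)
```

A handler written this way receives either a fully validated object or an error, so no conditional checks are scattered through the business logic. Rejecting unknown fields is a design choice made for this sketch, not something the summary attributes to the guide.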

Read more at DEV Community

CLAIIM Proposes Governance Layer to Control and Audit AI Agent Actions

As AI agents move beyond answering questions to taking real-world actions in production environments, a critical governance gap has emerged that traditional identity and access management systems cannot address. IAM tools verify whether a credential has permission to reach a system, but cannot determine whether an agent's action falls within its intended scope or bind it to a named accountable human. CLAIIM is a proposed identity control plane for AI agents that introduces four components: governed agent identities with human accountability anchors, a policy gate that evaluates and approves or blocks actions before execution, versioned skills and policies locked at evaluation time, and an append-only audit trail called Chron. In a practical example, a deployment agent can be configured to freely deploy to staging while being explicitly blocked from touching production, with every decision logged instantly. The framework aims to ensure that for any AI-triggered action, operators can immediately answer who acted, under whose authority, within which policy, and with what verifiable proof.
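
The staging-versus-production example can be made concrete with a small sketch of a policy gate that decides before execution and logs every decision. This is not CLAIIM's actual API, which the summary does not describe: every class, field, and function name below is hypothetical, and the audit trail here is a simple in-memory stand-in for the append-only log the summary calls Chron.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str
    accountable_human: str  # the named human who answers for this agent


@dataclass(frozen=True)
class Policy:
    version: str
    allowed: frozenset  # set of permitted (action, environment) pairs


@dataclass(frozen=True)
class AuditRecord:
    timestamp: str
    agent_id: str
    accountable_human: str
    policy_version: str
    action: str
    environment: str
    decision: str


class AuditTrail:
    """Append-only log: records can be added and read, never edited or removed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def records(self):
        return tuple(self._records)  # read-only snapshot


def evaluate(agent, policy, action, environment, trail):
    """Decide before execution whether the action is in scope, and log the decision."""
    permitted = (action, environment) in policy.allowed
    trail.append(
        AuditRecord(
            timestamp=datetime.now(timezone.utc).isoformat(),
            agent_id=agent.agent_id,
            accountable_human=agent.accountable_human,
            policy_version=policy.version,
            action=action,
            environment=environment,
            decision="approved" if permitted else "blocked",
        )
    )
    return permitted


# Hypothetical deployment agent: staging is allowed, production is not.
agent = AgentIdentity(agent_id="deploy-bot", accountable_human="jane.doe")
policy = Policy(version="v1", allowed=frozenset({("deploy", "staging")}))
trail = AuditTrail()

staging_ok = evaluate(agent, policy, "deploy", "staging", trail)
production_ok = evaluate(agent, policy, "deploy", "production", trail)
```

Each record captures the four questions listed above: who acted, under whose authority, within which policy version, and with what logged decision.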

Read more at DEV Community