Hosted LLM Gateways Charge ~5% Per Token — Here Is What You Actually Pay For

·1 views

Hosted LLM routers like OpenRouter and Requesty have grown rapidly, with OpenRouter processing 25 trillion tokens per week, but they charge roughly 5% on every token routed through their infrastructure. For teams spending $500–$2,000 per engineer monthly on AI coding workloads, that fee adds up significantly alongside data-privacy concerns, since all prompts and code transit third-party servers. Self-hosted gateways eliminate the routing fee and keep data in-house, and can also apply token optimizations — such as compressing tool results — that reduce provider billing further. However, self-hosting requires managing your own provider API keys, software updates, and support, and lacks the instant access to new models that hosted marketplaces provide. The practical middle ground for most teams is routing high-volume or sensitive requests through a self-hosted proxy while reserving hosted routers for cases where broad model access and consolidated billing justify the cost.

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.

Discussion (0)

Developer Uses ONNX Runtime and Pyannote 3.0 to Split Two-Speaker Audio Into Segments

A developer has demonstrated how to detect speaker changes in a two-person audio conversation using an ONNX version of the Pyannote Segmentation 3.0 model running on CPU via ONNX Runtime. The experiment uses FFmpeg to decode a roughly 14-second MP3 recording into a 16 kHz mono waveform, which is then processed in 10-second windows to identify where one speaker gives way to another. The pipeline successfully separates six alternating utterances into six individual WAV files while maintaining consistent speaker indexing throughout. Post-processing steps handle silence, brief fluctuations, and potential overlapping speech using probability thresholds and minimum segment duration rules. The author notes this is not a full diarization pipeline, as it relies on the model's internal speaker indexes rather than embedding comparison or clustering across longer recordings.

0 comments Read more at DEV Community

ProgrammingDEV Community ·

Developer Uses Custom BMAD Skill to Automate Multi-Story Implementation Loop

A developer working with the BMAD AI-assisted development framework identified a key pain point: the repetitive, manual approval steps required during the implementation phase. While exploring alternatives like the Superpowers framework, the developer found BMAD superior for large, complex projects due to its structured planning stages covering brainstorming, PRD, UX, and architecture. To address the automation gap, the developer devised a custom 'skill' that instructs a main agent to spawn independent sub-agents for each development story, running them in a loop until all stories are complete. This approach avoids context bloat by keeping each story implementation in a separate session, and is designed to pause only when a genuine blocker is encountered. The solution eliminates the need for a Python orchestration script and leverages BMAD's own agent capabilities to streamline the end-to-end development workflow.

0 comments Read more at DEV Community

ProgrammingDEV Community ·

Why AI Search Engines Often Ignore Well-Ranked Single-Page Web Apps

Single-page applications built with frameworks like React or Vue may be effectively invisible to AI crawlers, which typically do not execute JavaScript and only read raw server-sent HTML. A 2025 study found that around 68 percent of pages cited in AI Overviews did not appear in the top ten organic Google search results, highlighting that ranking and being cited by AI are now distinct challenges. This means a web app can hold a top Google ranking yet still be absent from AI-generated answers, or worse, be misrepresented by AI models pulling incomplete data. Traditional SEO guidance, largely aimed at content teams managing blogs, does not address the technical needs of developers building modern web apps for AI discoverability. Developers are being advised to provide cleaner, machine-readable content structures so that AI systems can accurately represent their products in search responses.

0 comments Read more at DEV Community

ProgrammingDEV Community ·

Why 'Loop Engineering' in AI Works Now but Struggled in 2022–23

Loop engineering — where an AI agent repeatedly generates, tests, and refines its output until a problem is solved — is not a new concept, but earlier language models lacked the consistency and context capacity to make it practical. Older models frequently misunderstood feedback, repeated mistakes, and lost track of prior attempts as limited context windows forced older messages to be dropped. Advances in three key areas have changed this: larger context windows let models retain full iteration histories, improved consistency means models now converge on correct solutions rather than producing varied half-baked answers, and better tool integration allows models to actually run code and read real error outputs. Falling inference costs have also made running multiple back-to-back iterations economically viable, whereas in 2022–23 the compute expense discouraged casual looping. Tools like Claude Code automate this generate–verify–retry cycle, but the core mechanic remains a simple feedback loop that any user can replicate manually with any capable model.

0 comments Read more at DEV Community