Why GPT Miscounts Letters in 'Strawberry': BPE Tokenization Explained


Large language models do not read text letter by letter. They process it as chunks called tokens, produced by an algorithm called Byte-Pair Encoding (BPE). BPE starts from individual characters or bytes and repeatedly merges the most frequent adjacent pair of symbols in the training data until a vocabulary of roughly 50,000 tokens is built. As a result, a word like 'strawberry' can be split into pieces such as 'straw' and 'berry'. The model sees those pieces as opaque token IDs rather than as the letters inside them, so the individual 'r' characters are not directly visible to it, which helps explain why AI systems often miscount letters. Capitalization and punctuation can also change how words are tokenized, sometimes multiplying the token count and, with it, API costs. An interactive BPE simulator has been released to help users observe token formation in real time and understand these limitations firsthand.
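The merge loop described above can be sketched in a few lines of Python. This is a toy illustration, not the tokenizer any production model uses: it works on characters rather than bytes, breaks ties by first occurrence, and the names `train_bpe`, `merge_pair`, `tokenize`, and `HAND_MERGES` are made up for this example. The hand-written merge list is an assumption chosen so that 'strawberry' ends up as 'straw' + 'berry', matching the split described in the summary.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a toy list of words."""
    # Every word starts out as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to the whole corpus before counting again.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Apply learned merge rules, in the order they were learned, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical, hand-written merge rules (not taken from any real vocabulary).
HAND_MERGES = [
    ("s", "t"), ("st", "r"), ("str", "a"), ("stra", "w"),
    ("b", "e"), ("be", "r"), ("ber", "r"), ("berr", "y"),
]
tokens = tokenize("strawberry", HAND_MERGES)  # ['straw', 'berry']
```

With these rules, `"strawberry".count("r")` is 3 at the character level, but no element of `tokens` is a standalone `'r'`, which is the gap the summary describes. Rules learned by `train_bpe` on a real corpus would generally produce different splits.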

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.


Akuna Capital Quant Super Day: Full Interview Process Breakdown for 2026

A candidate who recently completed Akuna Capital's full recruitment process for a Quant Software Engineering role in Toronto has shared a detailed account of the experience. The entire process, from application to final decision, took approximately three to four weeks and culminated in a single-day onsite event attended by around a dozen candidates. The Super Day comprised multiple rounds including HR behavioral, technical coding, an algorithm competition, and a whiteboard system design interview. Prior stages included a speed-based mental math assessment, a personality measure with logical reasoning components, and a recorded HireVue interview featuring both behavioral and math questions. The author noted that detailed Super Day accounts are rare online and published the breakdown to help future applicants prepare more effectively.

Read more at DEV Community

Programming · DEV Community

Agent Tool-Calling Pattern Bridges AI Intent and Reliable API Execution

The Agent Tool-Calling inference pattern addresses a core weakness in AI systems: language models are probabilistic, yet they must interact with strictly deterministic APIs. The main failure risk, known as Handoff Hallucination, occurs when a model calls a function with incorrect parameters, missing keys, or fabricated values. A closed-loop architecture addresses this by enforcing strict JSON schema contracts, so that the model either produces a valid tool call or triggers a self-correcting loop before any error reaches the database. Model Context Protocol (MCP) standardizes how tools are described and invoked, which helps backend services act as reliable executors of model intent. However, every additional tool expands the security surface and adds schema governance overhead, often requiring significant engineering effort to build robust validation layers.
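The closed loop described above might look roughly like the following sketch. It is an illustration under stated assumptions, not the article's implementation and not the MCP API: the `get_order` tool, its argument names, the `TOOL_SPECS` table, and the `model(prompt, feedback)` callable are all hypothetical, and a hand-rolled type check stands in for a full JSON Schema validator.

```python
import json

# Hypothetical tool contract: each tool maps its required arguments to Python types.
TOOL_SPECS = {
    "get_order": {"order_id": int, "include_items": bool},
}


def validate_call(raw):
    """Return (call, None) if `raw` is a valid tool call, else (None, error message)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "tool call must be a JSON object"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOL_SPECS:
        return None, f"unknown tool: {tool!r}"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return None, "'arguments' must be a JSON object"
    spec = TOOL_SPECS[tool]
    missing = sorted(set(spec) - set(args))
    extra = sorted(set(args) - set(spec))
    if missing or extra:
        return None, f"missing keys {missing}, unexpected keys {extra}"
    for name, expected in spec.items():
        # Exact type match: bool is a subclass of int in Python, so isinstance
        # would wrongly accept true/false for an integer field.
        if type(args[name]) is not expected:
            return None, f"{name!r} must be of type {expected.__name__}"
    return call, None


def run_with_retries(model, prompt, max_attempts=3):
    """Ask `model` for a tool call, feeding validation errors back until one is valid."""
    feedback = None
    for _ in range(max_attempts):
        raw = model(prompt, feedback)  # hypothetical callable returning a JSON string
        call, error = validate_call(raw)
        if call is not None:
            return call  # only validated calls are handed to the executor
        feedback = error  # the next attempt sees what was wrong with this one
    raise ValueError(f"no valid tool call after {max_attempts} attempts: {feedback}")
```

The design choice here is that the executor never sees an unvalidated call: a malformed response costs another model round trip instead of a failed or corrupting backend request. In a real system the type table would likely be replaced by the tool's published JSON schema.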

Read more at DEV Community

Programming · DEV Community

AI Engineer World's Fair Draws 7,000+ Attendees in San Francisco

The AI Engineer World's Fair, described as the largest gathering of AI engineers in the conference's three-year history, concluded at San Francisco's Moscone West venue with over 7,000 attendees. The event featured more than 100 workshops covering topics from product demonstrations to coding classes aimed at making AI models more practical. A central theme was the impact of AI on employment, with skilled engineers seen as in high demand while workers in traditional roles face an uncertain outlook. Debate around agentic or loop-based coding systems was prominent throughout the week; industry leaders, including Microsoft CEO Satya Nadella, had previously signaled that such systems are a major next step. The U.S. government also lifted restrictions on Anthropic's Claude 5 models during the conference period, prompting further discussion of AI capabilities and security implications.

Read more at DEV Community

Programming · DEV Community

Good Software Engineering Practices Are Just Being Rebranded as AI Skills

A satirical piece published on DEV Community argues that many skills promoted as cutting-edge AI development techniques are simply established software engineering best practices in disguise. The author points to research cited at the AI Engineer World's Fair suggesting that well-structured codebases, thorough documentation, and robust CI/CD pipelines measurably improve AI agent performance. Similarly, advice around prompt caching and API cost optimization mirrors long-standing principles of efficient systems design. The piece uses humor to highlight how the industry is repackaging fundamentals, such as clear process documentation and observability, as novel AI-era requirements. The underlying message is that teams investing in solid engineering foundations were already well-positioned for the AI transition, whether they realized it or not.

Read more at DEV Community