Why GPT Miscounts Letters in 'Strawberry': BPE Tokenization Explained
Large language models do not read text letter by letter. They process it as chunks called tokens, produced by an algorithm called Byte-Pair Encoding (BPE). BPE starts from individual characters or bytes and repeatedly merges the most frequent adjacent pair of symbols in the training data until the vocabulary reaches a target size (roughly 50,000 tokens in GPT-2 and GPT-3; newer models use larger vocabularies). As a result, a word like 'strawberry' can be split into chunks such as 'straw' and 'berry', with the exact split depending on the tokenizer. The model receives one ID per chunk rather than the letters inside it, so the three 'r's are never presented as separate characters, which helps explain why AI systems often miscount letters. Capitalization and punctuation can also change how a word is tokenized, sometimes multiplying the token count and, because APIs bill per token, the cost. An interactive BPE simulator has been released to help users watch tokens form in real time and see these limitations firsthand.
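The merge loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the tokenizer any GPT model actually uses: the function names (`train_bpe`, `merge_pair`, `tokenize`), the three-word corpus, and the limit of 8 merges are all made up for the example, and production tokenizers work on bytes, pre-split text with regular expressions, and learn tens of thousands of merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy corpus)."""
    # Each distinct word starts as a tuple of single characters, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge everywhere before counting again.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Split `word` into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; real training data is vastly larger.
corpus = ["straw"] * 5 + ["berry"] * 5 + ["strawberry"] * 3
merges = train_bpe(corpus, num_merges=8)

print(tokenize("strawberry", merges))  # ['straw', 'berry']
print(tokenize("Strawberry", merges))  # ['S', 't', 'r', 'a', 'w', 'berry']
```

In this toy setup the lowercase word collapses into two tokens, so a model working on token IDs would see two opaque symbols and none of the individual 'r's. The capitalized form matches none of the 'straw' merges and falls back to six tokens, which mirrors how casing can inflate token counts.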
This is an AI-generated summary. ShortSingh links to the original source for the complete article.
