Chunked Prefill: How One Long Prompt Stalls All Users on an LLM Server
When a user submits a very long prompt to an LLM service, the GPU can spend an entire forward-pass step processing it, stalling token generation for every other user until that step finishes. This problem is known as prefill-decode interference. It arises because prefill (processing input tokens) is compute-bound and handles the whole prompt in one large batch, while decode (generating output tokens one at a time) is memory-bandwidth-bound and needs a steady cadence to stream smoothly. Chunked prefill addresses this by splitting long prompts into fixed-size token chunks and interleaving those chunks with decode tokens in each forward pass, which caps step time and keeps streams smooth. In vLLM, the key setting is max_num_batched_tokens: lower values around 2048 suit latency-sensitive chat, while higher values above 8192 favor throughput-heavy workloads. For workloads that require full isolation, disaggregated prefill, which uses separate GPU pools for each phase, offers a more complete solution than interleaving alone.
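The interleaving idea can be illustrated with a toy scheduler. This is a minimal sketch, not vLLM's actual scheduler: the names Request, schedule_step, and token_budget are hypothetical, and serving decode tokens before prefill chunks is an assumption made for the illustration. The token_budget parameter plays the role that max_num_batched_tokens plays in the summary above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int      # prompt tokens still waiting for prefill
    decoding: bool = False  # True once the whole prompt has been prefilled


def schedule_step(requests, token_budget):
    """Plan one forward pass under a per-step token budget (toy model).

    Running streams get one decode token each first (an assumption of this
    sketch); leftover budget is spent on prefill chunks of waiting prompts.
    Returns a list of (request_name, kind, num_tokens) tuples.
    """
    plan = []
    remaining = token_budget

    # Decode first so running streams keep their cadence.
    for r in requests:
        if r.decoding and remaining > 0:
            plan.append((r.name, "decode", 1))
            remaining -= 1

    # Spend whatever budget is left on prefill chunks.
    for r in requests:
        if not r.decoding and r.prompt_tokens > 0 and remaining > 0:
            chunk = min(r.prompt_tokens, remaining)
            plan.append((r.name, "prefill", chunk))
            r.prompt_tokens -= chunk
            remaining -= chunk
            if r.prompt_tokens == 0:
                r.decoding = True  # begins decoding on the next step

    return plan


# Two active chat streams plus one 5000-token prompt, budget of 2048 per step.
requests = [
    Request("chat-a", 0, decoding=True),
    Request("chat-b", 0, decoding=True),
    Request("long-doc", 5000),
]

steps = []
while any(not r.decoding for r in requests):
    steps.append(schedule_step(requests, token_budget=2048))

for i, plan in enumerate(steps, start=1):
    print(f"step {i}: {plan}")
```

In this toy run the 5000-token prompt is spread over three steps, and both chat streams receive a decode token in every one of them, so no single step exceeds the budget. Without chunking, the same prompt would occupy one oversized step during which neither stream advances.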
This is an AI-generated summary. ShortSingh links to the original source for the complete article.