Chunked Prefill: How One Long Prompt Stalls All Users on an LLM Server

When a user submits a very long prompt to an LLM service, the server's GPU can spend an entire forward-pass step processing it, freezing token generation for every other user until that step finishes. This problem is known as prefill-decode interference. It arises because prefill (processing input tokens) is compute-bound and runs as one large batch, while decode (generating output tokens one at a time) is memory-bandwidth-bound and needs a steady cadence to stream smoothly. Chunked prefill addresses this by splitting long prompts into fixed-size token chunks and interleaving them with decode tokens within each forward pass, which caps step time and keeps streams smooth. In vLLM, the key setting is max_num_batched_tokens, the per-step token budget: lower values around 2048 suit latency-sensitive chat, while values above 8192 favor throughput-heavy workloads. For workloads that require full isolation, disaggregated prefill, which uses separate GPU pools for each phase, offers a more complete solution than interleaving alone.
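The interleaving described in the summary can be illustrated with a toy scheduler. This is a minimal sketch in plain Python, not vLLM's actual scheduler: the `Request` class, `schedule_step`, `apply_step`, and `simulate` are hypothetical names invented for illustration, and the policy shown (decode tokens first, then prefill chunks sized to the leftover budget) is an assumption about one reasonable way to share a per-step token budget.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record, for illustration only."""
    name: str
    prompt_tokens: int      # prompt tokens still waiting to be prefilled
    decoding: bool = False  # True once the whole prompt has been processed


def schedule_step(requests, token_budget):
    """Plan one forward pass under a fixed token budget."""
    plan = {}
    budget = token_budget
    # Decoding requests go first: each needs exactly one token per step.
    for req in requests:
        if req.decoding and budget > 0:
            plan[req.name] = ("decode", 1)
            budget -= 1
    # Leftover budget is spent on pending prefills, cut into chunks that fit.
    for req in requests:
        if not req.decoding and req.prompt_tokens > 0 and budget > 0:
            chunk = min(req.prompt_tokens, budget)
            plan[req.name] = ("prefill", chunk)
            budget -= chunk
    return plan


def apply_step(requests, plan):
    """Update request state after the planned step has run."""
    for req in requests:
        kind, n = plan.get(req.name, (None, 0))
        if kind == "prefill":
            req.prompt_tokens -= n
            if req.prompt_tokens == 0:
                req.decoding = True  # prompt fully processed; decode next step


def simulate(requests, token_budget, steps):
    """Run several steps and return the plan chosen at each one."""
    history = []
    for _ in range(steps):
        plan = schedule_step(requests, token_budget)
        apply_step(requests, plan)
        history.append(plan)
    return history


if __name__ == "__main__":
    # Two active chat streams plus one 5000-token prompt arriving together.
    reqs = [
        Request("chat_a", 0, decoding=True),
        Request("chat_b", 0, decoding=True),
        Request("long_doc", 5000),
    ]
    for i, plan in enumerate(simulate(reqs, token_budget=2048, steps=4), 1):
        print(i, plan)
```

With a budget of 2048, the 5000-token prompt is processed over three steps (2046, 2046, then 908 tokens) while both chat streams still receive one token in every step. Without chunking, the same prompt would occupy a single 5000-token step. The `token_budget` argument plays the role the summary assigns to max_num_batched_tokens in vLLM.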

Read the full story at DEV Community

This is an AI-generated summary. ShortSingh links to the original source for the complete article.

Related stories

Programming · DEV Community

In regulated lending, LLMs handle only a fraction of the AI pipeline

A DEV Community post highlights a common misconception about AI's role in complex, regulated financial workflows. In regulated lending, large language models handle only the final prose-drafting step, while the heavier work — document ingestion, data extraction, validation, and financial calculations — must be performed by deterministic code. Financial calculations in particular cannot be delegated to LLMs, which risk silently producing rounding errors or inaccurate outputs. Canada's OSFI E-21 guideline reinforces this design constraint, requiring human ownership of risk decisions. The author argues that winning AI teams in banking are those who precisely identify the narrow slice of the problem where a language model is actually appropriate.

Programming · DEV Community

Ukraine Vectorizes 33.7M Court Decisions Using Voyage AI for Semantic Legal Search

A Ukrainian legal tech team is embedding the country's entire open-access court decision registry, EDRSR, into a vector database to enable semantic search for lawyers. The project uses Voyage AI's voyage-3.5 model to convert court rulings into 1024-dimensional vectors stored in a self-hosted Qdrant instance on AWS EC2. The database already holds over 44 million vectors across criminal, civil, commercial, and misdemeanor case types, with civil cases — the largest cohort at 33.7 million documents — currently 42% complete. Documents are chunked into segments of up to 2,048 characters to improve retrieval quality, since individual court rulings can run up to 200,000 characters. Once civil case processing is finished, the collection is expected to exceed 63 million vectors, making it roughly 100 times larger than a typical RAG deployment.

Programming · DEV Community

SecondLayer Maps Cost and Design of 860B Legal AI Trained on 2TB Ukrainian Law

Ukrainian legal-tech firm SecondLayer has outlined a hypothetical project to train an 860-billion-parameter Mixture-of-Experts AI model on approximately 2 terabytes of Ukrainian and European legal data hosted on Google Cloud Platform. The corpus includes 96.2 million full-text Ukrainian court decisions, public registries, annotated legislation, Supreme Court rulings, and Spanish and EU legal texts. After deduplication and cleaning, the usable training corpus is estimated at 800–1,000 GB, yielding roughly 280–330 billion tokens, about 50 times smaller than DeepSeek V3's original 14.8 trillion-token dataset. The proposed architecture mirrors DeepSeek V3, which has 671 billion total parameters but activates only 37 billion per token, making high-volume inference more cost-efficient than with dense models. The exercise is presented as a technical breakdown of dataset composition, model architecture, compute costs, and the capabilities such a domain-specific legal model could deliver.