CacheWeaver Cuts RAG Response Latency Up to 33% by Reordering Prompt Evidence
On June 18, 2026, researchers published CacheWeaver, a prompt-layer technique designed to reduce time-to-first-token in retrieval-augmented generation (RAG) systems. The method reorders retrieved evidence chunks within the prompt to maximize reuse of the serving engine's KV prefix cache, without modifying the engine itself or the retrieved documents. Because a prefix cache can only reuse computation for tokens that match from the start of a prompt, the order in which evidence chunks appear determines how much cached computation can be skipped. Tested across three vLLM configurations, CacheWeaver reduced median time-to-first-token by roughly 20–33% compared to naive retrieval-order caching, achieving 97.5% of the theoretical maximum gain from an oracle ordering. No degradation in answer quality was observed in the reported evaluations.
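To make the ordering effect concrete, here is a minimal Python sketch. It is not CacheWeaver's published algorithm, which this summary does not describe; it is a simple greedy heuristic written for illustration. The function names `shared_prefix_len` and `reorder_for_prefix_reuse`, the chunk labels, and the idea of tracking previously served chunk orderings are all assumptions. The sketch treats each chunk as an opaque ID and counts how many leading chunks of a new prompt match an ordering that was already served, since only that leading run can be reused from a prefix cache.

```python
from typing import List, Sequence


def shared_prefix_len(cached: Sequence[str], prompt: Sequence[str]) -> int:
    """Count leading chunks that are identical, in order, in both sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def reorder_for_prefix_reuse(
    retrieved: Sequence[str],
    cached_orders: Sequence[Sequence[str]],
) -> List[str]:
    """Hypothetical greedy heuristic (an assumption, not the paper's method).

    Find the previously served ordering whose leading run of chunks is
    entirely contained in the newly retrieved set, place that run first,
    then append the remaining chunks in their original retrieval order.
    Assumes chunk IDs in `retrieved` are unique.
    """
    retrieved_set = set(retrieved)
    best: List[str] = []
    for order in cached_orders:
        run: List[str] = []
        for chunk in order:
            if chunk not in retrieved_set:
                break  # the reusable prefix ends at the first missing chunk
            run.append(chunk)
        if len(run) > len(best):
            best = run
    in_run = set(best)
    return best + [c for c in retrieved if c not in in_run]


# Illustrative data: one earlier prompt used chunks A, B, C in that order.
cached_orders = [["A", "B", "C"]]
retrieved = ["C", "A", "D", "B"]  # retriever's ranking for the new query

naive_hit = shared_prefix_len(cached_orders[0], retrieved)
reordered = reorder_for_prefix_reuse(retrieved, cached_orders)
reordered_hit = shared_prefix_len(cached_orders[0], reordered)

print(naive_hit, reordered, reordered_hit)
```

In this toy case the retrieval order shares no leading chunk with the cached prompt, while the reordered prompt shares three. A real system would measure the match in tokens, including any system prompt placed before the evidence, and would need to confirm that reordering does not harm answer quality.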
This is an AI-generated summary. ShortSingh links to the original source for the complete article.