IndexCache Cuts Redundant Computation in DeepSeek Sparse Attention by Sharing Index Layers
Researchers from Tsinghua University and Z.ai identified a key bottleneck in DeepSeek Sparse Attention (DSA): although the method reduces the cost of core attention, its token indexer still runs at every layer, which creates an O(NL²) computational overhead at long contexts. Their proposed solution, IndexCache, divides transformer layers into two roles: Full (F) layers, which compute and cache a fresh top-k token selection, and Shared (S) layers, which simply reuse the nearest cached result. The approach is motivated by the empirical observation that adjacent layers select 70–100% of the same important tokens, making repeated indexer runs largely redundant. IndexCache requires no architectural changes, only a single conditional branch in the forward pass, and adds no extra GPU memory. The mechanism underlies the 'IndexShare' feature in GLM-5.2 and was detailed in a 2026 paper by Bai, Dong, and colleagues.
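The F/S split described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the "FSSF" pattern string, and the toy score lists are all hypothetical, and "nearest cached result" is assumed here to mean the most recent preceding F layer.

```python
from typing import List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def forward_with_index_cache(
    pattern: str, layer_scores: Sequence[Sequence[float]], k: int
) -> Tuple[List[List[int]], int]:
    """Walk the layers, running the (toy) indexer only on F layers.

    pattern: one character per layer, "F" (Full) or "S" (Shared); hypothetical encoding.
    layer_scores: stand-in for each layer's indexer output; ignored on S layers.
    Returns the token selection used by each layer and the number of indexer runs.
    """
    if not pattern or pattern[0] != "F":
        raise ValueError("the first layer must be F so that a cache exists")
    if len(pattern) != len(layer_scores):
        raise ValueError("need one role per layer")

    cached: List[int] = []
    indexer_runs = 0
    selections: List[List[int]] = []
    for role, scores in zip(pattern, layer_scores):
        if role == "F":  # the single conditional branch: recompute or reuse
            cached = top_k_indices(scores, k)
            indexer_runs += 1
        selections.append(cached)  # S layers fall through and reuse the cache
    return selections, indexer_runs


# Toy example: 4 layers, 4 tokens, keep the top 2 tokens per layer.
toy_scores = [
    [0.1, 0.9, 0.3, 0.8],  # layer 0 (F)
    [0.9, 0.1, 0.1, 0.1],  # layer 1 (S): scores never read
    [0.2, 0.2, 0.9, 0.9],  # layer 2 (S): scores never read
    [0.7, 0.2, 0.6, 0.1],  # layer 3 (F)
]
selections, indexer_runs = forward_with_index_cache("FSSF", toy_scores, k=2)
print(selections, indexer_runs)
```

In this toy run the indexer executes on two of the four layers, and the two S layers reuse the selection cached by layer 0. A real model would cache index tensors on the accelerator rather than Python lists.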
This is an AI-generated summary. ShortSingh links to the original source for the complete article.