HydraHead could cut attention FLOPs by roughly 40% by mixing attention types within layers
Researchers have developed HydraHead, a technique that blends full attention and linear attention at the level of individual heads within each transformer layer, rather than swapping entire layers. The method reserves costly quadratic full-attention computation for just 25% of heads, while the remaining 75% use a cheaper linear module called GDN, giving a 3:1 linear-to-full ratio. HydraHead matches the benchmark performance of conventional 3:1 layer-wise hybrid models, even at linear-to-full head ratios as high as 7:1, where only one head in eight uses full attention. The approach was evaluated on long-context reading and reasoning tasks after training on 15 billion tokens, and could reduce attention-related FLOPs by roughly 40%. If the gains hold broadly, the technique could enable larger context windows or allow bigger models to run on lower-end hardware.
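To make the head ratios concrete, here is a minimal Python sketch of how a layer's heads might be divided at a given linear-to-full ratio, with a rough cost comparison. It is not HydraHead's implementation. The function names, the 32-head layer, and the cost formulas are all assumptions made for illustration: the full-attention cost counts only the two quadratic matrix products, and the linear-head cost is a generic linear-attention estimate, not a measurement of GDN. The sketch does not attempt to reproduce the roughly 40% figure.

```python
from __future__ import annotations


def split_heads(num_heads: int, linear_to_full: int) -> tuple[int, int]:
    """Return (full_heads, linear_heads) for a ratio of linear_to_full:1.

    Hypothetical helper: assumes the heads divide evenly into groups of
    (linear_to_full + 1), each group holding one full-attention head.
    """
    group = linear_to_full + 1
    if num_heads % group != 0:
        raise ValueError("num_heads must be a multiple of linear_to_full + 1")
    full = num_heads // group
    return full, num_heads - full


def full_head_flops(seq_len: int, head_dim: int) -> int:
    # Q @ K^T and scores @ V: each is seq_len * seq_len * head_dim
    # multiply-adds, counted as 2 FLOPs apiece. Quadratic in seq_len.
    return 4 * seq_len * seq_len * head_dim


def linear_head_flops(seq_len: int, head_dim: int) -> int:
    # ASSUMED cost model for a generic linear-attention head: a
    # head_dim x head_dim state update and readout per token.
    # Linear in seq_len. This is not a measurement of GDN.
    return 4 * seq_len * head_dim * head_dim


def hybrid_attention_flops(
    num_heads: int, linear_to_full: int, seq_len: int, head_dim: int
) -> int:
    """Estimated attention FLOPs for one layer with mixed head types."""
    full, linear = split_heads(num_heads, linear_to_full)
    return (full * full_head_flops(seq_len, head_dim)
            + linear * linear_head_flops(seq_len, head_dim))


if __name__ == "__main__":
    for ratio in (3, 7):
        full, linear = split_heads(32, ratio)
        cost = hybrid_attention_flops(32, ratio, seq_len=8192, head_dim=128)
        print(f"{ratio}:1 -> {full} full heads, {linear} linear heads, "
              f"~{cost:.3e} attention FLOPs per layer")
```

With 32 heads, a 3:1 ratio leaves 8 full-attention heads and a 7:1 ratio leaves 4. Under this simplified model the quadratic heads dominate whenever the sequence length is much larger than the head dimension, which is why reducing their number matters most for long-context workloads.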
This is an AI-generated summary. ShortSingh links to the original source for the complete article.