Speculative Decoding Benchmarked on CPU: Acceptance Rates Vary Sharply by Task
A developer ran a controlled benchmark of speculative decoding (SD) on a CPU-only machine, using Qwen2.5-0.5B as the draft model and Qwen2.5-1.5B as the target, across code, JSON, and story generation tasks. SD was 49–62% slower than standard autoregressive generation on every task type, consistent with the theoretical inequality that governs when SD wins or loses. Mean acceptance length varied by task: 3.50 tokens for JSON, 3.00 for code, and 2.11 for creative story generation, reflecting that structured output is easier for a draft model to predict. In 15–30% of draft rounds, no draft tokens were accepted, so the system paid the full compute cost of both the draft and target passes to produce a single token. The author notes that the CPU speed numbers are not directly transferable, but that the acceptance-length patterns are relevant to GPU deployments and suggest task type is a stronger predictor of SD gains than model size alone.
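The summary does not spell out the inequality, so the following is only a rough sketch of the usual cost model, not the author's code. It assumes each round costs `draft_len` draft passes plus one target verification pass, and that verifying several tokens costs about as much as one target pass (which may not hold on a compute-bound CPU). The names `sd_speedup`, `break_even_cost_ratio`, `draft_len`, and `cost_ratio` are hypothetical, and the draft length of 4 is an illustrative value, not one reported in the benchmark. The reported mean acceptance lengths are treated here as tokens emitted per round, which is also an assumption.

```python
def sd_speedup(tokens_per_round: float, draft_len: int, cost_ratio: float) -> float:
    """Estimate SD speedup over plain autoregressive decoding (>1 means SD wins).

    tokens_per_round: average tokens emitted per draft-and-verify round
    draft_len:        tokens the draft model proposes per round (hypothetical)
    cost_ratio:       draft pass time / target pass time (assumed, not measured)
    """
    if tokens_per_round <= 0 or draft_len < 1 or cost_ratio < 0:
        raise ValueError("invalid cost-model parameters")
    # Round cost in units of one target pass: draft_len draft passes + 1 verify pass.
    round_cost = draft_len * cost_ratio + 1.0
    # Plain decoding emits one token per target pass, so this ratio is the speedup.
    return tokens_per_round / round_cost


def break_even_cost_ratio(tokens_per_round: float, draft_len: int) -> float:
    """Largest draft/target cost ratio at which SD still breaks even."""
    return (tokens_per_round - 1.0) / draft_len


# Mean acceptance lengths reported in the summary, used as tokens per round (assumption).
reported = {"json": 3.50, "code": 3.00, "story": 2.11}

for task, tokens in reported.items():
    limit = break_even_cost_ratio(tokens, draft_len=4)
    print(f"{task}: SD breaks even only if draft/target cost ratio <= {limit:.2f}")
```

Under these assumptions, a lower acceptance length tightens the break-even cost ratio, which is one way to read why the story task is the hardest case for SD.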
This is an AI-generated summary. ShortSingh links to the original source for the complete article.