Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which lowers token acceptance and reduces overall effectiveness. While small drafting heads trained from scratch compensate with speed, they struggle when verification dominates latency or when inputs are out of distribution. In contrast, pretrained drafters, though slower, achieve higher acceptance rates thanks to stronger standalone generation capabilities, making them competitive when drafting latency is negligible relative to verification or communication overhead. In this work, we aim to improve the acceptance rates of pretrained drafters by introducing a lightweight dynamic alignment mechanism: a steering vector computed from the verifier’s hidden states and injected into the pretrained drafter. Compared to existing offline alignment methods such as distillation, our approach boosts the number of accepted tokens by up to 35% under standard sampling and 22% under greedy sampling, while incurring negligible computational overhead. Importantly, our approach can be retrofitted to existing architectures and pretrained models, enabling rapid adoption.
Overview
Figure 1: Overview of different drafting paradigms. Independent drafting uses a smaller model with no access to the verifier's internal state. Dependent drafting (e.g., EAGLE-3) uses lightweight heads trained to read the verifier's hidden states. SD² strikes a middle ground, leveraging verifier features for steering while retaining the generalization capabilities of independent drafters.
Speculative decoding accelerates LLM inference by having a lightweight drafter propose tokens that are then verified in parallel by the larger model. The main bottleneck is drafter-verifier misalignment: the more the drafter’s predictions diverge from the verifier’s, the more tokens get rejected.
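The draft-then-verify loop can be sketched as follows. This is a minimal greedy-acceptance illustration, not the paper's implementation: `drafter` and `verifier` are assumed to be callables mapping a token tensor of shape `(1, seq)` to logits of shape `(1, seq, vocab)`.

```python
import torch

def speculative_step(drafter, verifier, prefix, k=4):
    """One draft-then-verify block: the drafter proposes k tokens,
    the verifier checks them all in a single parallel forward pass."""
    # Drafter proposes k tokens autoregressively (fast, small model).
    draft = prefix.clone()
    for _ in range(k):
        logits = drafter(draft)                      # (1, seq, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    # Verifier scores the whole draft at once (one forward pass).
    v_logits = verifier(draft)
    v_pred = v_logits[:, :-1].argmax(-1)             # verifier's choices

    # Accept the longest drafted prefix the verifier agrees with.
    n_prefix = prefix.shape[-1]
    accepted = n_prefix
    for i in range(n_prefix, draft.shape[-1]):
        if draft[0, i] != v_pred[0, i - 1]:
            break
        accepted = i + 1
    return draft[:, :accepted]
```

The more the drafter's argmax choices diverge from the verifier's, the earlier the loop breaks and the fewer tokens each block yields, which is exactly the misalignment cost described above.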
Pretrained drafters generalize well but lack dynamic alignment with the verifier. Dependent drafters (trained from scratch on verifier hidden states) are fast but brittle outside their training distribution. SD² bridges this gap by introducing a lightweight steering mechanism that extracts predictive signals from the verifier’s hidden states and injects them into the pretrained drafter at inference time.
How it works
Figure 2: The steering mechanism concatenates the verifier's high-, medium-, and low-level hidden features and passes them through a linear projection to produce a steering vector. This vector is transformed into a set of biases added to all MLP hidden states in the drafter, just before the activation function.
The method has three components:
Steering vector extraction: After verification, the verifier’s hidden states at the last accepted position are concatenated across three layers (high, mid, low) and linearly projected into a compact steering vector.
Bias injection: The steering vector is mapped through a second linear layer into per-layer biases, which are added inside the drafter’s SwiGLU MLP (before the gating activation) at every layer.
Training: The steering parameters and the drafter are jointly fine-tuned using KL divergence against the verifier’s token distribution, with a random offset to simulate all drafting positions uniformly.
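The first two components can be put together in a short PyTorch sketch. All dimensions, layer choices, and module names here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Steering(nn.Module):
    """Maps verifier hidden states to per-layer drafter biases."""
    def __init__(self, d_verifier, d_steer, n_layers, d_hidden):
        super().__init__()
        # Extraction: concatenated (low, mid, high) verifier features
        # -> compact steering vector.
        self.extract = nn.Linear(3 * d_verifier, d_steer)
        # Injection: steering vector -> one bias per drafter MLP layer.
        self.inject = nn.Linear(d_steer, n_layers * d_hidden)
        self.n_layers, self.d_hidden = n_layers, d_hidden

    def forward(self, h_low, h_mid, h_high):
        z = self.extract(torch.cat([h_low, h_mid, h_high], dim=-1))
        return self.inject(z).view(-1, self.n_layers, self.d_hidden)

class SteeredSwiGLU(nn.Module):
    """Drafter SwiGLU MLP with the steering bias added before gating."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x, steer_bias):
        # The bias is injected into the pre-activation gate input.
        gate = F.silu(self.gate_proj(x) + steer_bias)
        return self.down_proj(gate * self.up_proj(x))
```

Because the steering path is just two linear maps plus an additive bias, its cost is negligible next to a drafter forward pass, which is where the "lightweight" claim comes from.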
Figure 3: Training aligns the steered drafter's distribution to the verifier's. A random offset simulates different drafting positions so the steering mechanism learns to encode information about the next k tokens.
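The training objective can be sketched as a distribution-matching loss. This is a simplified illustration: it assumes drafter and verifier logits have already been computed at positions shifted by the sampled offset.

```python
import torch
import torch.nn.functional as F

def steering_loss(drafter_logits, verifier_logits):
    """KL divergence from the verifier's token distribution to the
    steered drafter's, over (batch, seq, vocab) logit tensors."""
    log_p_draft = F.log_softmax(drafter_logits, dim=-1)
    p_verif = F.softmax(verifier_logits, dim=-1)
    return F.kl_div(log_p_draft, p_verif, reduction="batchmean")

def sample_offset(k):
    """Uniform random offset in [0, k), so that every drafting
    position within a k-token block is trained equally often."""
    return torch.randint(0, k, (1,)).item()
```

The loss is zero exactly when the steered drafter reproduces the verifier's distribution, so minimizing it directly targets the acceptance criterion used at inference time.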
Results
SD² is evaluated across four verifier-drafter pairs (Vicuna 1.3 13B / Llama 160M, Qwen3 14B / Qwen3 0.6B, Qwen3 8B / Qwen3 0.6B, Llama 3.1 8B / Llama 3.2 1B) on five benchmarks (UltraChat, HumanEval, XSum, Alpaca, GSM8K).
Figure 4: Average number of accepted tokens per block across all tasks. Solid bars: T=1 (sampling); hatched bars: T=0 (greedy). SD² consistently achieves higher acceptance rates than both pretrained and distilled drafters. For Vicuna 1.3 with the small Llama 160M drafter, SD² closes the gap with the dependent EAGLE-3 head.
Key numbers
Up to +35% more accepted tokens under standard sampling and +22% under greedy sampling compared to pretrained drafters.
Up to +83% speedup on in-distribution data (Vicuna 1.3), and +61% average speedup across all tasks for that pair.
SD² consistently outperforms distillation, achieving roughly twice the additive speedup under standard sampling.
On out-of-distribution tasks (GSM8K, HumanEval), SD² matches or exceeds pretrained drafter performance, while distillation often degrades.
Long-sequence robustness
Figure 5: Accepted tokens per block at different positions in the generated sequence. SD² maintains a consistent advantage over distillation across all token positions, confirming that steering works well with increasing sequence length.
Ablations
Figure 6: Ablation on Vicuna 1.3 7B / Llama 160M. SD²'s design (bias in MLP before gating + unfrozen drafter) consistently outperforms alternatives: post-MLP bias, conditional bias, and frozen-drafter variants. Steering alone (frozen drafter) still doubles accepted tokens over the pretrained baseline.
The ablation studies justify two key design choices:
Steering mechanism: Adding the bias inside the SwiGLU (before gating) outperforms simpler post-MLP injection and more complex conditional variants.
Drafter fine-tuning: Unfreezing the drafter during training yields consistently better results, though steering alone already doubles accepted tokens over the pretrained baseline.
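The two injection points compared in the ablation can be shown side by side in a schematic sketch (weight shapes are arbitrary; only the placement of the bias differs between the variants):

```python
import torch
import torch.nn.functional as F

def swiglu(x, w_gate, w_up, w_down, bias=None, post_bias=None):
    """SwiGLU MLP with two candidate bias-injection points."""
    pre = x @ w_gate
    if bias is not None:       # SD2's choice: before the gating activation
        pre = pre + bias
    out = (F.silu(pre) * (x @ w_up)) @ w_down
    if post_bias is not None:  # ablated alternative: after the MLP
        out = out + post_bias
    return out
```

Injecting before the gate lets the steering signal modulate which hidden units fire at all, whereas a post-MLP bias can only shift the output additively, which is consistent with the ablation favoring the pre-gating variant.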
Key takeaway: A lightweight steering vector extracted from the verifier's hidden states and injected as a bias into the drafter's MLP layers can substantially improve speculative decoding acceptance rates, with negligible computational overhead and full compatibility with existing pretrained drafter-verifier pairs.
Citation
@inproceedings{berdoz2026steering,
  author    = {Berdoz, F. and Rheinboldt, P. and Wattenhofer, R.},
  title     = {{Steering Pretrained Drafters during Speculative Decoding}},
  booktitle = {{AAAI Conference on Artificial Intelligence (AAAI)}},
  year      = {2026}
}