BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding
Hao Zhang, Yiming Hu, Yong Wang, Mingqiao Mo, Xin Xiao, Xiangxiang Chu · AMAP, Alibaba Group
The Problem: Fixed Block Size Is Suboptimal
Diffusion-based speculative decoding accelerates LLM inference by using a lightweight diffusion draft model (dLLM) to draft a block of tokens per forward pass via block-level diffusion; the target model then verifies the drafted block in parallel. The standard approach - exemplified by DFlash - fixes the block size at inference time to the same value used during training.
BlockPilot shows this assumption is wrong. The optimal block size varies substantially across individual input samples. Some inputs have strong structural constraints that tolerate large blocks; others are less predictable and benefit from smaller, more conservative blocks. A fixed strategy leaves significant acceleration gains on the table.
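To make "acceptance length" concrete, here is a minimal sketch of the draft-then-verify check under greedy (T=0) matching. The function name and the token IDs are illustrative assumptions; the summary does not spell out DFlash's exact verification rule, and sampling-based verification at T=1 works differently.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that agree with the target model.

    Assumption: greedy verification, where a drafted token is accepted only
    if it equals the token the target model would have produced at that
    position. Everything after the first mismatch is discarded.
    """
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


# Hypothetical token IDs: the draft agrees with the target on the first two
# positions only, so a larger block would have wasted the remaining slots.
draft_block = [11, 42, 7, 99]
target_block = [11, 42, 8, 99]
print(accepted_prefix_length(draft_block, target_block))
```

A block that is too large for a hard-to-predict input spends draft compute on tokens that get rejected, while a block that is too small for an easy input caps how many tokens one step can accept.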
Key Insight: Locality Enables Efficient Learning
Through exhaustive block-size sweeps across multiple datasets, the authors make two critical observations:
- Instance-wise variability - Only a subset of samples achieve optimal performance at the training block size. A substantial fraction prefer different sizes at inference time.
- Strong locality - Despite this variability, the optimal block size concentrates within a narrow window around the training configuration (typically B±3). Outside this range, acceptance length drops sharply.
This locality transforms what could be an expensive online search into a small, discrete classification problem: with a candidate radius k around the training block size B, the predictor only has to choose among 2k+1 candidates.
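The candidate set and its mapping to class indices can be sketched as follows. The helper names are hypothetical, and the values B=16 and k=2 are used only for illustration.

```python
def candidate_block_sizes(train_block_size, radius):
    """Return the 2k+1 block sizes centred on the training block size B."""
    if radius < 0 or train_block_size - radius < 1:
        raise ValueError("radius must keep every candidate block size >= 1")
    return list(range(train_block_size - radius, train_block_size + radius + 1))


def block_size_to_class(block_size, train_block_size, radius):
    """Map a candidate block size to a class index in [0, 2k]."""
    return block_size - (train_block_size - radius)


def class_to_block_size(class_index, train_block_size, radius):
    """Inverse mapping: class index back to a block size."""
    return train_block_size - radius + class_index


candidates = candidate_block_sizes(16, 2)
print(candidates)                      # [14, 15, 16, 17, 18]
print(block_size_to_class(17, 16, 2))  # 3
```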
The Method: Learn to Predict Block Size from Prefilling
BlockPilot introduces a lightweight two-layer MLP predictor that maps the last token's predictive distribution from the target model's prefilling stage to an optimal block size:
Training data construction: For each sample, the offline labeling pipeline enumerates candidate block sizes from the local set {B−k, ..., B+k}, runs full speculative decoding with each, and labels the sample with the block size that maximizes acceptance length. This offline enumeration is cheap (small candidate set) and highly parallelizable.
Prediction at inference: After prefilling, the last token's softmax distribution is fed through the MLP once. The argmax output determines the block size for the entire subsequent decoding process - no per-step overhead.
Minimal cost: The predictor is only a two-layer MLP with hidden size 2048 (~0.01% of the backbone model's parameters) and adds millisecond-level latency.
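The labeling and prediction steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `measure_acceptance_length` is a hypothetical callable standing in for a full speculative decoding run, the tie-break toward B is an assumption, the ReLU activation is an assumption, and the MLP uses tiny dimensions with untrained random weights.

```python
import random


def label_sample(sample, train_block_size, radius, measure_acceptance_length):
    """Pick the candidate block size with the highest acceptance length.

    Assumption: ties go to the candidate closest to the training block size.
    """
    candidates = range(train_block_size - radius, train_block_size + radius + 1)
    scores = {b: measure_acceptance_length(sample, b) for b in candidates}
    best = max(candidates, key=lambda b: (scores[b], -abs(b - train_block_size)))
    return best, scores


class TwoLayerMLP:
    """Toy two-layer MLP (linear -> ReLU -> linear) with random weights."""

    def __init__(self, in_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)] for _ in range(num_classes)]
        self.b2 = [0.0] * num_classes

    def forward(self, x):
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]


def predict_block_size(mlp, last_token_distribution, train_block_size, radius):
    """One forward pass after prefilling; argmax class -> block size."""
    logits = mlp.forward(last_token_distribution)
    best_class = max(range(len(logits)), key=logits.__getitem__)
    return train_block_size - radius + best_class


# Offline labeling with a made-up acceptance-length table for one sample.
fake_scores = {14: 4.1, 15: 5.6, 16: 5.2, 17: 5.0, 18: 3.9}
label, _ = label_sample("sample-0", 16, 2, lambda s, b: fake_scores[b])
print(label)  # 15

# Inference: a toy 8-entry distribution stands in for the vocabulary-sized one.
mlp = TwoLayerMLP(in_dim=8, hidden_dim=16, num_classes=5)
toy_distribution = [0.4, 0.3, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01]
chosen = predict_block_size(mlp, toy_distribution, 16, 2)
```

The chosen block size is computed once per request and then reused for every decoding step, which is why the predictor adds no per-step cost.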
Results
Main Performance (Temperature T=1)
| Model | Metric | DFlash (best fixed) | BlockPilot | Gain |
|---|---|---|---|---|
| Qwen3-4B | Speedup | 3.80× (DFlash-16) | 4.20× | +0.40× |
| Qwen3-4B | Acceptance τ | 5.42 | 5.92 | +0.50 |
| Qwen3-8B | Speedup | 3.55× (DFlash-16) | 3.94× | +0.39× |
| Qwen3-8B | Acceptance τ | 5.03 | 5.73 | +0.70 |
Speedup under temperature T=0 (deterministic):
| Model | DFlash (best fixed) | BlockPilot | Gain |
|---|---|---|---|
| Qwen3-4B | 3.99× | 4.17× | +0.18× |
| Qwen3-8B | 4.42× | 4.66× | +0.24× |
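The Gain columns above report absolute differences. For readers who prefer relative improvements, the short sketch below derives them from the same speedup figures; the dictionary layout and function name are illustrative, and the numbers are copied from the two tables.

```python
# (model, temperature) -> (best fixed DFlash speedup, BlockPilot speedup)
speedups = {
    ("Qwen3-4B", 1): (3.80, 4.20),
    ("Qwen3-8B", 1): (3.55, 3.94),
    ("Qwen3-4B", 0): (3.99, 4.17),
    ("Qwen3-8B", 0): (4.42, 4.66),
}


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline` as a fraction."""
    return improved / baseline - 1.0


for (model, temperature), (baseline, improved) in speedups.items():
    gain = relative_gain(baseline, improved)
    print(f"{model} T={temperature}: {improved - baseline:+.2f}x absolute, {gain:+.1%} relative")
```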
Consistent Across Benchmarks
BlockPilot outperforms all fixed-block DFlash variants and EAGLE-3 across Math (GSM8K, MATH-500, AIME24), Code (HumanEval, MBPP, SWE-Bench), and Chat (MT-Bench) - with gains at both T=0 and T=1, demonstrating robustness to stochasticity.
Ablation Highlights
- Raw prefilling distribution works better than normalized or softmax-preprocessed inputs
- k=2 (candidate radius) provides the best balance of coverage vs. prediction difficulty
- Two-layer MLP (2048 hidden) is sufficient; deeper/wider gives diminishing returns
Why It Matters
BlockPilot makes a simple but powerful point: the decoding policy itself is a learnable component. Existing speculative decoding work has focused on better draft models or verification mechanisms while treating block size as a static hyperparameter. By showing that instance-adaptive block selection yields consistent gains across models, datasets, and temperatures - with negligible overhead and zero changes to the base model - BlockPilot establishes a new dimension for optimization in diffusion-based inference acceleration.
The method is plug-and-play: it integrates seamlessly into existing DFlash-based frameworks without modifying the draft model, target model, or verification procedure.
Paper: arXiv:2606.31315
Code: github.com/AMAP-ML/BlockPilot