BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding

Authors: Hao Zhang, Yiming Hu, Yong Wang, Mingqiao Mo, Xin Xiao, Xiangxiang Chu

arXiv ID: 2606.31315

Problem: Existing diffusion-based speculative decoding methods use one fixed block size for all inputs, even though the best block size varies from sample to sample.

Key Methodology:

Formulates block-size selection as a lightweight policy learning problem, predicting the optimal block size from the prefilling-stage representation in a single forward pass
Leverages the observation that optimal block sizes exhibit a clear local structure concentrated around the training block size, reducing the problem to a low-dimensional, structured decision space
Uses a plug-and-play design that adds minimal overhead and integrates with existing diffusion-based speculative decoders
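
The selection step described above can be sketched as a small classifier over a handful of candidate block sizes near the training block size. This is a minimal illustration, not the authors' implementation: the mean pooling, the linear head, the candidate window, the training block size of 16, and all names (candidate_block_sizes, BlockSizePolicy) are assumptions made for the example.

```python
import math
from typing import List, Sequence


def candidate_block_sizes(train_block: int, radius: int, step: int = 1) -> List[int]:
    """Candidates in a window centred on the training block size (hypothetical window)."""
    low = train_block - radius * step
    high = train_block + radius * step
    return [b for b in range(low, high + 1, step) if b >= 1]


def mean_pool(hidden: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-token prefill hidden states into one feature vector."""
    n = len(hidden)
    return [sum(column) / n for column in zip(*hidden)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


class BlockSizePolicy:
    """Assumed linear head: pooled prefill feature -> distribution over candidates."""

    def __init__(self, weights, bias, candidates):
        assert len(weights) == len(bias) == len(candidates)
        self.weights = weights        # one weight row per candidate block size
        self.bias = bias
        self.candidates = candidates

    def probs(self, hidden) -> List[float]:
        feat = mean_pool(hidden)
        logits = [
            sum(w * x for w, x in zip(row, feat)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(logits)

    def select(self, hidden) -> int:
        """Pick the most probable candidate in a single pass over the features."""
        p = self.probs(hidden)
        best = max(range(len(p)), key=p.__getitem__)
        return self.candidates[best]


# Toy usage with made-up weights and a 3-token, 2-dimensional "prefill" output.
candidates = candidate_block_sizes(train_block=16, radius=2, step=4)
policy = BlockSizePolicy(
    weights=[[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [1.0, 0.5], [0.0, -1.0]],
    bias=[0.0] * 5,
    candidates=candidates,
)
prefill_hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
chosen_block = policy.select(prefill_hidden)
```

Because the candidate set is a short list around the training block size, the head has only a few outputs, which is one way to read the "low-dimensional, structured decision space" described above. In a real system the weights would be learned from per-sample measurements of which block size performed best.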

Key Results: Achieves an acceptance length of 5.92 and a 4.20× speedup on Qwen3-4B at temperature T=1, consistently outperforming fixed-block-size baselines across diverse settings.
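
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. It assumes acceptance length means the average number of tokens committed per verification step, and speedup means baseline wall-clock time divided by the method's wall-clock time; the summary does not spell out either definition, and the function names and sample numbers are illustrative only.

```python
from typing import Sequence


def acceptance_length(accepted_per_step: Sequence[int]) -> float:
    """Mean number of tokens committed per draft-and-verify step (assumed definition)."""
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    return sum(accepted_per_step) / len(accepted_per_step)


def speedup(baseline_seconds: float, method_seconds: float) -> float:
    """Wall-clock ratio of plain decoding to speculative decoding (assumed definition)."""
    if method_seconds <= 0:
        raise ValueError("method_seconds must be positive")
    return baseline_seconds / method_seconds


# Made-up measurements, not figures from the paper.
example_acceptance = acceptance_length([6, 5, 7, 6])
example_speedup = speedup(baseline_seconds=8.4, method_seconds=2.0)
```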

Applied Context: Teams deploying LLMs can add BlockPilot to an existing diffusion-based speculative decoding pipeline to get faster generation with no accuracy loss. It is most useful for latency-sensitive applications such as real-time chat and code completion, where adapting the speculation length to each request matters.