DOPD: Dual On-policy Distillation

The Problem: Privilege Illusion

On-policy distillation (OPD) is a post-training paradigm in which a student model samples its own trajectories and receives dense token-level supervision from a stronger teacher. However, when the teacher is additionally given privileged information that the student does not see (e.g., reasoning hints for LLMs or bounding boxes for VLMs), a subtle failure mode emerges:

Privilege Illusion - The apparent teacher-student gap conflates two fundamentally distinct components:

The genuine capability gap - transferable skills the student should learn

The information asymmetry gap - advantages from privileged inputs the student can never truly acquire

Indiscriminately distilling both causes the student to mimic privileged shortcuts instead of learning real, transferable abilities.

Key Insight: Not All Tokens Are Equal

The authors find that only a small subset of tokens carries the capability-bearing signal. They demonstrate this with a token-pruning ablation:

Pruning the 20% of tokens with the highest advantage gap removes 50-80% of the distillation gains
Pruning random or low-advantage tokens has negligible impact

This validates using the privilege advantage gap (the log-probability difference between privileged teacher and student) as a reliable routing signal for token-level supervision.
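
To make the routing signal concrete, here is a minimal sketch of computing a per-token advantage gap and selecting the top 20% of tokens by that gap. It assumes the gap is the signed difference of log-probabilities on the student-sampled token; the function names (`advantage_gap`, `top_fraction_mask`) and the example numbers are hypothetical and not taken from the paper.

```python
import math

def advantage_gap(teacher_logps, student_logps):
    """Per-token gap: log p_teacher(token | privileged input) - log p_student(token).

    Assumption: both lists hold log-probabilities of the same student-sampled tokens.
    """
    if len(teacher_logps) != len(student_logps):
        raise ValueError("sequences must have the same length")
    return [t - s for t, s in zip(teacher_logps, student_logps)]

def top_fraction_mask(gaps, fraction=0.2):
    """Mark the `fraction` of tokens with the largest gap (ties keep the earlier token)."""
    if not gaps:
        return []
    k = max(1, math.ceil(fraction * len(gaps)))
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(gaps))]

# Hypothetical log-probabilities for a 5-token student rollout.
teacher_logps = [-0.1, -0.2, -0.3, -0.1, -0.5]
student_logps = [-0.2, -2.5, -0.4, -0.1, -0.6]

gaps = advantage_gap(teacher_logps, student_logps)
mask = top_fraction_mask(gaps, fraction=0.2)
print(mask)  # only the second token (gap of about 2.3) is selected
```

In a pruning ablation like the one described above, tokens where `mask` is true would be the ones dropped from the distillation loss.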

The Method: Advantage-Aware Dual Distillation

DOPD dynamically routes each token to one of four supervision strategies based on two signals:

Advantage gap (𝒜) - large = genuine capability gap, small = information asymmetry
Confidence (qₛ, qₜ) - how sure each policy is about the token

Regime	Condition	What It Means	Strategy
LH	Low 𝒜, High confidence	Info asymmetry, both confident	Light teacher Top-K reverse KL
LL	Low 𝒜, Low confidence	Both uncertain	Weak self-regularization anchor
HT	High 𝒜, Teacher confident	Real capability gap	Full-vocabulary JS divergence (strongest)
HS	High 𝒜, Student confident	Student already knows this	Light self-distillation
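
The table above can be read as a small decision rule. The sketch below is one possible reading, not the paper's implementation: the thresholds `gap_threshold` and `conf_threshold` are hypothetical, confidence is assumed to be a probability in [0, 1], a high-gap token is assigned by comparing the two confidences, and a low-gap token counts as "high confidence" only when both policies exceed the threshold.

```python
def route_token(gap, q_student, q_teacher, gap_threshold=1.0, conf_threshold=0.5):
    """Return one of "LH", "LL", "HT", "HS" for a single token.

    All thresholds and tie-breaking rules here are illustrative assumptions.
    """
    if gap >= gap_threshold:
        # High advantage gap: trust whichever policy is more confident.
        return "HT" if q_teacher >= q_student else "HS"
    # Low advantage gap: separate "both confident" from everything else.
    both_confident = min(q_student, q_teacher) >= conf_threshold
    return "LH" if both_confident else "LL"

# Hypothetical tokens: (advantage gap, student confidence, teacher confidence).
tokens = [(2.3, 0.08, 0.82), (1.5, 0.90, 0.40), (0.1, 0.90, 0.95), (0.0, 0.30, 0.35)]
regimes = [route_token(g, qs, qt) for g, qs, qt in tokens]
print(regimes)  # ['HT', 'HS', 'LH', 'LL']
```

Each regime label would then select the corresponding loss from the Strategy column.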

The key innovation: strong teacher supervision is applied only when the teacher demonstrates a credible capability advantage, avoiding the trap of distilling information-asymmetry noise.
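
For reference, the two divergences named in the Strategy column can be sketched over explicit probability vectors. This is a minimal illustration under stated assumptions: renormalizing both distributions over the teacher's top-K tokens is a choice made here for simplicity, the function names are hypothetical, and the per-regime loss weights ("light", "weak") are not specified in this summary and so are omitted.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence over the full vocabulary (symmetric, at most ln 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def topk_reverse_kl(student, teacher, k):
    """Reverse KL(student || teacher) restricted to the teacher's top-k tokens.

    Assumption: both distributions are renormalized over those k tokens,
    and the student assigns them non-zero total mass.
    """
    idx = sorted(range(len(teacher)), key=lambda i: teacher[i], reverse=True)[:k]
    s = [student[i] for i in idx]
    t = [teacher[i] for i in idx]
    zs, zt = sum(s), sum(t)
    return kl([x / zs for x in s], [x / zt for x in t])

# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.1, 0.2, 0.7]

print(js_divergence(student, teacher))       # strong, symmetric signal (HT regime)
print(topk_reverse_kl(student, teacher, 2))  # lighter, mode-seeking signal (LH regime)
```

Reverse KL penalizes the student for placing mass where the teacher places little, while the JS divergence compares the two distributions symmetrically over every token.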

Results

Main Performance

Setting	Avg Gain vs Vanilla OPD	Gap Recovery
LLM (Qwen3-8B → 1.7B)	+7.5 pts across 8 benchmarks	89.8%
VLM (Qwen3-VL-8B → 2B)	+6.0 pts across 8 benchmarks	69.2%

The DOPD-trained student surpasses its teacher on 4 of the 8 LLM benchmarks, most notably on reasoning and coding.

Against Strong Baselines

Baseline	DOPD Improvement
ExOPD	+4.4 pts
Uni-OPD	+4.8 pts
EOPD	+5.3 pts

Scale Robustness

Across 5 teacher-student pairs (from 8B→0.6B to 4B→1.7B):

DOPD consistently gains +11–14 pts, compared with +3.5–5 pts for Vanilla OPD
At the largest mismatch (8B→0.6B), DOPD gains 14.1 pts versus 3.5 pts for Vanilla OPD, roughly a 4× larger gain

Additional Findings

Entropy collapse avoided - privileged OPD variants collapse; DOPD maintains healthy entropy
Continual learning - retains knowledge across sequential domains (general → reasoning → coding)
OOD generalization - strong transfer when trained on one domain, tested on another
Ablation - every token regime contributes; removing any hurts performance
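
The entropy-collapse finding in the list above refers to the student's next-token distributions becoming nearly deterministic during training. A simple way to watch for this is to track the mean token entropy of the student over sampled rollouts. The sketch below is a generic diagnostic, not the paper's evaluation code, and the distributions are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(dists):
    """Average entropy over a sequence of next-token distributions."""
    if not dists:
        raise ValueError("need at least one distribution")
    return sum(token_entropy(d) for d in dists) / len(dists)

# Hypothetical rollouts: a spread-out policy vs. a nearly collapsed one.
healthy = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
collapsed = [[0.997, 0.001, 0.001, 0.001], [1.0, 0.0, 0.0, 0.0]]

print(mean_entropy(healthy) > mean_entropy(collapsed))  # True
```

A mean entropy that drifts steadily toward zero over training steps is the usual sign of collapse.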

Why It Matters

As models get larger and distillation takes on a bigger role in post-training, DOPD offers a principled answer to a critical question: when should a student trust its teacher, and when should it trust itself? The reported results suggest that advantage-aware routing is both well motivated and practically effective, making it a strong candidate for future LLM/VLM post-training pipelines.

Paper: arXiv:2606.30626
Code: Not yet released