DOPD: Dual On-policy Distillation
Gen Li, Qingyi Si, Guibin Zhang, Yuqi Xu, Congcong Wang, Shuai Dong, Kaiwen Tuo, Xiangyu Zeng, Kaituo Feng, Qunzhong Wang, Yang Shi, Xiaobin Hu, Xiangyu Yue, Jiaqi Wang, Shuicheng Yan · NUS, MMLab CUHK, PKU, JD Explore Academy
The Problem: Privilege Illusion
On-policy distillation (OPD) is a powerful post-training paradigm where a student model samples its own trajectories and receives dense token-level supervision from a stronger teacher. However, when privileged information (e.g., reasoning hints for LLMs or bounding boxes for VLMs) is introduced, a subtle failure mode emerges:
Privilege Illusion - The apparent teacher-student gap conflates two fundamentally distinct components:
- The genuine capability gap - transferable skills the student should learn
- The information asymmetry gap - advantages from privileged inputs the student can never truly acquire
Indiscriminately distilling both causes the student to mimic privileged shortcuts instead of learning real, transferable abilities.
Key Insight: Not All Tokens Are Equal
The authors discover that only a small subset of tokens carries critical capability-bearing signals. They demonstrate this through an elegant ablation:
- Pruning the 20% highest-advantage-gap tokens destroys 50-80% of distillation gains
- Pruning random or low-advantage tokens has negligible impact
This validates using the privilege advantage gap (the log-probability difference between privileged teacher and student) as a reliable routing signal for token-level supervision.
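The routing signal described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the gap is taken per sampled token as teacher log-probability minus student log-probability (the sign convention is an assumption), and the function names `advantage_gap` and `top_gap_mask` are hypothetical.

```python
import math


def advantage_gap(teacher_logprobs, student_logprobs):
    """Per-token gap: log p_teacher(token | privileged input) - log p_student(token).

    Assumption: both lists hold log-probabilities of the same sampled tokens.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


def top_gap_mask(gaps, fraction=0.2):
    """Mark the `fraction` of tokens with the largest gap (e.g. the top 20%)."""
    k = math.ceil(fraction * len(gaps))
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    top = set(order[:k])
    return [i in top for i in range(len(gaps))]


# Toy example: only the second token shows a large teacher advantage.
teacher_lp = [-0.1, -0.2, -0.3, -0.1, -0.5]
student_lp = [-0.2, -2.5, -0.4, -0.1, -0.6]
gaps = advantage_gap(teacher_lp, student_lp)
mask = top_gap_mask(gaps, fraction=0.2)
print(mask)  # [False, True, False, False, False]
```

In the pruning ablation, a mask like this would select the tokens whose supervision is dropped. The toy numbers are made up purely for illustration.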
The Method: Advantage-Aware Dual Distillation
DOPD dynamically routes each token to one of four supervision strategies based on two signals:
- Advantage gap (Δ) - large = genuine capability gap, small = information asymmetry
- Confidence (q_T, q_S) - how sure each policy (teacher, student) is about the token
| Regime | Condition | What It Means | Strategy |
|---|---|---|---|
| LH | Low Δ, High confidence | Info asymmetry, both confident | Light teacher Top-K reverse KL |
| LL | Low Δ, Low confidence | Both uncertain | Weak self-regularization anchor |
| HT | High Δ, Teacher confident | Real capability gap | Full-vocabulary JS divergence (strongest) |
| HS | High Δ, Student confident | Student already knows this | Light self-distillation |
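The four-way routing in the table can be sketched as a small decision function. This is a hedged illustration only: the thresholds `gap_threshold` and `conf_threshold`, the tie-breaking rules for mixed-confidence cases, and the names `Regime` and `route_token` are all assumptions that do not come from the paper.

```python
from enum import Enum


class Regime(Enum):
    LH = "light teacher top-K reverse KL"
    LL = "weak self-regularization anchor"
    HT = "full-vocabulary JS divergence"
    HS = "light self-distillation"


def route_token(gap, q_teacher, q_student, gap_threshold=1.0, conf_threshold=0.5):
    """Pick a supervision regime for one token.

    gap        -- advantage gap for the token (teacher minus student log-prob)
    q_teacher  -- teacher confidence in [0, 1]
    q_student  -- student confidence in [0, 1]
    Both thresholds are illustrative placeholders, not values from the paper.
    """
    if gap >= gap_threshold:
        # High gap: assumed rule, the more confident policy decides the regime.
        return Regime.HT if q_teacher >= q_student else Regime.HS
    # Low gap: assumed rule, "high confidence" means both policies are confident.
    if min(q_teacher, q_student) >= conf_threshold:
        return Regime.LH
    return Regime.LL


# One toy token per regime.
for gap, qt, qs in [(0.1, 0.9, 0.8), (0.1, 0.3, 0.2), (2.0, 0.9, 0.3), (2.0, 0.2, 0.8)]:
    regime = route_token(gap, qt, qs)
    print(gap, qt, qs, regime.name, "->", regime.value)
```

The table does not say how mixed cases are resolved (for example, a low gap with only one confident policy), so the rules above are one plausible reading rather than the published criterion.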
The key innovation: strong teacher supervision is applied only when the teacher demonstrates a credible capability advantage, avoiding the trap of distilling information-asymmetry noise.
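For reference, the two teacher-side objectives named in the table, Top-K reverse KL and full-vocabulary JS divergence, can be written down for a single token position. This sketch uses the standard textbook definitions. How DOPD truncates, renormalizes, and weights them is not stated here, so taking the top-K set from the teacher and renormalizing both distributions over it are assumptions, as are the function names.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence over the full vocabulary (symmetric, at most log 2)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topk_reverse_kl(student, teacher, k):
    """Reverse KL, KL(student || teacher), restricted to the teacher's top-k tokens.

    Assumption: both distributions are renormalized over the selected indices
    and have strictly positive mass there (as softmax outputs do).
    """
    idx = sorted(range(len(teacher)), key=lambda i: teacher[i], reverse=True)[:k]
    s = [student[i] for i in idx]
    t = [teacher[i] for i in idx]
    s_sum, t_sum = sum(s), sum(t)
    s = [x / s_sum for x in s]
    t = [x / t_sum for x in t]
    return kl(s, t)


# Toy next-token distributions over a 4-token vocabulary.
teacher_probs = [0.70, 0.20, 0.05, 0.05]
student_probs = [0.40, 0.30, 0.20, 0.10]
print("JS:", js_divergence(teacher_probs, student_probs))
print("Top-2 reverse KL:", topk_reverse_kl(student_probs, teacher_probs, k=2))
```

JS divergence is bounded and symmetric, while reverse KL is mode-seeking. This sketch only shows what each term measures and does not reproduce the paper's loss weighting.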
Results
Main Performance
| Setting | Avg Gain vs Vanilla OPD | Gap Recovery |
|---|---|---|
| LLM (Qwen3-8B → 1.7B) | +7.5 pts across 8 benchmarks | 89.8% |
| VLM (Qwen3-VL-8B → 2B) | +6.0 pts across 8 benchmarks | 69.2% |
DOPD actually surpasses the teacher on 4 of 8 LLM benchmarks (especially reasoning & coding).
Against Strong Baselines
| Baseline | DOPD Improvement |
|---|---|
| ExOPD | +4.4 pts |
| Uni-OPD | +4.8 pts |
| EOPD | +5.3 pts |
Scale Robustness
Across 5 teacher-student pairs (from 8B→0.6B to 4B→1.7B):
- DOPD achieves +11–14 pts consistently vs Vanilla OPD's +3.5–5 pts
- At the largest mismatch (8B→0.6B): DOPD = 14.1 pts vs Vanilla = 3.5 pts - a 4× improvement
Additional Findings
- Entropy collapse avoided - privileged OPD variants collapse; DOPD maintains healthy entropy
- Continual learning - retains knowledge across sequential domains (general → reasoning → coding)
- OOD generalization - strong transfer when trained on one domain, tested on another
- Ablation - every token regime contributes; removing any hurts performance
Why It Matters
As models get larger and distillation becomes the primary post-training paradigm, DOPD provides a principled answer to a critical question: when should a student trust its teacher, and when should it trust itself? The advantage-aware routing mechanism is both theoretically grounded and practically effective, making it a strong candidate for the next generation of LLM/VLM post-training pipelines.
Paper: arXiv:2606.30626
Code: Not yet released