GEAR: Guided End-to-End AutoRegression for Image Synthesis
Bin Lin, Zheyuan Liu, Chenguo Lin, Sixiang Chen, Yunyang Ge, Yunlong Lin, Jianwei Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Li Yuan · Tencent Hunyuan
Tencent Hunyuan | Peking University
arXiv: 2606.32039 - Submitted June 30, 2026
The Problem
Visual generative models are typically trained in two stages: (1) a tokenizer (e.g., a VQ-VAE) is trained for reconstruction and then frozen; (2) a generator (an AR transformer or a diffusion model) is trained on its latents. The recipe is clean but suboptimal: the tokenizer never learns whether the latent space it produces is easy for the downstream generator to model. Reconstruction favors high-variance detail, while generation favors simple, predictable structure, so the two goals are in tension.
The Key Insight
What if the AR model could teach the tokenizer which tokens are easy to predict - without collapsing the codebook?
The obstacle is fundamental: VQ index selection is a non-differentiable argmax, so gradients from the AR model cannot reach the tokenizer. The obvious fix, a straight-through estimator (STE), collapses (gFID ~105, worse than untrained).
GEAR's solution is a dual readout of the same codebook assignment matrix A:
| Branch | Assignment | Gradient Flow | Loss |
|---|---|---|---|
| Hard (one-hot) | argmax → discrete index | ✗ stops at argmax | NTP + REPA alignment (updates AR only) |
| Soft (temperature-scaled) | softmax(A/τ) → weighted embedding blend | ✓ differentiable through encoder | REPA alignment (updates tokenizer only) |
The next-token-prediction (NTP) loss never touches the tokenizer, because letting it do so would reward low-entropy collapse. Instead, only the representation-alignment (REPA) loss, which matches DINOv2 features, guides the tokenizer through the differentiable soft channel.
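To make the two readouts concrete, here is a minimal forward-pass sketch in plain Python. It assumes Euclidean-distance assignment scores and a separate AR input embedding table; the names `assignment_scores`, `hard_readout`, `soft_readout`, `embed_table`, and all toy values are hypothetical and not taken from the paper. It only illustrates how one score row yields both a discrete index and a temperature-scaled blend. It does not implement training or gradient routing.

```python
import math

def assignment_scores(z, codebook):
    """One row of the assignment matrix A: negative squared Euclidean distance to each code."""
    return [-sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]

def hard_readout(scores, embed_table):
    """Hard branch: argmax index, then a plain embedding lookup.
    In an autograd framework the argmax would block gradients to the encoder."""
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, embed_table[index]

def soft_readout(scores, embed_table, tau):
    """Soft branch: softmax(scores / tau) weights, then a weighted blend of embeddings."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / tau) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embed_table[0])
    blend = [sum(w * row[d] for w, row in zip(weights, embed_table)) for d in range(dim)]
    return weights, blend

# Toy values (hypothetical): 3 codes in a 2-D latent space, 2-D AR input embeddings.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
embed_table = [[0.5, 0.5], [1.0, -1.0], [-1.0, 1.0]]
z = [0.9, 0.2]

scores = assignment_scores(z, codebook)
index, hard_embed = hard_readout(scores, embed_table)
weights, soft_embed = soft_readout(scores, embed_table, tau=0.5)
_, sharp_embed = soft_readout(scores, embed_table, tau=0.01)

print("hard index:", index)
print("soft weights:", [round(w, 3) for w in weights])
print("soft blend:", [round(v, 3) for v in soft_embed])
print("blend at tau=0.01:", [round(v, 3) for v in sharp_embed])
```

As the temperature shrinks, the soft blend approaches the hard lookup, which is why both branches can be read as two views of the same assignment.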
Surprising Result: Opposite of Diffusion
Diffusion-side recipes (REPA-E, VA-VAE, MAETok) make the latent more DINOv2-like → more semantic.
GEAR does the reverse:
| Component | DINOv2 similarity | Change |
|---|---|---|
| Tokenizer (pre-quant) | CKA image: 0.186 → 0.181 | ↓ |
| Tokenizer (patch) | CKA patch: 0.173 → 0.107 | ↓↓↓ |
| AR hidden states | Per-patch DINOv2 alignment | ↑↑↑ |
The tokenizer becomes less semantic: it reorganizes its index distribution toward lower-entropy, more predictable tokens without sacrificing reconstruction. The AR model picks up the semantic slack, developing strong patch-level DINOv2 alignment in its hidden states. In short, GEAR shifts the alignment burden from the tokenizer to the AR generator.
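For readers unfamiliar with CKA, the sketch below computes linear CKA between two feature matrices in plain Python. The summary does not say which CKA variant or feature pooling was used, so the linear form here is an assumption. The helper names and toy matrices are hypothetical. Rows are samples (one per image or one per patch, depending on the granularity being compared).

```python
import math

def center(X):
    """Subtract the per-column mean so every feature dimension has zero mean."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y, computed column pair by column pair."""
    total = 0.0
    for x_col in zip(*X):
        for y_col in zip(*Y):
            total += sum(a * b for a, b in zip(x_col, y_col)) ** 2
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    X, Y = center(X), center(Y)
    denom = math.sqrt(cross_frobenius_sq(X, X) * cross_frobenius_sq(Y, Y))
    return cross_frobenius_sq(X, Y) / denom

# Toy features (hypothetical): 4 samples with 2-D features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# Same features rotated by 90 degrees and scaled by 2: CKA should stay at 1.
Y_rot = [[-2.0 * b, 2.0 * a] for a, b in X]
# A 1-D feature that is uncorrelated with both columns of X.
Z = [[1.0], [-1.0], [1.0], [-1.0]]

cka_same = linear_cka(X, Y_rot)
cka_unrelated = linear_cka(X, Z)
print("rotated copy:", round(cka_same, 6))
print("unrelated feature:", round(cka_unrelated, 6))
```

Because linear CKA is invariant to rotation and isotropic scaling, a drop in the score reflects a genuine change in representational geometry rather than a re-parameterization of the latents.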
Results
ImageNet 256×256 (Class-Conditional)
| Model | Params | Epochs | gFID (w/o CFG) | gFID (w/ CFG) |
|---|---|---|---|---|
| LlamaGen | 111M | 300 | 26.26 | 8.73 |
| LlamaGen-REPA | 111M | 300 | 20.16 | 6.00 |
| GEAR | 111M | 300 | 16.96 | 4.95 |
| LlamaGen | 775M | 300 | 15.54∗ | 3.47∗ |
| LlamaGen-REPA | 775M | 300 | 8.20 | 2.68 |
| GEAR | 775M | 300 | 6.76 | 2.52 |
- Up to 10× faster gFID convergence vs LlamaGen-REPA
- Consistent gains across B/L/XL model scales
- GEAR at 300 epochs outperforms LlamaGen-REPA at 800 epochs
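The gap to LlamaGen-REPA in the table above can also be read as a relative gFID reduction. The short calculation below uses only the numbers from that table. The `relative_reduction` helper and the dictionary layout are illustrative choices.

```python
# gFID values copied from the ImageNet 256x256 table: (without CFG, with CFG).
results = {
    "111M": {"LlamaGen-REPA": (20.16, 6.00), "GEAR": (16.96, 4.95)},
    "775M": {"LlamaGen-REPA": (8.20, 2.68), "GEAR": (6.76, 2.52)},
}

def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline` (lower gFID is better)."""
    return (baseline - improved) / baseline

reductions = {}
for size, rows in results.items():
    base_no_cfg, base_cfg = rows["LlamaGen-REPA"]
    gear_no_cfg, gear_cfg = rows["GEAR"]
    reductions[size] = (
        relative_reduction(base_no_cfg, gear_no_cfg),
        relative_reduction(base_cfg, gear_cfg),
    )
    no_cfg, cfg = reductions[size]
    print(f"{size}: {no_cfg:.1%} lower gFID without CFG, {cfg:.1%} lower with CFG")
```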
Text-to-Image (GPIC Benchmark)
| Model | Steps | FDD (w/ CFG) |
|---|---|---|
| LlamaGen-REPA | 100k | 198.6 |
| GEAR | 100k | 177.4 |
| LlamaGen-REPA | 390k | 127.9 |
| GEAR | 390k | 115.3 |
The end-to-end-tuned tokenizer also transfers on its own: when it is frozen and a fresh AR model is trained on top, the tokenizer alone delivers 2.5× faster NTP convergence and 11.1× faster REPA alignment convergence.
Why It Works
- Predictability over semantics: The tokenizer learns to emit token sequences the AR can predict causally - lower-entropy codebook usage, more spatially coherent structure.
- Patch-level structure > global alignment: GEAR produces AR features with strong per-patch DINOv2 similarity and spatial coherence, where prior methods only achieved global image-level alignment.
- Clean optimization: NTP never touches the tokenizer, avoiding the codebook collapse trap. The soft guidance branch provides gradient without interference.
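One simple way to quantify "lower-entropy codebook usage" is the Shannon entropy of the token-index histogram, together with the corresponding perplexity (the effective number of codes in use). The sketch below is an illustrative diagnostic, not the paper's measurement protocol. The function names and toy index sequences are hypothetical. It measures marginal usage only and ignores the sequential context an AR model conditions on.

```python
import math
from collections import Counter

def index_entropy_bits(indices):
    """Shannon entropy (in bits) of the empirical distribution over token indices."""
    counts = Counter(indices)
    n = len(indices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def usage_perplexity(indices):
    """Effective number of codes in use: 2 ** entropy."""
    return 2 ** index_entropy_bits(indices)

# Toy index sequences over a 4-entry codebook (hypothetical values).
uniform = [0, 1, 2, 3] * 4         # every code used equally often
skewed = [0] * 13 + [1, 2, 3]      # lower entropy, but all codes still in use
collapsed = [0] * 16               # full collapse: a single code

for name, seq in [("uniform", uniform), ("skewed", skewed), ("collapsed", collapsed)]:
    print(name, round(index_entropy_bits(seq), 3), round(usage_perplexity(seq), 3))
```

The middle case is the regime of interest: entropy drops well below the uniform maximum while every code remains in use, which is different from the collapse that the NTP loss would encourage if it reached the tokenizer.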
Generalization
GEAR works across three quantizer families:
- VQVAE (Euclidean distance assignment)
- LFQ (Lookup-Free Quantization)
- IBQ (Index Backpropagation Quantization)
It also transfers from class-conditional ImageNet to text-to-image generation.
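To illustrate how two of these quantizer families differ, the sketch below contrasts nearest-neighbor assignment under Euclidean distance with the sign-based binarization commonly used for lookup-free quantization. It is a generic illustration under those assumptions, not GEAR's implementation. IBQ is omitted, and the function names and toy latent are hypothetical.

```python
from itertools import product

def vq_quantize(z, codebook):
    """VQ-VAE-style assignment: pick the code with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
    index = min(range(len(dists)), key=dists.__getitem__)
    return codebook[index], index

def lfq_quantize(z):
    """LFQ-style assignment: binarize each latent dimension by sign.
    No learned codebook lookup is needed; the bit pattern itself is the index."""
    bits = [1 if v > 0 else 0 for v in z]
    code = [1.0 if b else -1.0 for b in bits]
    index = sum(b << i for i, b in enumerate(bits))  # dimension i contributes 2**i
    return code, index

z = [0.3, -1.2, 0.8]  # toy 3-D latent (hypothetical)

lfq_code, lfq_index = lfq_quantize(z)

# The implicit LFQ codebook is every corner of the {-1, +1}^d hypercube.
implicit_codebook = [list(map(float, corner)) for corner in product([-1, 1], repeat=len(z))]
vq_code, _ = vq_quantize(z, implicit_codebook)

print("LFQ code:", lfq_code, "index:", lfq_index)
print("nearest hypercube corner:", vq_code)
```

Sign binarization picks the same code as a nearest-neighbor search over the hypercube corners, so both families can be viewed as producing per-code assignment scores, which is the quantity a dual readout operates on.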
Significance
GEAR closes the loop on the two-stage training paradigm for discrete visual generation. It shows that:
- End-to-end training is possible for VQ-AR pipelines without collapse
- The optimal tokenizer for AR generation looks different from the optimal tokenizer for diffusion
- Representation alignment can be relocated from the input space to the generator's hidden states
This mirrors the historic shift in object detection (R-CNN → DETR) and suggests the same end-to-end unification may define the next generation of visual generative models.
Architecture Diagram (Text)
Image → Encoder → Latents Z → [Assignment Matrix A]
├──→ argmax → one-hot indices → AR Embedding → Hard Branch → NTP + REPA (updates AR)
└──→ softmax(A/τ) → weighted embed → Soft Branch → REPA (updates Encoder via ∇)
│
VQ Loss ← Decoder → Reconstruction
The soft branch is truncated at alignment depth (no full AR forward pass), adding negligible compute.
Key Contributions
- First stable end-to-end training for VQ-AR image generation
- Dual readout mechanism that decouples NTP (AR only) from guidance (tokenizer only)
- Representation relocation discovery - the tokenizer sheds semantics while the AR model gains them
- Practical speedup: up to 10× faster convergence, consistent across scales and tasks