GEAR: Guided End-to-End AutoRegression for Image Synthesis

Tencent Hunyuan | Peking University
arXiv: 2606.32039 - Submitted June 30, 2026

The Problem

Visual generative models are typically trained in two stages: (1) a tokenizer (e.g., a VQ-VAE) is trained for reconstruction, then frozen; (2) a generator (an AR transformer or a diffusion model) is trained on its latents. The pipeline is clean but suboptimal - the tokenizer gets no signal about whether the latent space it produces is easy for the downstream generator to model. Reconstruction favors high-variance detail; generation favors simple, predictable structure. These goals are in tension.

The Key Insight

What if the AR model could teach the tokenizer which tokens are easy to predict - without collapsing the codebook?

The obstacle is fundamental: VQ index selection is a non-differentiable argmax, so gradients from the AR model cannot reach the tokenizer. The obvious fix - a straight-through estimator (STE) - leads to collapse (gFID ~105, worse than untrained).

GEAR's solution is a dual readout of the same codebook assignment matrix:

Branch	Assignment	Gradient Flow	Loss
Hard (one-hot)	`argmax` → discrete index	✗ stops at argmax	NTP + REPA alignment (updates AR only)
Soft (temperature-scaled)	`softmax(A/τ)` → weighted embedding blend	✓ differentiable through encoder	REPA alignment (updates tokenizer only)

The NTP loss never touches the tokenizer - that would reward low-entropy collapse. Instead, only the representation-alignment loss (matching DINOv2 features) guides the tokenizer through the differentiable soft channel.
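
The two readouts can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's code: the toy assignment matrix `A`, the codebook values, and the helper names `hard_readout` and `soft_readout` are assumptions made for the example, and a real implementation would rely on an autodiff framework. A finite-difference nudge stands in for the gradient to show that the hard path is flat while the soft path responds.

```python
import math

def softmax(row, tau=1.0):
    """Numerically stable temperature-scaled softmax of one row of logits."""
    scaled = [a / tau for a in row]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_readout(A, codebook):
    """Hard branch: argmax index per token, then a plain embedding lookup."""
    indices = [max(range(len(row)), key=row.__getitem__) for row in A]
    return indices, [list(codebook[i]) for i in indices]

def soft_readout(A, codebook, tau):
    """Soft branch: softmax(A / tau) weights blend all codebook embeddings."""
    dim = len(codebook[0])
    blended = []
    for row in A:
        w = softmax(row, tau)
        blended.append(
            [sum(w[k] * codebook[k][d] for k in range(len(codebook))) for d in range(dim)]
        )
    return blended

# Toy assignment matrix: 2 tokens x 3 codes (hypothetical values).
A = [[2.0, 1.0, 0.1], [0.3, 0.2, 1.5]]
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

indices, hard = hard_readout(A, codebook)
soft = soft_readout(A, codebook, tau=0.5)

# Nudge one logit of token 0. The hard readout does not move (zero gradient
# almost everywhere); the soft readout does, so only the soft path can carry
# a training signal back to the encoder.
eps = 1e-3
A_nudged = [[2.0, 1.0 + eps, 0.1], [0.3, 0.2, 1.5]]
hard_delta = hard_readout(A_nudged, codebook)[1][0][1] - hard[0][1]
soft_delta = soft_readout(A_nudged, codebook, tau=0.5)[0][1] - soft[0][1]

# As tau shrinks, the soft blend approaches the hard lookup.
near_hard = soft_readout(A, codebook, tau=0.01)
```

In this toy setup both branches read the same `A`, so the soft blend stays tied to the indices the AR model actually consumes; the temperature controls how closely the two agree.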

Surprising Result: Opposite of Diffusion

Diffusion-side recipes (REPA-E, VA-VAE, MAETok) make the latent more DINOv2-like → more semantic.

GEAR does the reverse:

Component	Similarity to DINOv2	Change
Tokenizer (pre-quant)	CKA image: 0.186 → 0.181	↓
Tokenizer (patch)	CKA patch: 0.173 → 0.107	↓↓↓
AR hidden states	Per-patch DINOv2 alignment	↑↑↑

The tokenizer becomes less semantic - it reorganizes its index distribution toward lower-entropy, more predictable tokens (without sacrificing reconstruction). The AR model picks up the semantic slack, developing strong patch-level DINOv2 alignment in its hidden states. GEAR shifts the alignment burden from tokenizer → AR generator.
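
For readers unfamiliar with the metric, here is a small sketch of how a CKA score between two feature sets can be computed. It assumes the linear variant of CKA, which this summary does not confirm is the one used for the numbers above; the function names and toy matrices are illustrative only.

```python
import math

def center_columns(X):
    """Subtract each feature's mean over the n samples (rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y for row-major n x p and n x q matrices."""
    total = 0.0
    for xcol in zip(*X):
        for ycol in zip(*Y):
            dot = sum(a * b for a, b in zip(xcol, ycol))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between two feature sets computed on the same n inputs."""
    Xc, Yc = center_columns(X), center_columns(Y)
    cross = cross_frobenius_sq(Xc, Yc)
    norm = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return cross / norm

# Toy features for 4 inputs (hypothetical values).
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# Y is X rotated by 90 degrees and scaled by 3: same geometry, different basis.
Y = [[-3.0 * b, 3.0 * a] for a, b in X]
# Z is an unrelated feature set on the same 4 inputs.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

cka_same_geometry = linear_cka(X, Y)
cka_unrelated = linear_cka(X, Z)
```

Linear CKA is invariant to rotation and isotropic scaling, so `X` and `Y` score as identical while `Z` scores lower. Under this reading, a drop in tokenizer CKA means the latent geometry itself has drifted away from DINOv2's, not merely its basis.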

Results

ImageNet 256×256 (Class-Conditional)

Model	Params	Epochs	gFID (w/o CFG)	gFID (w/ CFG)
LlamaGen	111M	300	26.26	8.73
LlamaGen-REPA	111M	300	20.16	6.00
GEAR	111M	300	16.96	4.95
LlamaGen	775M	300	15.54∗	3.47∗
LlamaGen-REPA	775M	300	8.20	2.68
GEAR	775M	300	6.76	2.52

Up to 10× faster gFID convergence vs LlamaGen-REPA
Consistent gains across B/L/XL model scales
GEAR at 300 epochs outperforms LlamaGen-REPA at 800 epochs
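
To put the table in relative terms, the snippet below computes how much GEAR lowers gFID against LlamaGen-REPA at each model size. The numbers are copied from the table above; the helper name `relative_reduction` is just for this example.

```python
# gFID values copied from the ImageNet 256x256 table (300 epochs).
# Each tuple is (gFID w/o CFG, gFID w/ CFG).
results = {
    "111M": {"LlamaGen-REPA": (20.16, 6.00), "GEAR": (16.96, 4.95)},
    "775M": {"LlamaGen-REPA": (8.20, 2.68), "GEAR": (6.76, 2.52)},
}

def relative_reduction(baseline, ours):
    """Percentage by which `ours` lowers a lower-is-better metric."""
    return 100.0 * (baseline - ours) / baseline

summary = {}
for size, rows in results.items():
    base, gear = rows["LlamaGen-REPA"], rows["GEAR"]
    summary[size] = tuple(
        round(relative_reduction(b, g), 1) for b, g in zip(base, gear)
    )
    print(f"{size}: {summary[size][0]}% lower w/o CFG, {summary[size][1]}% lower w/ CFG")
```

By this arithmetic the reduction is roughly 16-18% without CFG at both sizes, while the gain with CFG narrows to about 6% at 775M.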

Text-to-Image (GPIC Benchmark)

Model	Steps	FDD (w/ CFG)
LlamaGen-REPA	100k	198.6
GEAR	100k	177.4
LlamaGen-REPA	390k	127.9
GEAR	390k	115.3

The end-to-end-tuned tokenizer also transfers on its own: when it is frozen and a fresh AR model is trained on top, the tokenizer alone delivers 2.5× faster NTP convergence and 11.1× faster REPA alignment convergence.

Why It Works

Predictability over semantics: The tokenizer learns to emit token sequences the AR can predict causally - lower-entropy codebook usage, more spatially coherent structure.
Patch-level structure > global alignment: GEAR produces AR features with strong per-patch DINOv2 similarity and spatial coherence, whereas prior methods achieved only global, image-level alignment.
Clean optimization: NTP never touches the tokenizer, avoiding the codebook collapse trap. The soft guidance branch provides gradient without interference.
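
"Lower-entropy codebook usage" can be made concrete by measuring the entropy of the token index histogram. The sketch below is one common way to do that, not necessarily the paper's exact diagnostic; the index sequences are made up for illustration.

```python
import math
from collections import Counter

def index_entropy_bits(indices):
    """Shannon entropy (bits) of the empirical distribution of token indices."""
    counts = Counter(indices)
    n = len(indices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def codes_in_use(indices):
    """Number of distinct codebook entries that appear at least once."""
    return len(set(indices))

uniform = [0, 1, 2, 3, 0, 1, 2, 3]    # every code used equally
skewed = [0, 0, 0, 0, 0, 1, 2, 3]     # one dominant code, none unused
collapsed = [0] * 8                    # degenerate: a single code survives

for name, seq in [("uniform", uniform), ("skewed", skewed), ("collapsed", collapsed)]:
    print(f"{name}: {index_entropy_bits(seq):.3f} bits, {codes_in_use(seq)} codes in use")
```

The skewed sequence has lower entropy than the uniform one yet still uses all four codes, whereas the collapsed one drops to zero entropy with a single code. Tracking both quantities is one way to tell a more predictable index distribution apart from collapse.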

Generalization

Works across three quantizer families:

VQ-VAE (Euclidean distance assignment)
LFQ (Lookup-Free Quantization)
IBQ (Index Backpropagation Quantization)

And transfers from class-conditional ImageNet to text-to-image generation.
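
To illustrate how different quantizer families can expose an assignment score per code, here is a sketch covering two of the three listed above. It assumes the assignment logits for VQ-VAE are negative squared Euclidean distances and treats LFQ as having an implicit codebook of sign patterns; how GEAR actually builds its assignment matrix for each family is not specified in this summary, and IBQ is left out. The MSB-first index convention is a choice made for this example.

```python
from itertools import product

def vq_logits(z, codebook):
    """VQ-VAE style: negative squared Euclidean distance to every code."""
    return [-sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook]

def lfq_codebook(dim):
    """LFQ's implicit codebook: all 2**dim sign patterns in {-1, +1}**dim."""
    return [list(bits) for bits in product((-1.0, 1.0), repeat=dim)]

def lfq_index(z):
    """Lookup-free index: read the sign pattern of z as a binary number (MSB first)."""
    idx = 0
    for zi in z:
        idx = (idx << 1) | (1 if zi > 0 else 0)
    return idx

def argmax(values):
    """Position of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Learned codebook (hypothetical values): the nearest code wins.
learned = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
vq_choice = argmax(vq_logits([0.9, 0.8], learned))

# LFQ: the sign pattern (+, -, +) reads as binary 101.
z = [0.7, -0.2, 1.3]
lfq_choice = lfq_index(z)
lfq_via_distance = argmax(vq_logits(z, lfq_codebook(3)))
```

Here `lfq_choice` and `lfq_via_distance` agree, because the nearest sign pattern to `z` is simply its element-wise sign. This suggests a distance-based score can serve both families, though enumerating 2**dim codes is only practical for small `dim`.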

Significance

GEAR closes the loop on the two-stage training paradigm for discrete visual generation. It shows that:

End-to-end training is possible for VQ-AR pipelines without collapse
The optimal tokenizer for AR generation looks different from the optimal tokenizer for diffusion
Representation alignment can be relocated from the input space to the generator's hidden states

This mirrors the historic shift in object detection (R-CNN → DETR) and suggests the same end-to-end unification may define the next generation of visual generative models.

Architecture Diagram (Text)

Image → Encoder → Latents Z → [Assignment Matrix A]
                                          ├──→ argmax → one-hot indices → AR Embedding → Hard Branch → NTP + REPA (updates AR)
                                          └──→ softmax(A/τ) → weighted embed → Soft Branch → REPA (updates Encoder via ∇)
                                                                                                          │
                                                                                                    VQ Loss ← Decoder → Reconstruction

The soft branch is truncated at alignment depth (no full AR forward pass), adding negligible compute.
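
The truncation can be pictured with a toy model. The sketch below is an assumption-laden illustration, not GEAR's implementation: `shift_block` stands in for a transformer block, the alignment loss is taken to be mean negative cosine similarity (a common form for REPA-style objectives), and the projection head that would map hidden states to the DINOv2 feature dimension is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def alignment_loss(hidden, targets):
    """Mean negative cosine similarity between per-patch features."""
    return -sum(cosine(h, t) for h, t in zip(hidden, targets)) / len(hidden)

def run_layers(layers, tokens, depth):
    """Apply only the first `depth` layers; also report how many were executed."""
    executed = layers[:depth]
    h = tokens
    for layer in executed:
        h = [layer(x) for x in h]
    return h, len(executed)

def shift_block(x):
    """Hypothetical stand-in for one transformer block."""
    return [xi + 0.5 for xi in x]

layers = [shift_block] * 4     # toy 4-block AR model
align_depth = 2                # hypothetical depth at which alignment is read out

soft_tokens = [[1.0, 0.0], [0.0, 1.0]]      # soft-branch embeddings, one per patch
dino_targets = [[2.0, 1.0], [1.0, 2.0]]     # stand-in for DINOv2 patch features

# Soft branch: stop at the alignment depth, then compute the guidance loss.
soft_hidden, soft_cost = run_layers(layers, soft_tokens, align_depth)
guidance = alignment_loss(soft_hidden, dino_targets)

# Hard branch: the full stack is needed for next-token prediction.
_, hard_cost = run_layers(layers, soft_tokens, len(layers))
```

In this toy the soft branch runs 2 of the 4 blocks and skips the prediction head entirely; the actual saving depends on where the alignment depth sits in the real model.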

Key Contributions

First stable end-to-end training for VQ-AR image generation
Dual readout mechanism that decouples NTP (AR only) from guidance (tokenizer only)
Representation relocation discovery - the tokenizer sheds semantics while the AR model gains them
Practical speedup: up to 10× faster gFID convergence than LlamaGen-REPA, with gains consistent across scales and tasks