Orca: The World is in Your Mind

What the paper is about

Orca proposes a paradigm shift: instead of optimizing next-token, next-frame, or next-action prediction in isolation, learn a unified world latent space through Next-State-Prediction modeling. After pre-training, the backbone is frozen and lightweight modality-specific decoders are trained to read out text, images, and actions. The core hypothesis is that a stronger world latent produces stronger downstream performance - and the experiments bear this out.

"Rather than being purpose-built for isolated downstream tasks such as question answering, visual frame prediction, or action generation, Orca adopts a fundamentally different modeling paradigm: It first learns an internal representation of world states from multimodal world signals, and subsequently exposes this representation via a suite of dedicated readout interfaces."

Key distinction from prior work: Existing models optimize isolated objectives (LLMs for text, diffusion models for images, policies for actions). Orca treats all of these as readouts from a shared world latent, not as the learning objective itself.

Key contributions

Next-State-Prediction paradigm - A unified state-transition modeling framework that replaces task-specific objectives (next-token, next-frame, next-action) with a single latent-state prediction objective.
Two complementary learning paradigms:
- Unconscious learning - Dense, natural state transitions from continuous video without labels (observation-only).
- Conscious learning - Sparse, meaningful state transitions guided by language-described events and VQA supervision.
Large-scale pre-training inventory - 125K hours of video data and 160M event annotations spanning ego-centric interaction, exo-centric manipulation, action-free robot execution, and natural dynamics.
Frozen backbone + lightweight readout design - Demonstrates that a single frozen world latent can support text generation, image prediction, and embodied action generation simultaneously, with only small task-specific modules trainable downstream.
Emergent action capability without action labels - Pre-training on video alone (no robot data) transfers to real-robot action generation, suggesting world models can mitigate the robot data scarcity problem.

Methodology highlights

Architecture

Orca uses a pre-trained VLM (Qwen3.5) as its backbone. The encoder learns a world latent through two state-transition pathways, complemented by a VQA objective:

Unconscious (observation-only): Given frame $v_t$, predict the latent $\hat{v}^l_{t+1}$ of the next frame. Continuous video provides dense temporal supervision for dynamics like motion, occlusion, and scene changes.
Conscious (event-conditioned): Given frame $v_t$, an event description $e_{t+\Delta}$, and a query, predict the latent $\hat{v}^l_{t+\Delta}$ of a frame in the adjacent event. Language describes what changes and why.
VQA response generation: Standard next-token prediction on video-grounded question answering, providing common-sense and semantic grounding.

The total pre-training loss combines three terms:

$$\mathcal{L} = \lambda_{\text{obs}}\mathcal{L}_{\text{obs}} + \lambda_{\text{evt}}\mathcal{L}_{\text{evt}} + \lambda_{\text{vqa}}\mathcal{L}_{\text{vqa}}$$
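
As a rough illustration of how the weighted sum above combines the three terms, here is a minimal Python sketch. The function names, the weight values, and the use of mean squared error for the two latent-prediction terms are assumptions made for this example; the summary does not state which distance Orca uses in latent space or how the weights are set.

```python
def mse(pred, target):
    """Mean squared error between two equal-length latent vectors (assumed loss form)."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(l_obs, l_evt, l_vqa, lam_obs=1.0, lam_evt=1.0, lam_vqa=1.0):
    """Weighted sum of the observation-only, event-conditioned, and VQA terms."""
    return lam_obs * l_obs + lam_evt * l_evt + lam_vqa * l_vqa


# Toy latents with made-up numbers, for illustration only.
pred_next, true_next = [0.0, 1.0], [0.0, 3.0]      # observation-only pathway
pred_event, true_event = [1.0, 1.0], [1.0, 1.0]    # event-conditioned pathway

l_obs = mse(pred_next, true_next)     # (0 + 4) / 2 = 2.0
l_evt = mse(pred_event, true_event)   # 0.0
l_vqa = 0.5                           # stand-in for a token cross-entropy value

# Hypothetical weights; the actual lambda values are not given in this summary.
loss = total_loss(l_obs, l_evt, l_vqa, lam_obs=1.0, lam_evt=2.0, lam_vqa=0.5)
print(loss)  # 2.25
```

Keeping the three terms as separate scalars until the final sum makes it easy to log each one and to drop a term, which is how the ablation over objectives later in this summary is organized.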

Pre-training data

Data type	Volume	Purpose
Video (ego-centric, exo-centric, robot execution, natural dynamics)	125K hours (10% used)	Observation-only + event-conditioned state transitions
Event annotations (coarse + fine-grained)	160M	Paired with language captions for conscious learning
VQA data	11.5M	Commonsense and semantic grounding

Infrastructure

Built on FlagScale with FSDP2, chunked cross-entropy loss, activation recomputation, and forward/backward communication scheduling. Achieves 4.4× throughput vs. StarVLA (from 0.66 to 2.91 Samples/Sec/GPU).
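
The reported speedup follows directly from the two throughput figures; a quick check, using only the numbers quoted above:

```python
# Throughput figures quoted above, in samples/sec/GPU.
starvla_throughput = 0.66
orca_throughput = 2.91

speedup = orca_throughput / starvla_throughput
print(round(speedup, 1))  # 4.4
```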

Downstream readout design

Modality	Readout module	Trainable params	Key detail
Text	LM head (from backbone)	None (frozen)	Zero-shot from pre-trained backbone
Image	MLP adaptor + LoRA on frozen SD3.5	Lightweight	Latent conditions the diffusion process
Action	MLP adaptor + DiT Action Expert (from scratch)	Action Expert only	Flow-matching loss, 200 trajectories/task
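
To make the flow-matching loss named in the action row more concrete, here is a minimal sketch of a common linear-path formulation: interpolate between a noise sample and the target action, and regress the predicted velocity onto their difference. The time convention (noise at t = 0, action at t = 1), the function names, and the toy numbers are assumptions for illustration; the summary does not specify which flow-matching variant the Action Expert uses.

```python
def flow_matching_target(action, noise, t):
    """Return the point on the straight noise-to-action path at time t,
    and the constant velocity along that path (the regression target)."""
    x_t = [(1 - t) * n + t * a for a, n in zip(action, noise)]
    v_target = [a - n for a, n in zip(action, noise)]
    return x_t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


# Toy 2-D action and noise sample (made-up values).
action = [1.0, -1.0]
noise = [0.0, 1.0]

x_t, v_target = flow_matching_target(action, noise, t=0.5)
print(x_t)       # [0.5, 0.0]
print(v_target)  # [1.0, -2.0]

# A perfect velocity prediction gives zero loss; predicting zeros does not.
print(flow_matching_loss(v_target, v_target))    # 0.0
print(flow_matching_loss([0.0, 0.0], v_target))  # 2.5
```

In a real setup the predicted velocity would come from the action expert, conditioned on `x_t`, `t`, and the adapted world latent; here it is passed in directly to keep the sketch self-contained.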

Results

Scaling behavior

Loss curves decrease monotonically with model size (0.8B → 4B) and data volume, suggesting the paradigm scales and has not yet saturated within the tested range.

Answer 1.1: Orca's learning paradigm is effective and scalable as the model size and data increase.

Answer 1.2: Stronger world latent from pre-training leads to stronger downstream readouts - text, image, and action all improve as pre-training scales, despite the backbone being frozen.

Text generation (zero-shot on OOD benchmarks)

Model	Size (B)	MVBench	TemporalBench	3DSRBench	SWITCH	Avg.
Qwen3.5	4	67.1	25.2	48.1	42.8	46.7
Orca	4	65.3	34.2	52.1	55.6	51.8

On average, Orca-4B outperforms all baselines of comparable size, including specialized VLMs and world models, with particular strength in State Transition (+12.27% over Qwen3.5-4B) and Dynamic Motion (+8.52%). In the table above it trails Qwen3.5-4B only on MVBench (65.3 vs. 67.1).
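
The per-benchmark gaps between the two rows of the table can be computed directly. This sketch uses only the scores listed above; the variable names are illustrative.

```python
benchmarks = ["MVBench", "TemporalBench", "3DSRBench", "SWITCH"]
qwen_scores = [67.1, 25.2, 48.1, 42.8]   # Qwen3.5, 4B
orca_scores = [65.3, 34.2, 52.1, 55.6]   # Orca, 4B

# Orca minus Qwen3.5 per benchmark, rounded to one decimal place.
deltas = {
    name: round(o - q, 1)
    for name, q, o in zip(benchmarks, qwen_scores, orca_scores)
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
# MVBench: -1.8
# TemporalBench: +9.0
# 3DSRBench: +4.0
# SWITCH: +12.8
```

The largest gains are on TemporalBench and SWITCH, while MVBench is the one benchmark where the frozen Orca backbone scores slightly below its Qwen3.5 starting point.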

Image prediction (PRICE-V0.1 benchmark)

Model	Size (B)	Avg. score
OmniGen2	3+4	39.6 ± 10.2
FLUX.1-Kontext	12	40.9 ± 13.5
FLUX.2 [klein]	4+4	56.1 ± 18.1
Orca	4+2	59.8 ± 10.9

Orca achieves the highest average, and its standard deviation (10.9) is well below that of the other strong baselines (13.5 and 18.1) and close to the lowest in the table (OmniGen2, 10.2), indicating more consistent and physically grounded prediction. Baselines suffer from object hallucination, teleportation, and irrelevant artifacts; Orca preserves scene, morphology, and contact consistency.

Action generation (real-robot OOD)

Model	Env OOD (Rule-based)	Obj OOD (Rule-based)	Overall SR	Overall M25	Overall M50
V-JEPA 2.1 w/ AE	15.2	18.8	0%	27	7
Qwen3.5 w/ AE	12.4	8.6	0%	18	5
$\pi_{0.5}$	27.6	31.2	5%	54	14
Orca	36.6	28.2	6%	55	14

Orca produces action trajectories that move further, get stuck less often, and recover more effectively after mistakes.

Key qualitative finding: Orca recovers from repeated grasp failures (e.g., spoon-grasp in "Scoop Sugar") while $\pi_{0.5}$ remains stuck in repeated failed attempts.

Ablation: contribution of each pre-training objective

$\mathcal{L}_{\text{obs}}$	$\mathcal{L}_{\text{evt}}$	$\mathcal{L}_{\text{vqa}}$	Text	Image	Action	Avg.
✓			48.4	-	10.2	29.3
✓	✓		58.2	-	30.9	44.6
	✓	✓	50.5	-	32.6	41.6
✓		✓	50.1	54.7	23.0	42.6
✓	✓	✓	51.8	59.8	32.4	48.0

Observation-only ($\mathcal{L}_{\text{obs}}$) is especially important for action - dense temporal dynamics from video transfer to robot control.
Event-conditioned ($\mathcal{L}_{\text{evt}}$) is critical for image prediction - language-described events align semantic conditions with visual state changes.
VQA ($\mathcal{L}_{\text{vqa}}$) preserves the language interface and provides common-sense grounding.
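
One detail worth keeping in mind when reading the ablation table: the Avg. column appears to be the mean over whichever readouts were reported for that row, so rows without an Image score average two numbers rather than three. The sketch below checks this reading against the table's own values; the data layout and the rounding tolerance are choices made for this example.

```python
# (objectives enabled, text, image, action, reported avg); None = not reported.
rows = [
    ("obs",         48.4, None, 10.2, 29.3),
    ("obs+evt",     58.2, None, 30.9, 44.6),
    ("evt+vqa",     50.5, None, 32.6, 41.6),
    ("obs+vqa",     50.1, 54.7, 23.0, 42.6),
    ("obs+evt+vqa", 51.8, 59.8, 32.4, 48.0),
]


def mean_available(scores):
    """Average only the scores that were actually reported."""
    present = [s for s in scores if s is not None]
    return sum(present) / len(present)


checks = {}
for name, text, image, action, reported in rows:
    computed = mean_available([text, image, action])
    # Allow for the table's rounding to one decimal place.
    checks[name] = abs(computed - reported) < 0.06
    print(f"{name}: computed {computed:.2f}, reported {reported}")

print(all(checks.values()))  # True
```

Because the first three rows average over Text and Action only, their Avg. values are not directly comparable to the last two rows, which also include Image.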

Why it matters

Orca represents a serious attempt at a general world foundation model - a model that learns how the world works internally rather than being optimized for any single task. The frozen-backbone + lightweight-readout design is intellectually honest: if the latent is truly good, decoders should be small and task-specific.

Broader implications:

Unified intelligence architecture: The paper makes the case that next-token, next-frame, and next-action prediction are all special cases of next-state prediction. This could consolidate the currently fragmented landscape of LLMs, diffusion models, and robot policies into a single paradigm.
Robot data scarcity solution: Action capability emerged from video-only pre-training without any action labels. If this transfers at scale, it could unlock generalist robot policies without requiring billions of robot trajectories.
Limitations the authors openly acknowledge:
1. Only vision and language signals used so far - future versions should incorporate audio, tactile, force, proprioception, and scientific sensor data.
2. Supervision happens in ViT latent space (a pre-trained frozen vision encoder), not a native unified latent space - the "true" world latent should be learned directly from raw signals.

Bottom line: Orca is not just another multi-modal model - it's a first-principles rethinking of what a foundation model should optimize. If the Next-State-Prediction paradigm scales (and the scaling curves suggest it will), this could be the architectural template for the next generation of generally intelligent systems.