LLM / Reasoning / World Model
Orca: The World is in Your Mind
Yihao Wang, Yuheng Ji, Mingyu Cao, Yanqing Shen, Runze Xiao, Huaihai Lyu, Senwei Xie, Euan Liu, Klara Tian, Tianfeng Long, Yichi Zhang, Zhengliang Cai, Ruike Chen, Jifan Zhao, Ruochuan Shi, Zihan Tang, Jing Lyu, Wenxing Tan, Ningbo Zhang, Yangtao Hu, Yuming Gao, Xiansheng Chen, Junkai Zhao, Congsheng Xu, Boan Zhu, Ziqi Wang, Yupu Feng, Qiongqiong Zhang, Yingli Zhao, Yulong Ao, Shaoxuan Xie, You Liu, Guocai Yao, Leiduo Zhang, Xiaodan Liu, Yunyan Zhang, Yance Jiao, Xinyan Yang, Jiaxing Wei, Xu Liu, Tengfei Pan, Shaokai Nie, Chunlei Men, Sen Cui, Xiaojie Jin, Hongyang Li, Jianlan Luo, Yao Mu, Yunchao Wei, Jun Yan, Hang Zhao, Xiaolong Zheng, Jiaming Li, Yonghua Lin, Tiejun Huang, Zhongyuan Wang, Pengwei Wang
What the paper is about
Orca proposes a paradigm shift: instead of optimizing next-token, next-frame, or next-action prediction in isolation, learn a unified world latent space through Next-State-Prediction modeling. After pre-training, the backbone is frozen and lightweight modality-specific decoders are trained to read out text, images, and actions. The core hypothesis is that a stronger world latent produces stronger downstream performance - and the experiments bear this out.
"Rather than being purpose-built for isolated downstream tasks such as question answering, visual frame prediction, or action generation, Orca adopts a fundamentally different modeling paradigm: It first learns an internal representation of world states from multimodal world signals, and subsequently exposes this representation via a suite of dedicated readout interfaces."
Key distinction from prior work: Existing models optimize isolated objectives (LLMs for text, diffusion models for images, policies for actions). Orca treats all of these as readouts from a shared world latent, not as the learning objective itself.
Key contributions
- Next-State-Prediction paradigm - A unified state-transition modeling framework that replaces task-specific objectives (next-token, next-frame, next-action) with a single latent-state prediction objective.
- Two complementary learning paradigms:
  - Unconscious learning - Dense, natural state transitions from continuous video without labels (observation-only).
  - Conscious learning - Sparse, meaningful state transitions guided by language-described events and VQA supervision.
- Large-scale pre-training inventory - 125K hours of video data and 160M event annotations spanning ego-centric interaction, exo-centric manipulation, action-free robot execution, and natural dynamics.
- Frozen backbone + lightweight readout design - Demonstrates that a single frozen world latent can support text generation, image prediction, and embodied action generation simultaneously, with only small task-specific modules trainable downstream.
- Emergent action capability without action labels - Pre-training on video alone (no action-labeled robot data) transfers to real-robot action generation, suggesting world models can mitigate the robot data scarcity problem.
Methodology highlights
Architecture
Orca uses a pre-trained VLM (Qwen3.5) as its backbone. The encoder learns a world latent through two state-transition pathways, complemented by a VQA objective:
- Unconscious (observation-only): Given frame $v_t$, predict the latent $\hat{v}^l_{t+1}$ of the next frame. Continuous video provides dense temporal supervision for dynamics like motion, occlusion, and scene changes.
- Conscious (event-conditioned): Given frame $v_t$, an event description $e_{t+\Delta}$, and a query, predict the latent $\hat{v}^l_{t+\Delta}$ of a frame in the adjacent event. Language describes what changes and why.
- VQA response generation: Standard next-token prediction on video-grounded question answering, providing common-sense and semantic grounding.
The total pre-training loss combines three terms:
$$\mathcal{L} = \lambda_{\text{obs}}\mathcal{L}_{\text{obs}} + \lambda_{\text{evt}}\mathcal{L}_{\text{evt}} + \lambda_{\text{vqa}}\mathcal{L}_{\text{vqa}}$$
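As a rough illustration of how the three terms combine, here is a minimal NumPy sketch. It assumes a mean-squared error for the two latent-prediction terms and a standard cross-entropy for the VQA term; the summary does not state which latent distance Orca uses, and every name, toy size, and weight value below is hypothetical.

```python
import numpy as np

def latent_mse(pred, target):
    """Mean squared error between latents (an assumed choice of distance)."""
    return float(np.mean((pred - target) ** 2))

def token_cross_entropy(logits, token_ids):
    """Mean next-token cross-entropy; logits has shape (T, vocab_size)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(token_ids)), token_ids].mean())

def total_loss(l_obs, l_evt, l_vqa, lam_obs=1.0, lam_evt=1.0, lam_vqa=1.0):
    """Weighted sum of the observation, event, and VQA terms."""
    return lam_obs * l_obs + lam_evt * l_evt + lam_vqa * l_vqa

rng = np.random.default_rng(0)
dim, steps, vocab = 16, 5, 32           # hypothetical toy sizes

next_latent = rng.normal(size=dim)       # stand-in for the latent of frame t+1
event_latent = rng.normal(size=dim)      # stand-in for the latent of frame t+Delta
answer_ids = rng.integers(0, vocab, size=steps)

# Noisy copies of the targets play the role of model predictions.
l_obs = latent_mse(next_latent + 0.1 * rng.normal(size=dim), next_latent)
l_evt = latent_mse(event_latent + 0.1 * rng.normal(size=dim), event_latent)
l_vqa = token_cross_entropy(rng.normal(size=(steps, vocab)), answer_ids)

print(total_loss(l_obs, l_evt, l_vqa, lam_obs=1.0, lam_evt=1.0, lam_vqa=0.5))
```

The weights simply trade off the three signals; setting one of them to zero corresponds to dropping that objective, as in the ablation later in this summary.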
Pre-training data
| Data type | Volume | Purpose |
|---|---|---|
| Video (ego-centric, exo-centric, robot execution, natural dynamics) | 125K hours (10% used) | Observation-only + event-conditioned state transitions |
| Event annotations (coarse + fine-grained) | 160M | Paired with language captions for conscious learning |
| VQA data | 11.5M | Commonsense and semantic grounding |
Infrastructure
Built on FlagScale with FSDP2, chunked cross-entropy loss, activation recomputation, and forward/backward communication scheduling. Achieves 4.4× throughput vs. StarVLA (from 0.66 to 2.91 Samples/Sec/GPU).
Downstream readout design
| Modality | Readout module | Trainable params | Key detail |
|---|---|---|---|
| Text | LM head (from backbone) | None (frozen) | Zero-shot from pre-trained backbone |
| Image | MLP adaptor + LoRA on frozen SD3.5 | Lightweight | Latent conditions the diffusion process |
| Action | MLP adaptor + DiT Action Expert (from scratch) | Action Expert only | Flow-matching loss, 200 trajectories/task |
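The flow-matching loss named in the Action row can be sketched as follows. This is a generic conditional flow-matching objective with a straight-line path from noise to the target action, not necessarily the exact formulation in Orca; the linear `linear_velocity` model is a hypothetical stand-in for the DiT Action Expert, and `cond` stands in for the adapted world latent.

```python
import numpy as np

def flow_matching_loss(velocity_fn, actions, cond, rng):
    """Flow-matching loss with a linear noise-to-action interpolation path."""
    noise = rng.normal(size=actions.shape)          # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))     # one time in [0, 1) per sample
    x_t = (1.0 - t) * noise + t * actions           # point on the straight path
    target_velocity = actions - noise               # d x_t / d t along that path
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean((pred - target_velocity) ** 2))

rng = np.random.default_rng(0)
batch, action_dim, latent_dim = 4, 7, 16            # hypothetical toy sizes
actions = rng.normal(size=(batch, action_dim))      # stand-in demonstration actions
cond = rng.normal(size=(batch, latent_dim))         # stand-in world-latent condition
weights = 0.01 * rng.normal(size=(action_dim + 1 + latent_dim, action_dim))

def linear_velocity(x_t, t, cond):
    """Toy velocity model: one linear layer over [x_t, t, cond]."""
    return np.concatenate([x_t, t, cond], axis=1) @ weights

loss = flow_matching_loss(linear_velocity, actions, cond, rng)
print(loss)
```

In a setup like the one in the table, only the parameters of the velocity model would receive gradients, while the module producing `cond` stays frozen.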
Results
Scaling behavior
Loss curves show monotonic decrease with model size (0.8B → 4B) and data volume, confirming the paradigm is scalable and has not saturated.
Answer 1.1: Orca's learning paradigm is effective and scalable as the model size and data increase.
Answer 1.2: A stronger world latent from pre-training leads to stronger downstream readouts - text, image, and action all improve as pre-training scales, despite the backbone being frozen.
Text generation (zero-shot on OOD benchmarks)
| Model | Size (B) | MVBench | TemporalBench | 3DSRBench | SWITCH | Avg. |
|---|---|---|---|---|---|---|
| Qwen3.5 | 4 | 67.1 | 25.2 | 48.1 | 42.8 | 46.7 |
| Orca | 4 | 65.3 | 34.2 | 52.1 | 55.6 | 51.8 |
On average, Orca-4B outperforms all baselines of comparable size, including specialized VLMs and world models, though it trails Qwen3.5-4B on MVBench (65.3 vs. 67.1). It is particularly strong in State Transition (+12.27% over Qwen3.5-4B) and Dynamic Motion (+8.52%).
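The per-benchmark gaps between the two rows of the table above can be recomputed directly. This small sketch uses only the eight scores listed there; the dictionary names are illustrative.

```python
# Scores copied from the text-generation table above.
qwen = {"MVBench": 67.1, "TemporalBench": 25.2, "3DSRBench": 48.1, "SWITCH": 42.8}
orca = {"MVBench": 65.3, "TemporalBench": 34.2, "3DSRBench": 52.1, "SWITCH": 55.6}

# Orca minus Qwen3.5, rounded to the table's one-decimal precision.
deltas = {name: round(orca[name] - qwen[name], 1) for name in qwen}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

The gains concentrate on TemporalBench and SWITCH, while MVBench shows a small regression.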
Image prediction (PRICE-V0.1 benchmark)
| Model | Size (B) | Avg. score |
|---|---|---|
| OmniGen2 | 3+4 | 39.6 ± 10.2 |
| FLUX.1-Kontext | 12 | 40.9 ± 13.5 |
| FLUX.2 [klein] | 4+4 | 56.1 ± 18.1 |
| Orca | 4+2 | 59.8 ± 10.9 |
Orca achieves the highest average, and its standard deviation (10.9) is well below that of the next-best model, FLUX.2 [klein] (18.1), and close to the lowest in the table (OmniGen2, 10.2), indicating more consistent and physically grounded prediction. Baselines suffer from object hallucination, teleportation, and irrelevant artifacts; Orca preserves scene, morphology, and contact consistency.
Action generation (real-robot OOD)
| Model | Env OOD (Rule-based) | Obj OOD (Rule-based) | Overall SR | Overall M25 | Overall M50 |
|---|---|---|---|---|---|
| V-JEPA 2.1 w/ AE | 15.2 | 18.8 | 0% | 27 | 7 |
| Qwen3.5 w/ AE | 12.4 | 8.6 | 0% | 18 | 5 |
| $\pi_{0.5}$ | 27.6 | 31.2 | 5% | 54 | 14 |
| Orca | 36.6 | 28.2 | 6% | 55 | 14 |
Orca produces action trajectories that move further, get stuck less often, and recover more effectively after mistakes.
Key qualitative finding: Orca recovers from repeated grasp failures (e.g., spoon-grasp in "Scoop Sugar") while $\pi_{0.5}$ remains stuck in repeated failed attempts.
Ablation: contribution of each pre-training objective
| Objectives enabled (among $\mathcal{L}_{\text{obs}}$, $\mathcal{L}_{\text{evt}}$, $\mathcal{L}_{\text{vqa}}$) | Text | Image | Action | Avg. |
|---|---|---|---|---|
| ✓ (one of three) | 48.4 | - | 10.2 | 29.3 |
| ✓ ✓ (two of three) | 58.2 | - | 30.9 | 44.6 |
| ✓ ✓ (two of three) | 50.5 | - | 32.6 | 41.6 |
| ✓ ✓ (two of three) | 50.1 | 54.7 | 23.0 | 42.6 |
| ✓ ✓ ✓ (all three) | 51.8 | 59.8 | 32.4 | 48.0 |
- Observation-only ($\mathcal{L}_{\text{obs}}$) is especially important for action - dense temporal dynamics from video transfer to robot control.
- Event-conditioned ($\mathcal{L}_{\text{evt}}$) is critical for image prediction - language-described events align semantic conditions with visual state changes.
- VQA ($\mathcal{L}_{\text{vqa}}$) preserves the language interface and provides common-sense grounding.
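The Avg. column of the ablation table appears to be the mean over whichever readouts were evaluated in each row. That reading is an assumption, as is half-up rounding to one decimal; the sketch below checks it against the reported numbers using exact decimal arithmetic. The row tuples and function name are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# (text, image, action, reported average); None marks a "-" cell in the table.
rows = [
    ("48.4", None, "10.2", "29.3"),
    ("58.2", None, "30.9", "44.6"),
    ("50.5", None, "32.6", "41.6"),
    ("50.1", "54.7", "23.0", "42.6"),
    ("51.8", "59.8", "32.4", "48.0"),
]

def mean_of_reported(scores):
    """Mean over the scores that are present, rounded half-up to one decimal."""
    values = [Decimal(s) for s in scores if s is not None]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

for text, image, action, reported in rows:
    computed = mean_of_reported((text, image, action))
    print(computed, reported, computed == Decimal(reported))
```

Under that assumption every reported average is reproduced, which also means rows without an image score are averaged over two readouts rather than three and are not directly comparable to the last two rows.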
Why it matters
Orca represents a serious attempt at a general world foundation model - a model that learns how the world works internally rather than being optimized for any single task. The frozen-backbone + lightweight-readout design is intellectually honest: if the latent is truly good, decoders should be small and task-specific.
Broader implications:
- Unified intelligence architecture: The paper makes the case that next-token, next-frame, and next-action prediction are all special cases of next-state prediction. This could consolidate the currently fragmented landscape of LLMs, diffusion models, and robot policies into a single paradigm.
- Robot data scarcity solution: Action capability emerged from video-only pre-training without any action labels. If this transfers at scale, it could unlock generalist robot policies without requiring billions of robot trajectories.
- Limitations the authors openly acknowledge:
  - Only vision and language signals used so far - future versions should incorporate audio, tactile, force, proprioception, and scientific sensor data.
  - Supervision happens in ViT latent space (a pre-trained frozen vision encoder), not a native unified latent space - the "true" world latent should be learned directly from raw signals.
Bottom line: Orca is not just another multi-modal model - it's a first-principles rethinking of what a foundation model should optimize. If the Next-State-Prediction paradigm scales (and the scaling curves suggest it will), this could be the architectural template for the next generation of generally intelligent systems.