Does VLA Even Know the Basics?

The Problem: Knowledge Blindness in Embodied Agents

Vision-Language-Action (VLA) models are built by fine-tuning powerful pretrained vision-language models (VLMs) on robotics data. The implicit assumption is that commonsense and world knowledge survive intact. This paper asks an uncomfortable question: do they actually?

The issue is that VLA benchmarks (LIBERO, CALVIN, VLABench) measure only manipulation success - whether the arm picked up the right object. A failure is therefore ambiguous: did the agent not know the cup is fragile, or did it just miss the grasp? Text-based probing of the VLM backbone does not settle it either, because it sidesteps the real question: can the VLA act on the knowledge?

Act2Answer: A Lightweight Embodied Protocol

The authors introduce Act2Answer, a protocol that converts VLM-style QA into tabletop robot episodes:

An instruction is given (e.g., "Place the cube on the sad face")
Two candidate images sit on plates in the scene
The agent must place a cube on the correct answer
Success is measured by Soft Success Rate (SR) - proximity to the correct plate within tolerance ε

This strips away long-horizon planning and complex manipulation, isolating the knowledge-to-action pathway.
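
To make the scoring rule concrete, here is a minimal sketch of how a proximity-based success rate could be computed. It assumes that success is a binary check on the final planar distance between the cube and the center of the correct plate; the paper's exact definition of Soft SR may differ. The function names, the episode dictionary keys, and the positions and ε value are all hypothetical.

```python
import math

def is_soft_success(cube_xy, target_xy, eps):
    """True if the cube ends up within eps of the correct plate's center."""
    return math.dist(cube_xy, target_xy) <= eps

def soft_success_rate(episodes, eps):
    """Fraction of episodes whose final cube position passes the eps check."""
    if not episodes:
        raise ValueError("no episodes to score")
    hits = sum(is_soft_success(ep["cube_xy"], ep["target_xy"], eps) for ep in episodes)
    return hits / len(episodes)

# Hypothetical final positions in metres (illustrative values, not from the paper).
episodes = [
    {"cube_xy": (0.10, 0.21), "target_xy": (0.10, 0.20)},   # about 1 cm off: success
    {"cube_xy": (0.10, -0.20), "target_xy": (0.10, 0.20)},  # on the other plate: failure
]
print(soft_success_rate(episodes, eps=0.05))  # 0.5
```

A tolerance-based check like this forgives small placement errors, so a low score is more likely to reflect a wrong choice than a sloppy drop.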

The Dataset

Property	Value
Total items	1,720 unique binary questions
Categories	12 (Color, Shape, Attribute, State, Emotion, Symmetry, Counting, Time, Celebrity, Living World, Traffic, Public Info)
Domains	Physical, Temporal, Quantitative, Biological, Social, Normative, Cultural
Source benchmarks	GQA, TextVQA, MMBench, OK-VQA, IconQA, ScienceQA, AI2D, MMMU, MLLM-CompBench, DocVQA
Episodes	3,440 (each item tested in original + swapped left/right)
Simulator	Simpler (based on ManiSkill / SAPIEN)
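
The episode count follows from the swap: 1,720 items × 2 layouts = 3,440 episodes. The sketch below shows one way such an expansion could be built and why it pins chance at 50%. The `Episode` fields, the file names, and the choice to put the correct image on the left in the "original" layout are assumptions made for illustration, not details from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Episode:
    item_id: int
    left_image: str
    right_image: str
    correct_side: str  # "left" or "right"

def build_episodes(items):
    """Expand each binary item into an original and a left/right-swapped episode."""
    episodes = []
    for item_id, (correct_img, distractor_img) in enumerate(items):
        # Assumed "original" layout: correct image on the left plate.
        episodes.append(Episode(item_id, correct_img, distractor_img, "left"))
        # Swapped layout: same pair, sides exchanged.
        episodes.append(Episode(item_id, distractor_img, correct_img, "right"))
    return episodes

# Hypothetical placeholder items standing in for the 1,720 questions.
items = [(f"correct_{i}.png", f"distractor_{i}.png") for i in range(1720)]
episodes = build_episodes(items)
print(len(episodes))  # 3440

# A policy that always goes to the left plate is right on exactly half the episodes.
always_left = sum(ep.correct_side == "left" for ep in episodes) / len(episodes)
print(always_left)  # 0.5
```

Because every item appears once per side, a model with a fixed left or right bias cannot score above 50%, which is what makes the chance baseline meaningful.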

Large-Scale Study: 7 VLAs × 9 VLMs

Models Evaluated

Group	Models
VLA	π₀, OpenVLA, Magma, Xiaomi-Robotics-R0, InternVLA-M1, SmolVLA, SpatialVLA
VLM	Qwen2.5-VL, Ovis, PaliGemma, InternVL, and others (action-free text probe)
VQA co-trained	Magma, Xiaomi-Robotics-R0, InternVLA-M1
Robotics-only	OpenVLA, SpatialVLA, π₀

Key Findings

RQ1: Simple perceptions survive

Nearly all models handle Color and Shape well. Basic perceptual grounding is preserved after VLM→VLA adaptation.

RQ2: Semantic knowledge collapses

On Emotion, Attribute, State, Time, Symmetry, Counting, Celebrity, Living World, Traffic, and Public Info, most VLAs hover at or near chance (50%). Magma is the only consistent exception, and even Magma stays near chance on Symmetry and Counting.

Category	Magma	OpenVLA	π₀	SpatialVLA	InternVLA-M1
Shape	88%	80%	50%	70%	70%
Color	90%	80%	78%	65%	70%
Emotion	70%	50%	50%	53%	50%
Counting	53%	50%	50%	50%	50%
Symmetry	50%	50%	50%	50%	50%
Time	65%	50%	50%	53%	50%
Celebrity	75%	50%	50%	55%	50%
Traffic	73%	50%	50%	57%	50%

No evaluated VLA rises meaningfully above chance on Symmetry or Counting; the best result in the table is Magma's 53% on Counting. These categories are uniformly challenging.
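
One way to read the table is as a margin over the 50% chance level. The short sketch below averages that margin over the six semantic rows shown above (Emotion, Counting, Symmetry, Time, Celebrity, Traffic). The numbers are copied from the table; the averaging itself is an illustrative summary, not a metric reported by the paper, and it covers only the categories listed here rather than all twelve.

```python
CHANCE = 50  # percent; each episode is a binary choice

CATEGORIES = ["Emotion", "Counting", "Symmetry", "Time", "Celebrity", "Traffic"]

# Success rates (%) copied from the table above, in CATEGORIES order.
TABLE = {
    "Magma":        [70, 53, 50, 65, 75, 73],
    "OpenVLA":      [50, 50, 50, 50, 50, 50],
    "π₀":           [50, 50, 50, 50, 50, 50],
    "SpatialVLA":   [53, 50, 50, 53, 55, 57],
    "InternVLA-M1": [50, 50, 50, 50, 50, 50],
}

scores = {model: dict(zip(CATEGORIES, row)) for model, row in TABLE.items()}

def mean_margin(per_category):
    """Average number of points above chance across the given categories."""
    return sum(value - CHANCE for value in per_category.values()) / len(per_category)

for model, per_category in scores.items():
    print(f"{model}: {mean_margin(per_category):+.1f} pts over chance")
```

On these rows Magma averages roughly 14 points over chance and SpatialVLA 3 points, while the other three models sit exactly at 50%.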

RQ3: The VLM→VLA gap is 20-40 pts

Each VLA is compared with its source VLM, which is evaluated with an action-free text probe. The source VLMs outperform their VLA counterparts by 20-40 percentage points on most knowledge categories. Something is lost in translation.

RQ4: Knowledge is still there - it just can't get out

Layerwise intent probing tells a revealing story:

Answer-relevant signals peak in middle layers of the VLM backbone
Performance attenuates toward the action head - often dropping to near chance
The model knows the answer internally, but the final layers that map representations → actions fail to preserve it

This suggests a bottleneck between semantic representation and action generation, not wholesale knowledge erasure.
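
For readers unfamiliar with layerwise probing, the sketch below shows the general shape of the analysis: fit a small linear classifier on each layer's hidden states and track its held-out accuracy across depth. Everything here is an assumption for illustration. The probe is a simple difference-of-means classifier rather than whatever the authors used, the hidden states are synthetic Gaussian vectors, and the per-layer signal strengths are hard-coded to mimic the reported rise-then-fade pattern, so the output demonstrates the procedure and is not a result.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_probe(features, labels):
    """Difference-of-means linear probe; returns (weights, bias)."""
    mu1 = mean_vector([x for x, y in zip(features, labels) if y == 1])
    mu0 = mean_vector([x for x, y in zip(features, labels) if y == 0])
    weights = [a - b for a, b in zip(mu1, mu0)]
    midpoint = [(a + b) / 2 for a, b in zip(mu1, mu0)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias

def probe_accuracy(probe, features, labels):
    """Held-out accuracy of the probe on (features, labels)."""
    weights, bias = probe
    preds = [int(sum(w * v for w, v in zip(weights, x)) + bias > 0) for x in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def synthetic_layer(labels, signal, dim, rng):
    """Synthetic stand-in for one layer: unit noise plus a label shift on dim 0."""
    rows = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] += signal if y == 1 else -signal
        rows.append(row)
    return rows

rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(600)]  # 1 = correct answer is on the left

# Hypothetical signal per layer: strongest mid-stack, gone at the action head.
signal_by_layer = [0.5, 1.5, 3.0, 1.0, 0.0]

accuracy_by_layer = []
for signal in signal_by_layer:
    feats = synthetic_layer(labels, signal, dim=8, rng=rng)
    probe = fit_probe(feats[:400], labels[:400])  # train split
    accuracy_by_layer.append(probe_accuracy(probe, feats[400:], labels[400:]))

print([round(acc, 2) for acc in accuracy_by_layer])
```

With real models, `synthetic_layer` would be replaced by activations captured at each layer. A curve that peaks in the middle and falls to about 0.5 near the action head would match the bottleneck reading described above.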

RQ5: VQA co-training helps

Models trained with joint VLM+robotics supervision (Magma, Xiaomi-Robotics-R0, InternVLA-M1) consistently outperform robotics-only models on higher-level semantic categories. Continued vision-language exposure preserves actionable knowledge.

RQ6: Downstream fine-tuning can make it worse

Fine-tuning OpenVLA on a pick-and-place task (SFT or RL) did not improve knowledge-sensitive performance. Some categories actually dropped (State: -8 pts, Color: -5 pts). Standard task-specific adaptation may further bias models toward narrow motor optimization at the expense of general knowledge.

Why It Matters

This paper systematically examines a question the field has largely hand-waved away: do VLA models actually understand what they're acting on? The answer is mostly no, once you move beyond color and shape.

A robot that can grasp a cup but can't tell a dirty cup from a clean one, or a sad human from a neutral one, is fundamentally limited as a household assistant. The results make a clear case that the next generation of embodied models needs:

Training objectives that explicitly preserve semantic knowledge during robotics adaptation
Architectures that maintain answer-relevant signal through the action head
Evaluation that measures knowledge retention, not just manipulation success

Paper: arXiv:2606.19297
Project page: tttonyalpha.github.io/act2answer
License: CC BY 4.0

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
