VLA / Robotics / Commonsense
Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
Nikita Kachaev*, Andrey Moskalenko*, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro · CogAI Lab, FusionBrain Lab, IAI MSU, Lomonosov MSU, NUST MISIS, Applied AI Institute, HSE University, Generalizable AI Systems, ISP RAS, MIRAI
The Problem: Knowledge Blindness in Embodied Agents
Vision-Language-Action (VLA) models are built by fine-tuning powerful pretrained vision-language models (VLMs) on robotics data. The implicit assumption is that commonsense and world knowledge survive intact. This paper asks an uncomfortable question: do they actually?
The issue is that VLA benchmarks (LIBERO, CALVIN, VLABench) measure only manipulation success - whether the arm picked up the right object. Failure is ambiguous: did the agent not know the cup is fragile, or did it just miss the grasp? Text-based probing of the VLM backbone sidesteps the real question - can the VLA act on the knowledge?
Act2Answer: A Lightweight Embodied Protocol
The authors introduce Act2Answer, a protocol that converts VLM-style QA into tabletop robot episodes:
- An instruction is given (e.g., "Place the cube on the sad face")
- Two candidate images sit on plates in the scene
- The agent must place a cube on the correct answer
- Success is measured by Soft Success Rate (SR) - proximity to the correct plate within tolerance ε
This strips away long-horizon planning and complex manipulation, isolating the knowledge-to-action pathway.
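The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Soft SR is a binary check that the cube's final 2D position lies within ε of the correct plate's centre, averaged over episodes. The function names, the coordinate convention, and the value of `eps` are all hypothetical.

```python
import math

def soft_success(cube_xy, correct_plate_xy, eps):
    """Assumed rule: 1.0 if the cube ends within eps of the correct plate centre."""
    return 1.0 if math.dist(cube_xy, correct_plate_xy) <= eps else 0.0

def soft_success_rate(episodes, eps):
    """Average soft_success over (cube_xy, correct_plate_xy) pairs."""
    scores = [soft_success(cube, plate, eps) for cube, plate in episodes]
    return sum(scores) / len(scores)

# Two toy episodes in metres (placeholder values, not data from the paper).
episodes = [
    ((0.10, 0.21), (0.10, 0.20)),   # cube lands 1 cm from the correct plate
    ((0.10, -0.20), (0.10, 0.20)),  # cube lands on the opposite side
]
rate = soft_success_rate(episodes, eps=0.05)
```

A tolerance-based check like this credits the agent for choosing the right plate even when the placement is slightly off-centre, which is what separates a knowledge failure from a motor imprecision.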
The Dataset
| Property | Value |
|---|---|
| Total items | 1,720 unique binary questions |
| Categories | 12 (Color, Shape, Attribute, State, Emotion, Symmetry, Counting, Time, Celebrity, Living World, Traffic, Public Info) |
| Domains | Physical, Temporal, Quantitative, Biological, Social, Normative, Cultural |
| Source benchmarks | GQA, TextVQA, MMBench, OK-VQA, IconQA, ScienceQA, AI2D, MMMU, MLLM-CompBench, DocVQA |
| Episodes | 3,440 (each item tested in original + swapped left/right) |
| Simulator | Simpler (based on ManiSkill / SAPIEN) |
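The episode count follows from the item count: 1,720 items × 2 layouts = 3,440 episodes. A rough sketch of that expansion is below; the `Item` and `Episode` types, their field names, and the choice to put the correct image on the left in the "original" layout are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    question: str
    correct_image: str
    distractor_image: str

@dataclass(frozen=True)
class Episode:
    instruction: str
    left_image: str
    right_image: str
    correct_side: str  # "left" or "right"

def expand_item(item):
    """Return the original layout and its left/right mirror."""
    original = Episode(item.question, item.correct_image, item.distractor_image, "left")
    swapped = Episode(item.question, item.distractor_image, item.correct_image, "right")
    return [original, swapped]

def build_episodes(items):
    return [ep for item in items for ep in expand_item(item)]

# Placeholder items standing in for the 1,720 binary questions.
items = [Item(f"question {i}", f"correct_{i}.png", f"distractor_{i}.png") for i in range(1720)]
episodes = build_episodes(items)

# A policy that always goes left is right on exactly half of the episodes.
always_left_acc = sum(ep.correct_side == "left" for ep in episodes) / len(episodes)
```

Testing every item in both layouts means a policy with a fixed side preference scores exactly 50%, so position bias cannot masquerade as knowledge.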
Large-Scale Study: 7 VLAs × 9 VLMs
Models Evaluated
| Group | Models |
|---|---|
| VLA | π₀, OpenVLA, Magma, Xiaomi-Robotics-R0, InternVLA-M1, SmolVLA, SpatialVLA |
| VLM | Qwen2.5-VL, Ovis, PaliGemma, InternVL, and others (action-free text probe) |
| VQA co-trained | Magma, Xiaomi-Robotics-R0, InternVLA-M1 |
| Robotics-only | OpenVLA, SpatialVLA, π₀ |
Key Findings
RQ1: Simple perception survives
Nearly all models handle Color and Shape well. Basic perceptual grounding is preserved after VLM→VLA adaptation.
RQ2: Semantic knowledge collapses
On Emotion, Attribute, State, Time, Symmetry, Counting, Celebrity, Living World, Traffic, and Public Info, most VLAs hover at or near chance (50%, since every question is binary). Magma is the only consistent exception, and even it stays at chance on Symmetry and Counting.
| Category | Magma | OpenVLA | π₀ | SpatialVLA | InternVLA-M1 |
|---|---|---|---|---|---|
| Shape | 88% | 80% | 50% | 70% | 70% |
| Color | 90% | 80% | 78% | 65% | 70% |
| Emotion | 70% | 50% | 50% | 53% | 50% |
| Counting | 53% | 50% | 50% | 50% | 50% |
| Symmetry | 50% | 50% | 50% | 50% | 50% |
| Time | 65% | 50% | 50% | 53% | 50% |
| Celebrity | 75% | 50% | 50% | 55% | 50% |
| Traffic | 73% | 50% | 50% | 57% | 50% |
No evaluated VLA performs meaningfully above chance on Symmetry or Counting; the best score in the table above is Magma's 53% on Counting. These categories are uniformly challenging.
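Whether a score such as 53% is distinguishable from the 50% chance level depends on how many episodes back it. One way to check is an exact one-sided binomial test, sketched below. The per-category episode counts are not given in this summary, so `trials = 100` is a placeholder assumption, and the function name is hypothetical.

```python
from math import comb

def p_value_above_chance(successes, trials, p=0.5):
    """Exact one-sided binomial p-value: P(X >= successes) under chance rate p."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

trials = 100  # placeholder; the real per-category counts are not stated here
p_53 = p_value_above_chance(53, trials)  # a 53% score
p_70 = p_value_above_chance(70, trials)  # a 70% score
```

Under this placeholder count, 53% is well within what random guessing produces, while 70% is not. With more episodes per category the same percentages would yield smaller p-values.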
RQ3: The VLM→VLA gap is 20-40 pts
Comparing each VLA to its source VLM via an action-free text probe shows that source VLMs outperform their VLA counterparts by 20-40 percentage points on most knowledge categories. Something is lost in translation.
RQ4: Knowledge is still there - it just can't get out
Layerwise intent probing tells a revealing story:
- Answer-relevant signals peak in middle layers of the VLM backbone
- Performance attenuates toward the action head - often dropping to near chance
- The model knows the answer internally, but the final layers that map representations → actions fail to preserve it
This suggests a bottleneck between semantic representation and action generation, not wholesale knowledge erasure.
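The idea behind layerwise probing can be illustrated with a toy example: fit a small linear classifier on each layer's hidden states and see where the correct answer is decodable. The sketch below makes several assumptions that are not from the paper. It uses synthetic features rather than real model activations, a simple mean-difference linear probe rather than whatever probe the authors trained, and a hand-picked `signal_per_layer` profile that peaks in the middle and vanishes at the end.

```python
import random

def make_layer_features(labels, signal, dim, rng):
    """Synthetic hidden states: Gaussian noise plus a label-dependent shift on dim 0."""
    feats = []
    for y in labels:
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if y == 1 else -signal
        feats.append(x)
    return feats

def fit_mean_difference_probe(feats, labels):
    """Linear probe: w = mean(class 1) - mean(class 0), threshold at the midpoint."""
    dim = len(feats[0])
    pos = [x for x, y in zip(feats, labels) if y == 1]
    neg = [x for x, y in zip(feats, labels) if y == 0]
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wd * (p + n) / 2 for wd, p, n in zip(w, mu_pos, mu_neg))
    return w, b

def probe_accuracy(w, b, feats, labels):
    preds = [1 if sum(wd * xd for wd, xd in zip(w, x)) + b > 0 else 0 for x in feats]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

rng = random.Random(0)
labels = [i % 2 for i in range(400)]          # balanced binary answers
train_y, test_y = labels[:200], labels[200:]
signal_per_layer = [0.2, 1.0, 2.0, 1.0, 0.0]  # toy profile, strongest in the middle

layer_acc = []
for signal in signal_per_layer:
    train_x = make_layer_features(train_y, signal, dim=8, rng=rng)
    test_x = make_layer_features(test_y, signal, dim=8, rng=rng)
    w, b = fit_mean_difference_probe(train_x, train_y)
    layer_acc.append(probe_accuracy(w, b, test_x, test_y))
```

In this toy setup, held-out probe accuracy is high where the injected signal is strong and falls to roughly chance at the last "layer". Applied to real activations, the same per-layer accuracy curve is what would reveal a mid-network peak followed by attenuation toward the action head.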
RQ5: VQA co-training helps
Models trained with joint VLM+robotics supervision (Magma, Xiaomi-Robotics-R0, InternVLA-M1) consistently outperform robotics-only models on higher-level semantic categories. Continued vision-language exposure preserves actionable knowledge.
RQ6: Downstream fine-tuning can make it worse
Fine-tuning OpenVLA on a pick-and-place task (SFT or RL) did not improve knowledge-sensitive performance. Some categories actually dropped (State: -8 pts, Color: -5 pts). Standard task-specific adaptation may further bias models toward narrow motor optimization at the expense of general knowledge.
Why It Matters
This paper systematically examines a question the field has been hand-waving past: do VLA models actually understand what they're acting on? The answer is mostly no, once you move beyond color and shape.
A robot that can grasp a cup but can't tell a dirty cup from a clean one, or a sad human from a neutral one, is fundamentally limited as a household assistant. The results make a clear case that the next generation of embodied models needs:
- Training objectives that explicitly preserve semantic knowledge during robotics adaptation
- Architectures that maintain answer-relevant signal through the action head
- Evaluation that measures knowledge retention, not just manipulation success
Paper: arXiv:2606.19297
Project page: tttonyalpha.github.io/act2answer
License: CC BY 4.0