Under Review
A training-free, multi-agent VLM evaluator that turns generated robot-manipulation videos into structured, temporally localized diagnostic reports — telling you what failed, when, why, and how severely.
¹Center for AI Research, VinUniversity; ²VinRobotics; ³DFKI; ⁴University of Stuttgart; ⁵Max Planck Research School for Intelligent Systems; ⁶Technische Universität Darmstadt
†Project Leads
RoboGaze vs. monolithic VLM judges. Where standard scalar evaluators miss fine-grained physical anomalies like object penetration (A, B), RoboGaze delivers interpretable, temporally localized failure diagnostics — approaching the human ceiling on both description-F1 and clean-clip accuracy (C).
Abstract
Description-F1: +43 points over baselines
Temporal F1×IoU: +37 points alignment gain
Clean-clip accuracy: <25% → 80%+ (cry-wolf flaw, fixed)
Validated on: 382 clips, 8 VLM backbones
Recent advances in robot world models enable synthetic video generation for embodied prediction and planning. However, evaluating these videos is challenging: visually realistic outputs often violate physical laws, temporal consistency, or task logic, while conventional metrics and monolithic Vision-Language Model (VLM) judges fail to generalize or provide precise diagnostic value.
We present RoboGaze, a training-free, multi-agent VLM framework that provides structured, interpretable evaluation for generated robot-manipulation videos. Given a task instruction and video, RoboGaze operates via a three-stage pipeline: task-scene grounding, dimension-specific specialist routing, and critic-based verification. It outputs temporally localized glitch reports categorized under a novel 6-dimension, 30-type robotics-specific taxonomy.
To benchmark RoboGaze, we introduce a human-validated dataset of 382 clips spanning simulated and real-world multi-view manipulation. Evaluating eight open-source and proprietary VLM backbones, RoboGaze dramatically outperforms zero-shot baselines — improving description-F1 by up to +43 points and temporal alignment by up to +37 points, closing roughly 85% of the gap to the human ceiling. Its critic verifier mitigates the “cry-wolf” false-positive flaw of standard VLMs, lifting clean-clip accuracy from under 25% to over 80%.
Contributions
A robotics-specific glitch taxonomy
A hierarchical 6×30 taxonomy of robot-video failures spanning task execution, instruction consistency, object interactions, robot behavior, physical plausibility, and visual quality.
RoboGazeBench
A human-validated benchmark of 382 generated robot-manipulation videos with temporally localized glitch annotations, severity labels, and diagnostic descriptions across simulation and real-world domains.
RoboGaze, the evaluator
A training-free, model-agnostic multi-agent VLM evaluator that performs temporally grounded failure diagnosis and produces structured diagnostic reports instead of scalar quality scores.
Comprehensive empirical validation
Across eight proprietary and open-source VLMs, RoboGaze shows substantially stronger agreement with human judgments and more accurate failure localization than existing prompting and evaluation baselines.
Method
The RoboGaze Pipeline
The RoboGaze framework. A three-phase pipeline for video-generation diagnosis: (1) extracting task and scene context memories; (2) routing suspicious temporal spans to six dimension-specific specialists to generate glitch hypotheses; and (3) verifying and synthesizing hypotheses into a final structured glitch report.
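The three phases can be sketched as a small orchestration function. This is a minimal illustration only, not the authors' implementation: the `Hypothesis` fields, the callable signatures, and the use of seconds for spans are all assumptions made for the example, and each agent is passed in as a plain callable so that any VLM backbone could sit behind it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Span = Tuple[float, float]  # (start, end); seconds are an assumed unit


@dataclass
class Hypothesis:
    """A candidate glitch proposed by a specialist (hypothetical schema)."""
    dimension: str  # one of the six taxonomy dimensions
    span: Span      # suspicious temporal span
    note: str       # free-text description from the specialist


def run_pipeline(
    instruction: str,
    video: object,
    ground: Callable[[str, object], dict],
    route: Callable[[dict, object], Dict[str, List[Span]]],
    specialists: Dict[str, Callable[[dict, object, Span], List[Hypothesis]]],
    verify: Callable[[dict, object, Hypothesis], bool],
) -> List[Hypothesis]:
    # Phase 1: extract task and scene context memories.
    context = ground(instruction, video)

    # Phase 2: send each suspicious span to the specialist for its dimension.
    hypotheses: List[Hypothesis] = []
    for dimension, spans in route(context, video).items():
        for span in spans:
            hypotheses.extend(specialists[dimension](context, video, span))

    # Phase 3: keep only the hypotheses the critic accepts.
    return [h for h in hypotheses if verify(context, video, h)]
```

Because every stage is injected, the same function can be exercised with stub callables in a unit test and with real VLM calls in an evaluation run.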
Taxonomy
Failures are organized along six coarse dimensions, each refined into fine-grained types — 30 in total — spanning the full execution stack of a manipulation rollout. Specialists reason at the fine-grained type level; headline results aggregate at the dimension level.
Task Progress
Subtask completion and execution order.
Instruction Consistency
Alignment with the commanded task.
Object–Scene
Object existence, identity, and interactions.
Robot-Body
Robot kinematics and self-consistency.
Physical Plausibility
Contacts, collisions, and dynamics.
Visual Quality
Rendering, artifacts, and temporal coherence.
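One way to represent the six dimensions and a single report entry in code is shown below. It is a sketch under assumptions: the dimension names come from the list above, but the field names, the span unit, and the example type and severity strings are hypothetical, and the 30 fine-grained types are left as free-form strings because they are not enumerated on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Dimension(Enum):
    TASK_PROGRESS = "Task Progress"
    INSTRUCTION_CONSISTENCY = "Instruction Consistency"
    OBJECT_SCENE = "Object–Scene"
    ROBOT_BODY = "Robot-Body"
    PHYSICAL_PLAUSIBILITY = "Physical Plausibility"
    VISUAL_QUALITY = "Visual Quality"


@dataclass(frozen=True)
class Glitch:
    """One row of a structured glitch report (hypothetical schema)."""
    dimension: Dimension
    glitch_type: str           # fine-grained type name, not enumerated here
    span: Tuple[float, float]  # (start, end) of the failure; unit assumed
    severity: str              # label vocabulary is an assumption
    description: str

    def __post_init__(self) -> None:
        start, end = self.span
        if not 0 <= start <= end:
            raise ValueError(f"invalid temporal span: {self.span}")


# Illustrative entry; the type and severity strings are made up.
example = Glitch(
    dimension=Dimension.PHYSICAL_PLAUSIBILITY,
    glitch_type="object penetration",
    span=(3.2, 4.0),
    severity="high",
    description="gripper passes through the mug",
)
```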
Key Findings
Finding 01
Monolithic VLM judges are brittle — they cry wolf on clean videos
Vanilla and chain-of-thought VLMs recognize suspicious motion but cannot reliably decide whether it constitutes a real, task-grounded failure — so they hallucinate glitches in clean clips. Clean-clip accuracy sits under 25% for most backbones; RoboGaze lifts it past 80%.
Clean-clip accuracy (%) — higher is better. Vanilla vs. Chain-of-Thought vs. RoboGaze, per backbone.
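As a rough illustration of the metric, clean-clip accuracy can be computed as the fraction of glitch-free clips for which the evaluator reports nothing. The scoring rule is an assumption here (a clean clip counts as correct only when the report is empty); the page does not spell out the exact definition.

```python
from typing import List, Sequence


def clean_clip_accuracy(reports: Sequence[List[str]]) -> float:
    """reports[i] holds the glitches an evaluator reported for clean clip i."""
    if not reports:
        raise ValueError("need at least one clean clip")
    # Assumed rule: any reported glitch on a clean clip is a false alarm.
    correct = sum(1 for report in reports if len(report) == 0)
    return correct / len(reports)


# Four clean clips; the evaluator cries wolf on two of them.
print(clean_clip_accuracy([[], ["penetration"], [], ["flicker", "jitter"]]))  # 0.5
```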
Finding 02
RoboGaze improves diagnostic agreement across every model family
Wrapping the same backbone with RoboGaze consistently improves semantic agreement with human diagnoses — for both proprietary and open-source models. The improvement comes from evaluator structure, not backbone scale: even the weakest open model wrapped by RoboGaze beats the strongest proprietary model prompted directly.
Description-F1 — agreement with human diagnostic descriptions, per backbone.
Per-dimension breakdown. Description-F1 for Gemini 3.1 Pro across the six dimensions, comparing the human ceiling, Vanilla, and RoboGaze. The largest gains appear on Task Progress, Instruction Consistency, Object–Scene, and Robot-Body, where diagnosis depends on relational and causal reasoning.
Finding 03
The critic verifier is the single most important component
Every structural component contributes, but ablating the critic verifier causes by far the largest degradation. The verifier's job is not to propose more glitches — it is to reject weak ones before they enter the report, checking each hypothesis against visual evidence, task context, and temporal consistency. This is what drives RoboGaze's clean-clip reliability.
Ablation on Gemma4-31B — removing each component (Description-F1). Removing the verifier hurts most.
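The reject-weak-hypotheses behavior can be illustrated with a simple filter. This is a deliberately simplified stand-in: in RoboGaze the critic is a VLM agent, whereas the numeric scores, the `ScoredHypothesis` fields, and the 0.5 threshold below are hypothetical and exist only to show the gating logic, where a hypothesis must pass all three checks to enter the report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredHypothesis:
    description: str
    visual_evidence: float       # hypothetical score in [0, 1]
    task_context: float          # hypothetical score in [0, 1]
    temporal_consistency: float  # hypothetical score in [0, 1]


def verify(hypotheses: List[ScoredHypothesis],
           threshold: float = 0.5) -> List[ScoredHypothesis]:
    """Keep a hypothesis only if its weakest check clears the threshold."""
    return [
        h for h in hypotheses
        if min(h.visual_evidence, h.task_context, h.temporal_consistency)
        >= threshold
    ]


candidates = [
    ScoredHypothesis("gripper clips through mug", 0.9, 0.8, 0.7),
    ScoredHypothesis("arm moves suspiciously fast", 0.6, 0.2, 0.5),
]
accepted = verify(candidates)
print([h.description for h in accepted])  # ['gripper clips through mug']
```

Gating on the minimum rather than the mean means one failed check is enough to reject, which matches the goal of returning an empty report for a clean clip.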
Experiments & Results
RoboGazeBench — 382 clips, three splits
Detected glitches on dataset examples
Each clip is a generated robot video; the panel below it is the structured report RoboGaze produces: the dimension, fine-grained type, temporal span, and severity of every failure it finds. The timeline marks when each glitch occurs.
Main results on RoboGazeBench (averaged across the three splits)
V = Vanilla, CoT = Chain-of-Thought, RG = RoboGaze (ours). Values in parentheses are RG's gain over CoT. Clean is clean-clip accuracy as a fraction.

| Family | Model | Desc. F1 (V) | Desc. F1 (CoT) | Desc. F1 (RG) | mIoU (V) | mIoU (CoT) | mIoU (RG) | F1×IoU (V) | F1×IoU (CoT) | F1×IoU (RG) | Clean (V) | Clean (CoT) | Clean (RG) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary | Gemini 3.1 Pro | 25.4 | 32.4 | 67.9 (+35.5) | 0.39 | 0.43 | 0.64 (+0.21) | 9.8 | 13.9 | 43.7 (+29.8) | 0.13 | 0.21 | 0.92 (+0.71) |
| Proprietary | GPT-5.5 | 23.9 | 30.5 | 65.1 (+34.6) | 0.37 | 0.41 | 0.63 (+0.22) | 8.8 | 12.5 | 41.1 (+28.6) | 0.09 | 0.18 | 0.90 (+0.72) |
| Proprietary | Gemini 3.1 Flash | 20.2 | 26.1 | 57.8 (+31.7) | 0.35 | 0.39 | 0.61 (+0.22) | 7.0 | 10.2 | 35.0 (+24.8) | 0.08 | 0.16 | 0.87 (+0.71) |
| Proprietary | Claude Sonnet 4.6 | 21.4 | 27.6 | 60.3 (+32.7) | 0.36 | 0.40 | 0.61 (+0.21) | 7.7 | 11.0 | 36.8 (+25.8) | 0.11 | 0.20 | 0.89 (+0.69) |
| Open-source | Gemma4-31B | 17.6 | 23.4 | 56.6 (+33.2) | 0.33 | 0.37 | 0.66 (+0.29) | 5.7 | 8.7 | 34.5 (+25.8) | 0.03 | 0.12 | 0.86 (+0.74) |
| Open-source | Qwen3.6-35B | 16.7 | 22.3 | 53.1 (+30.8) | 0.32 | 0.36 | 0.60 (+0.24) | 5.4 | 8.2 | 31.6 (+23.4) | 0.16 | 0.23 | 0.84 (+0.61) |
| Open-source | LLaVA-OV-2-8B | 14.2 | 19.5 | 47.9 (+28.4) | 0.30 | 0.33 | 0.56 (+0.23) | 4.3 | 6.5 | 26.9 (+20.4) | 0.18 | 0.25 | 0.82 (+0.57) |
| Open-source | InternVL3.5-38B | 15.5 | 21.0 | 50.1 (+29.1) | 0.31 | 0.35 | 0.58 (+0.23) | 4.8 | 7.4 | 28.9 (+21.5) | 0.04 | 0.13 | 0.81 (+0.68) |

Human ceiling: Desc. F1 75.5, mIoU 0.71, F1×IoU 47.1, Clean 0.94.
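The "roughly 85% of the gap" figure in the abstract can be reproduced from the averaged Description-F1 numbers for Gemini 3.1 Pro. This is a worked check under one assumption: that the claim refers to this backbone measured against its Vanilla baseline. The helper name `gap_closed` is introduced here for illustration.

```python
def gap_closed(baseline: float, method: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap recovered by the method."""
    if ceiling <= baseline:
        raise ValueError("ceiling must exceed baseline")
    return (method - baseline) / (ceiling - baseline)


# Description-F1 for Gemini 3.1 Pro, averaged across the three splits.
vanilla, robogaze, human = 25.4, 67.9, 75.5
print(f"{gap_closed(vanilla, robogaze, human):.1%}")  # 84.8%
```

The same Vanilla-to-RoboGaze difference, 67.9 − 25.4 = 42.5, is consistent with the headline "+43 points" once rounded.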
Qualitative Analysis
Representative cases: (a) the verifier rescues a true glitch a vanilla judge mislabels, (b) a vanilla false positive is rejected, and (c) a failure both methods detect but localize differently.
Vanilla baselines vs. RoboGaze. By contesting specialist hypotheses before they enter the report, RoboGaze rejects false positives and recovers true glitches with sharper temporal localization.
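Differences in localization such as case (c) are what a temporal IoU score captures. The function below is the standard interval intersection-over-union, offered as a sketch: the page does not define how mIoU is computed or how predicted spans are matched to annotated ones, so treat the details as assumptions.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start, end), with start <= end


def temporal_iou(pred: Span, truth: Span) -> float:
    """Intersection over union of two time intervals."""
    (p_start, p_end), (t_start, t_end) = pred, truth
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


# Same glitch, two localizations against an annotated span of (4.0, 8.0).
print(round(temporal_iou((2.0, 6.0), (4.0, 8.0)), 3))  # 0.333
print(temporal_iou((4.0, 8.0), (4.0, 8.0)))            # 1.0
```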
Citation
@article{nguyen2026robogaze,
  title   = {RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis},
  author  = {Nguyen, Minh-Loi and Diep, Nghiem Tuong and Nguyen, Hung Khang and
             Le, Minh and Le Thien, Doanh and Tran, Hoang H. and Le, Dung Duy and
             Duong, Vu and Sonntag, Daniel and Le, An Thai and Nguyen, Duy M. H. and
             Vien, Ngo Anh and Nhiem, Tran Van},
  journal = {arXiv preprint},
  year    = {2026}
}