Under Review

RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis

A training-free, multi-agent VLM evaluator that turns generated robot-manipulation videos into structured, temporally localized diagnostic reports — telling you what failed, when, why, and how severely.

Minh-Loi Nguyen1, Nghiem Tuong Diep2, Hung Khang Nguyen2, Minh Le2, Doanh Le Thien1,2, Hoang H. Tran1, Dung Duy Le1, Vu Duong1, Daniel Sonntag3, An Thai Le1,2,6, Duy M. H. Nguyen3,4,5, Ngo Anh Vien†1,2, Tran Van Nhiem†2

1Center for AI Research, VinUniversity   2VinRobotics   3DFKI   4University of Stuttgart   5Max Planck Research School for Intelligent Systems   6Technische Universität Darmstadt

†Project Leads


RoboGaze vs. monolithic VLM judges. Where standard scalar evaluators miss fine-grained physical anomalies like object penetration (A, B), RoboGaze delivers interpretable, temporally localized failure diagnostics — approaching the human ceiling on both description-F1 and clean-clip accuracy (C).

Abstract

Realistic robot videos can still be physically wrong — and a single score won't tell you why.

Description-F1: +43 points over baselines
Temporal F1×IoU: +37 points alignment gain
Clean-clip accuracy: <25% → 80%+ (cry-wolf flaw, fixed)
Validated on: 382 clips, 8 VLM backbones

Recent advances in robot world models enable synthetic video generation for embodied prediction and planning. However, evaluating these videos is challenging: visually realistic outputs often violate physical laws, temporal consistency, or task logic, while conventional metrics and monolithic Vision-Language Model (VLM) judges fail to generalize or provide precise diagnostic value.

We present RoboGaze, a training-free, multi-agent VLM framework that provides structured, interpretable evaluation for generated robot-manipulation videos. Given a task instruction and video, RoboGaze operates via a three-stage pipeline: task-scene grounding, dimension-specific specialist routing, and critic-based verification. It outputs temporally localized glitch reports categorized under a novel 6-dimension, 30-type robotics-specific taxonomy.

To benchmark RoboGaze, we introduce a human-validated dataset of 382 clips spanning simulated and real-world multi-view manipulation. Across eight open-source and proprietary VLM backbones, RoboGaze outperforms zero-shot baselines on every backbone: it improves description-F1 by up to +43 points and temporal alignment by up to +37 points, closing roughly 85% of the gap to the human ceiling. Its critic verifier mitigates the “cry-wolf” false-positive flaw of standard VLMs, lifting clean-clip accuracy from under 25% to over 80%.

Contributions

Four artifacts for rigorous robot-video evaluation.

1. A robotics-specific glitch taxonomy

A hierarchical taxonomy of robot-video failures, with 6 dimensions refined into 30 fine-grained types, spanning task execution, instruction consistency, object interactions, robot behavior, physical plausibility, and visual quality.

2. RoboGazeBench

A human-validated benchmark of 382 generated robot-manipulation videos with temporally localized glitch annotations, severity labels, and diagnostic descriptions across simulation and real-world domains.

3. RoboGaze, the evaluator

A training-free, model-agnostic multi-agent VLM evaluator that performs temporally grounded failure diagnosis and produces structured diagnostic reports instead of scalar quality scores.

4. Comprehensive empirical validation

Across eight proprietary and open-source VLMs, RoboGaze shows substantially stronger agreement with human judgments and more accurate failure localization than existing prompting and evaluation baselines.

Method

Three stages: ground the task, route to specialists, then verify.


The RoboGaze Pipeline

Phase 1

Input & Context Grounding

Parse the instruction and initial frame into task memory (objective, subtasks, completion criteria) and scene memory (objects, robot parts, layout, uncertainty); split the video into clips and group them into subtask segments.
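
The two memories and the clip grouping from Phase 1 could be represented roughly as follows. This is a minimal sketch rather than the RoboGaze implementation: the class and field names, the `split_into_clips` and `group_into_segments` helpers, and the 2-second clip length are all assumptions made for illustration. In RoboGaze the memories and the subtask assignment come from a VLM; here the subtask index per clip is simply passed in.

```python
from dataclasses import dataclass, field


@dataclass
class TaskMemory:
    """Task context parsed from the instruction (field names are assumptions)."""
    objective: str
    subtasks: list[str]
    completion_criteria: list[str]


@dataclass
class SceneMemory:
    """Scene context parsed from the initial frame (field names are assumptions)."""
    objects: list[str]
    robot_parts: list[str]
    layout: str
    uncertainty: list[str] = field(default_factory=list)


def split_into_clips(duration_s: float, clip_len_s: float = 2.0) -> list[tuple[float, float]]:
    """Cut [0, duration_s] into consecutive (start, end) windows.

    The fixed window length is a hypothetical choice; the last clip may be shorter.
    """
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration and clip length must be positive")
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


def group_into_segments(clips, subtask_ids):
    """Fuse consecutive clips that share a subtask index into (subtask_id, start, end)."""
    if len(clips) != len(subtask_ids):
        raise ValueError("need exactly one subtask index per clip")
    segments = []
    for (start, end), sid in zip(clips, subtask_ids):
        if segments and segments[-1][0] == sid:
            # Same subtask as the previous clip: extend the open segment.
            segments[-1] = (sid, segments[-1][1], end)
        else:
            segments.append((sid, start, end))
    return segments


clips = split_into_clips(5.0)
segments = group_into_segments(clips, [0, 0, 1])
```

Keeping segments as plain `(subtask_id, start, end)` tuples makes it easy to hand each later stage both the subtask it should reason about and the time window it should look at.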

Phase 2

Candidate Discovery & Specialist Analysis

A router predicts which of the six dimensions are plausible per subtask (sparse routing avoids noisy diagnoses); dimension specialists emit hypotheses with detection, reasoning, temporal span, evidence, severity, and confidence.
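
To make the hypothesis record and the idea of sparse routing concrete, here is a small sketch. The `Hypothesis` fields mirror the six attributes listed above, but their names and types are assumptions, as are the `route_dimensions` helper, the per-dimension plausibility scores, and the 0.5 threshold. In RoboGaze the router is a VLM, so the scores below stand in for whatever signal it actually produces.

```python
from dataclasses import dataclass

# The six taxonomy dimensions, named as on this page.
DIMENSIONS = (
    "Task Progress",
    "Instruction Consistency",
    "Object–Scene",
    "Robot-Body",
    "Physical Plausibility",
    "Visual Quality",
)


@dataclass
class Hypothesis:
    """One specialist finding (field names and types are assumptions)."""
    dimension: str
    detection: str                 # what the specialist claims went wrong
    reasoning: str                 # why it believes so
    span_s: tuple[float, float]    # (start, end) in seconds
    evidence: str                  # pointer to supporting frames or observations
    severity: str                  # label set is not specified on this page
    confidence: float              # assumed to lie in [0, 1]


def route_dimensions(plausibility: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Sparse routing: keep only the dimensions scored as plausible for a subtask."""
    unknown = set(plausibility) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    # Dimensions the router did not score are treated as implausible.
    return [d for d in DIMENSIONS if plausibility.get(d, 0.0) >= threshold]


scores = {"Physical Plausibility": 0.9, "Visual Quality": 0.2, "Task Progress": 0.6}
selected = route_dimensions(scores)  # only these specialists would be invoked
```

Only the selected specialists run, which is the sense in which routing is sparse: a dimension that is implausible for a subtask never gets the chance to emit a noisy hypothesis.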

Phase 3

Verification & Reporting

A critic verifier re-examines each hypothesis — accept, reject, or merge — against visual evidence, task context, and scene state, then synthesizes the survivors into one coherent report. This is what kills the cry-wolf false positives.
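
The accept, reject, or merge step can be sketched as a filter followed by an interval merge. Everything here is illustrative: the `Hypothesis` fields, the `verify` function, and the rule that overlapping hypotheses in the same dimension are merged are assumptions, and the `critic` callable stands in for the VLM-based verifier, whose actual decision procedure is not specified on this page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    dimension: str
    span_s: tuple[float, float]   # (start, end) in seconds
    description: str
    confidence: float


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed time intervals share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]


def verify(hypotheses: list[Hypothesis], critic: Callable[[Hypothesis], bool]) -> list[Hypothesis]:
    """Drop hypotheses the critic rejects, then merge same-dimension overlaps."""
    accepted = sorted(
        (h for h in hypotheses if critic(h)),
        key=lambda h: (h.dimension, h.span_s[0]),
    )
    report: list[Hypothesis] = []
    for h in accepted:
        last = report[-1] if report else None
        if last and last.dimension == h.dimension and overlaps(last.span_s, h.span_s):
            # Merge: union of the spans, description from the more confident one.
            best = last if last.confidence >= h.confidence else h
            report[-1] = Hypothesis(
                dimension=h.dimension,
                span_s=(min(last.span_s[0], h.span_s[0]), max(last.span_s[1], h.span_s[1])),
                description=best.description,
                confidence=best.confidence,
            )
        else:
            report.append(h)
    return report


hypotheses = [
    Hypothesis("Physical Plausibility", (2.0, 3.0), "gripper passes through cup", 0.9),
    Hypothesis("Physical Plausibility", (2.5, 4.0), "cup penetrated", 0.7),
    Hypothesis("Visual Quality", (0.0, 1.0), "flicker", 0.2),
]
# Stand-in critic: a confidence cutoff instead of a VLM re-examination.
report = verify(hypotheses, critic=lambda h: h.confidence >= 0.5)
```

In this toy run the low-confidence flicker hypothesis is rejected and the two overlapping penetration hypotheses collapse into one entry spanning 2.0 to 4.0 s, so the final report contains a single glitch.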


The RoboGaze framework. A three-phase pipeline for video-generation diagnosis: (1) extracting task and scene context memories; (2) routing suspicious temporal spans to six dimension-specific specialists to generate glitch hypotheses; and (3) verifying and synthesizing hypotheses into a final structured glitch report.

Taxonomy

Six dimensions. Thirty failure types. One interpretable map.

Failures are organized along six coarse dimensions, each refined into fine-grained types — 30 in total — spanning the full execution stack of a manipulation rollout. Specialists reason at the fine-grained type level; headline results aggregate at the dimension level.

Task Progress

Subtask completion and execution order.

Instruction Consistency

Alignment with the commanded task.

Object–Scene

Object existence, identity, and interactions.

Robot-Body

Robot kinematics and self-consistency.

Physical Plausibility

Contacts, collisions, and dynamics.

Visual Quality

Rendering, artifacts, and temporal coherence.
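
For tooling built around the taxonomy, the six dimensions above map naturally onto an enumeration. The sketch below is an assumption about how one might encode them, not part of RoboGaze: `Dimension`, `SCOPE`, and `parse_dimension` are hypothetical names, and the 30 fine-grained types are left out because they are not listed on this page.

```python
import re
from enum import Enum


class Dimension(Enum):
    TASK_PROGRESS = "Task Progress"
    INSTRUCTION_CONSISTENCY = "Instruction Consistency"
    OBJECT_SCENE = "Object–Scene"
    ROBOT_BODY = "Robot-Body"
    PHYSICAL_PLAUSIBILITY = "Physical Plausibility"
    VISUAL_QUALITY = "Visual Quality"


# One-line scope of each dimension, copied from the descriptions above.
SCOPE = {
    Dimension.TASK_PROGRESS: "Subtask completion and execution order.",
    Dimension.INSTRUCTION_CONSISTENCY: "Alignment with the commanded task.",
    Dimension.OBJECT_SCENE: "Object existence, identity, and interactions.",
    Dimension.ROBOT_BODY: "Robot kinematics and self-consistency.",
    Dimension.PHYSICAL_PLAUSIBILITY: "Contacts, collisions, and dynamics.",
    Dimension.VISUAL_QUALITY: "Rendering, artifacts, and temporal coherence.",
}


def _key(text: str) -> str:
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def parse_dimension(label: str) -> Dimension:
    """Map a free-text label (e.g. from a model response) onto a dimension."""
    wanted = _key(label)
    for dim in Dimension:
        if _key(dim.value) == wanted:
            return dim
    raise ValueError(f"not a taxonomy dimension: {label!r}")
```

Normalizing labels before matching means that variants such as "object-scene" or "Robot Body" still resolve to the intended dimension, while anything outside the six raises an error instead of silently creating a seventh category.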

Key Findings

What we learned evaluating eight VLMs on robot videos.

Finding 01

Monolithic VLM judges are brittle — they cry wolf on clean videos

Vanilla and chain-of-thought VLMs recognize suspicious motion but cannot reliably decide whether it constitutes a real, task-grounded failure — so they hallucinate glitches in clean clips. Clean-clip accuracy sits under 25% for most backbones; RoboGaze lifts it past 80%.

Clean-clip accuracy (%) — higher is better. Vanilla vs. Chain-of-Thought vs. RoboGaze, per backbone.
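
One plausible reading of clean-clip accuracy is the fraction of glitch-free clips for which the evaluator reports nothing. The page does not spell out the formula, so the definition below, along with the `clean_clip_accuracy` name and the report format, is an assumption.

```python
def clean_clip_accuracy(reports: dict[str, list[str]], clean_ids: set[str]) -> float:
    """Fraction of clean clips whose report is empty.

    `reports` maps clip id -> list of reported glitches; a clean clip counts
    as correct only if that list is empty (assumed definition).
    """
    if not clean_ids:
        raise ValueError("need at least one clean clip")
    correct = sum(1 for clip_id in clean_ids if not reports[clip_id])
    return correct / len(clean_ids)


reports = {
    "clip_a": [],
    "clip_b": ["object penetration"],  # a false alarm on a clean clip
    "clip_c": [],
    "clip_d": [],
}
accuracy = clean_clip_accuracy(reports, clean_ids={"clip_a", "clip_b", "clip_c", "clip_d"})
print(accuracy)  # 0.75
```

Under this reading, a judge that flags something in nearly every video scores close to zero no matter how plausible each individual flag sounds, which is the cry-wolf behavior described above.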

Finding 02

RoboGaze improves diagnostic agreement across every model family

Wrapping the same backbone with RoboGaze consistently improves semantic agreement with human diagnoses — for both proprietary and open-source models. The improvement comes from evaluator structure, not backbone scale: even the weakest open model wrapped by RoboGaze beats the strongest proprietary model prompted directly.

Description-F1 — agreement with human diagnostic descriptions, per backbone.

Per-dimension Description-F1 for Gemini 3.1 Pro — Human ceiling vs. Vanilla vs. RoboGaze.

Per-dimension breakdown. Description-F1 for Gemini 3.1 Pro across the six dimensions. The largest gains appear on Task Progress, Instruction Consistency, Object–Scene, and Robot-Body — where diagnosis depends on relational and causal reasoning.

Finding 03

The critic verifier is the single most important component

Every structural component contributes, but ablating the critic verifier causes by far the largest degradation. The verifier's job is not to propose more glitches — it is to reject weak ones before they enter the report, checking each hypothesis against visual evidence, task context, and temporal consistency. This is what drives RoboGaze's clean-clip reliability.

Ablation on Gemma4-31B — removing each component (Description-F1). Removing the verifier hurts most.

Experiments & Results

Catching glitches in the act, across three datasets.

RoboGazeBench — 382 clips, three splits

GR1-Sim (154 clips)
  • Domain: Simulated
  • View: Single
  • Duration: 8–9 s
GR1-Real (100 clips)
  • Domain: Real-initialized
  • View: Single
  • Duration: 5–6 s
DROID-MV (128 clips)
  • Domain: Real
  • View: Multi-view
  • Duration: 17–18 s

Detected glitches on dataset examples

Each clip is a generated robot video; the panel below it is the structured report RoboGaze produces: the dimension, fine-grained type, temporal span, and severity of every failure it finds. The timeline marks when each glitch occurs.

Main results on RoboGazeBench (averaged across the three splits)

V, CoT, and RG denote vanilla prompting, chain-of-thought prompting, and the same backbone wrapped by RoboGaze (ours). The Gain column shows the absolute gain of RoboGaze over the stronger baseline. Higher is better; Clean is clean-clip accuracy.

Family  Model               Desc. F1                   mIoU                       F1×IoU                     Clean
                            V     CoT   RG    Gain     V     CoT   RG    Gain     V     CoT   RG    Gain     V     CoT   RG    Gain
Prop.   Gemini 3.1 Pro      25.4  32.4  67.9  +35.5    0.39  0.43  0.64  +0.21    9.8   13.9  43.7  +29.8    0.13  0.21  0.92  +0.71
Prop.   GPT-5.5             23.9  30.5  65.1  +34.6    0.37  0.41  0.63  +0.22    8.8   12.5  41.1  +28.6    0.09  0.18  0.90  +0.72
Prop.   Gemini 3.1 Flash    20.2  26.1  57.8  +31.7    0.35  0.39  0.61  +0.22    7.0   10.2  35.0  +24.8    0.08  0.16  0.87  +0.71
Prop.   Claude Sonnet 4.6   21.4  27.6  60.3  +32.7    0.36  0.40  0.61  +0.21    7.7   11.0  36.8  +25.8    0.11  0.20  0.89  +0.69
Open    Gemma4-31B          17.6  23.4  56.6  +33.2    0.33  0.37  0.66  +0.29    5.7   8.7   34.5  +25.8    0.03  0.12  0.86  +0.74
Open    Qwen3.6-35B         16.7  22.3  53.1  +30.8    0.32  0.36  0.60  +0.24    5.4   8.2   31.6  +23.4    0.16  0.23  0.84  +0.61
Open    LLaVA-OV-2-8B       14.2  19.5  47.9  +28.4    0.30  0.33  0.56  +0.23    4.3   6.5   26.9  +20.4    0.18  0.25  0.82  +0.57
Open    InternVL3.5-38B     15.5  21.0  50.1  +29.1    0.31  0.35  0.58  +0.23    4.8   7.4   28.9  +21.5    0.04  0.13  0.81  +0.68

Human ceiling: Desc. F1 75.5, mIoU 0.71, F1×IoU 47.1, Clean 0.94
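
The "roughly 85% of the gap to the human ceiling" figure from the abstract can be checked against the Gemini 3.1 Pro Desc. F1 numbers in this table. The page does not state the formula, so the sketch below assumes the gap is measured from the vanilla baseline; `gap_closed` is a hypothetical helper name.

```python
def gap_closed(baseline: float, method: float, ceiling: float) -> float:
    """Share of the baseline-to-ceiling gap recovered by the method."""
    if ceiling <= baseline:
        raise ValueError("ceiling must exceed the baseline")
    return (method - baseline) / (ceiling - baseline)


# Desc. F1 for Gemini 3.1 Pro and the human ceiling, taken from the table.
vanilla, cot, robogaze, human = 25.4, 32.4, 67.9, 75.5

from_vanilla = round(100 * gap_closed(vanilla, robogaze, human), 1)
gain_over_cot = round(robogaze - cot, 1)

print(from_vanilla)   # 84.8
print(gain_over_cot)  # 35.5
```

Measured from vanilla prompting, RoboGaze recovers about 84.8% of the gap, in line with the abstract, and the difference to chain-of-thought reproduces the +35.5 listed as the gain for that row.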

Qualitative Analysis

Where verification changes the verdict.

Representative cases: (a) the verifier rescues a true glitch a vanilla judge mislabels, (b) a vanilla false positive is rejected, and (c) a failure both methods detect but localize differently.


Vanilla baselines vs. RoboGaze. By contesting specialist hypotheses before they enter the report, RoboGaze rejects false positives and recovers true glitches with sharper temporal localization.
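
Differences in temporal localization are usually quantified with interval intersection-over-union. The page reports mIoU and F1×IoU without defining them, so the sketch below assumes the standard definition over (start, end) spans in seconds; `temporal_iou` is an illustrative name, not a RoboGaze function.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two time spans given as (start, end) seconds."""
    if pred[0] > pred[1] or gt[0] > gt[1]:
        raise ValueError("each span must satisfy start <= end")
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    # Two zero-length spans have no measurable overlap; report 0 by convention.
    return intersection / union if union > 0 else 0.0


print(temporal_iou((2.0, 5.0), (3.0, 6.0)))  # 0.5
print(temporal_iou((0.0, 1.0), (2.0, 3.0)))  # 0.0
```

A prediction that detects the right failure but is shifted by one second on a three-second glitch already drops to an IoU of 0.5, which is why two methods can agree on what failed and still differ noticeably on localization.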

Citation

BibTeX

@article{nguyen2026robogaze,
  title   = {RoboGaze: Evaluating Robot World Models via
             Structured Vision-Language Analysis},
  author  = {Nguyen, Minh-Loi and Diep, Nghiem Tuong and Nguyen, Hung Khang and
             Le, Minh and Le Thien, Doanh and Tran, Hoang H. and Le, Dung Duy and
             Duong, Vu and Sonntag, Daniel and Le, An Thai and Nguyen, Duy M. H. and
             Vien, Ngo Anh and Nhiem, Tran Van},
  journal = {arXiv preprint},
  year    = {2026}
}