RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis

Anonymous

Under Review

A training-free, multi-agent VLM evaluator that turns generated robot-manipulation videos into structured, temporally localized diagnostic reports — telling you what failed, when, why, and how severely.

Minh-Loi Nguyen¹, Nghiem Tuong Diep², Hung Khang Nguyen², Minh Le², Doanh Le Thien¹,², Hoang H. Tran¹, Dung Duy Le¹, Vu Duong¹, Daniel Sonntag³, An Thai Le¹,²,⁶, Duy M. H. Nguyen³,⁴,⁵, Ngo Anh Vien†,¹,², Tran Van Nhiem†,²

¹Center for AI Research, VinUniversity ²VinRobotics ³DFKI ⁴University of Stuttgart ⁵Max Planck Research School for Intelligent Systems ⁶Technische Universität Darmstadt

†Project Leads

Paper (coming soon) · Code (coming soon) · RoboGazeBench (coming soon)

Abstract

Realistic robot videos can still be physically wrong — and a single score won't tell you why.

Description-F₁: +43 points over baselines

Temporal F₁×IoU: +37 points alignment gain

Clean-clip accuracy: <25% → 80%+ (cry-wolf flaw, fixed)

Validated on: 382 clips, 8 VLM backbones

Recent advances in robot world models enable synthetic video generation for embodied prediction and planning. However, evaluating these videos is challenging: visually realistic outputs often violate physical laws, temporal consistency, or task logic, while conventional metrics and monolithic Vision-Language Model (VLM) judges fail to generalize or provide precise diagnostic value.

We present RoboGaze, a training-free, multi-agent VLM framework that provides structured, interpretable evaluation for generated robot-manipulation videos. Given a task instruction and video, RoboGaze operates via a three-stage pipeline: task-scene grounding, dimension-specific specialist routing, and critic-based verification. It outputs temporally localized glitch reports categorized under a novel 6-dimension, 30-type robotics-specific taxonomy.

To benchmark RoboGaze, we introduce a human-validated dataset of 382 clips spanning simulated and real-world multi-view manipulation. Across eight open-source and proprietary VLM backbones, RoboGaze outperforms zero-shot baselines on every backbone, improving description-F₁ by up to 43 points and temporal alignment by up to 37 points and closing roughly 85% of the gap to the human ceiling. Its critic verifier mitigates the “cry-wolf” false-positive flaw of standard VLMs, lifting clean-clip accuracy from under 25% to over 80%.
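The 85% figure can be reproduced from the main results table under one assumption: that it refers to Description-F₁ for Gemini 3.1 Pro, comparing vanilla prompting (25.4) with RoboGaze (67.9) against the human ceiling (75.5). The helper name `gap_closed` is ours and does not come from the paper.

```python
def gap_closed(baseline: float, method: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap recovered by the method."""
    if ceiling <= baseline:
        raise ValueError("ceiling must exceed baseline")
    return (method - baseline) / (ceiling - baseline)


# Description-F1 values read from the main results table (Gemini 3.1 Pro).
# Which backbone the 85% claim refers to is our assumption.
vanilla, robogaze, human = 25.4, 67.9, 75.5

fraction = gap_closed(vanilla, robogaze, human)
print(f"{fraction:.1%}")  # 84.8%
```

The same values give the headline gain over vanilla prompting: 67.9 − 25.4 = 42.5, which rounds to the quoted 43 points.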

Contributions

Four artifacts for rigorous robot-video evaluation.

A robotics-specific glitch taxonomy

A hierarchical taxonomy of robot-video failures with 6 dimensions and 30 fine-grained types, spanning task execution, instruction consistency, object interactions, robot behavior, physical plausibility, and visual quality.

RoboGazeBench

A human-validated benchmark of 382 generated robot-manipulation videos with temporally localized glitch annotations, severity labels, and diagnostic descriptions across simulation and real-world domains.

RoboGaze, the evaluator

A training-free, model-agnostic multi-agent VLM evaluator that performs temporally grounded failure diagnosis and produces structured diagnostic reports instead of scalar quality scores.

Comprehensive empirical validation

Across eight proprietary and open-source VLMs, RoboGaze shows substantially stronger agreement with human judgments and more accurate failure localization than existing prompting and evaluation baselines.

Method

Three stages: ground the task, route to specialists, then verify.

The RoboGaze Pipeline

Phase 1

Input & Context Grounding

Parse the instruction and initial frame into task memory (objective, subtasks, completion criteria) and scene memory (objects, robot parts, layout, uncertainty); split the video into clips and group them into subtask segments.
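A minimal sketch of the Phase 1 data structures, assuming the two memories are plain records and that clips are fixed-length windows. The field names follow the description above; the class names, `split_into_clips`, `group_into_segments`, the 2-second clip length, and the example task are illustrative assumptions, not RoboGaze's actual interface. In the real system a VLM would populate the memories and assign clips to subtasks.

```python
import math
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class TaskMemory:
    objective: str
    subtasks: list[str]
    completion_criteria: list[str]


@dataclass
class SceneMemory:
    objects: list[str]
    robot_parts: list[str]
    layout: str
    uncertainty: list[str] = field(default_factory=list)


def split_into_clips(duration_s: float, clip_len_s: float) -> list[tuple[float, float]]:
    """Cut [0, duration_s] into consecutive clips of at most clip_len_s seconds."""
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration and clip length must be positive")
    n_clips = math.ceil(duration_s / clip_len_s)
    return [(i * clip_len_s, min((i + 1) * clip_len_s, duration_s)) for i in range(n_clips)]


def group_into_segments(clips, subtask_ids):
    """Merge consecutive clips that share a subtask index into one segment."""
    if len(clips) != len(subtask_ids):
        raise ValueError("need exactly one subtask index per clip")
    segments = []
    for subtask, run in groupby(zip(subtask_ids, clips), key=lambda pair: pair[0]):
        spans = [span for _, span in run]
        segments.append((subtask, spans[0][0], spans[-1][1]))
    return segments


# Hand-written example values (hypothetical task and scene).
task = TaskMemory(
    objective="place the cup on the plate",
    subtasks=["grasp the cup", "place the cup on the plate"],
    completion_criteria=["cup rests on the plate", "gripper is open"],
)
scene = SceneMemory(objects=["cup", "plate"], robot_parts=["gripper"], layout="tabletop")

clips = split_into_clips(duration_s=9.0, clip_len_s=2.0)
segments = group_into_segments(clips, subtask_ids=[0, 0, 1, 1, 1])
print(segments)  # [(0, 0.0, 4.0), (1, 4.0, 9.0)]
```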

Phase 2

Candidate Discovery & Specialist Analysis

A router predicts which of the six dimensions are plausible per subtask (sparse routing avoids noisy diagnoses); dimension specialists emit hypotheses with detection, reasoning, temporal span, evidence, severity, and confidence.
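Sparse routing and the hypothesis record might look like the sketch below. The six dimension names and the six hypothesis fields come from this page; the per-dimension plausibility scores, the 0.5 threshold, and the names `route` and `GlitchHypothesis` are assumptions made for illustration. The example hypothesis is invented, not taken from the benchmark.

```python
from dataclasses import dataclass

DIMENSIONS = (
    "Task Progress",
    "Instruction Consistency",
    "Object–Scene",
    "Robot-Body",
    "Physical Plausibility",
    "Visual Quality",
)


@dataclass
class GlitchHypothesis:
    dimension: str
    detection: str                 # what the specialist thinks went wrong
    reasoning: str                 # why it counts as a failure
    span_s: tuple[float, float]    # temporal span in seconds
    evidence: str                  # visual evidence cited
    severity: str
    confidence: float


def route(plausibility: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return only the dimensions whose plausibility reaches the threshold."""
    unknown = set(plausibility) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return [d for d in DIMENSIONS if plausibility.get(d, 0.0) >= threshold]


# Hypothetical router scores for one subtask segment.
scores = {"Task Progress": 0.8, "Physical Plausibility": 0.6, "Visual Quality": 0.1}
selected = route(scores)
print(selected)  # ['Task Progress', 'Physical Plausibility']

# Only the selected specialists run; each would emit records like this one.
example = GlitchHypothesis(
    dimension="Physical Plausibility",
    detection="cup passes through the plate",
    reasoning="solid objects cannot interpenetrate",
    span_s=(5.0, 6.5),
    evidence="cup rim visible below plate surface",
    severity="high",
    confidence=0.7,
)
```

Dimensions missing from the score dictionary default to 0.0, so a specialist runs only when the router explicitly flags its dimension.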

Phase 3

Verification & Reporting

A critic verifier re-examines each hypothesis — accept, reject, or merge — against visual evidence, task context, and scene state, then synthesizes the survivors into one coherent report. This is what kills the cry-wolf false positives.
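The accept, reject, or merge decision can be sketched as follows. This is a simplification under stated assumptions: a confidence threshold stands in for the VLM critic's check against evidence, task context, and scene state, and two hypotheses are merged when they share a dimension and their spans overlap. The names `Hypothesis` and `verify` and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    dimension: str
    start_s: float
    end_s: float
    confidence: float


def verify(hypotheses: list[Hypothesis], min_conf: float = 0.5) -> list[Hypothesis]:
    """Reject weak hypotheses, then merge overlapping ones within a dimension."""
    # Reject: stand-in for the critic's evidence check (assumption).
    kept = [h for h in hypotheses if h.confidence >= min_conf]
    kept.sort(key=lambda h: (h.dimension, h.start_s))

    report: list[Hypothesis] = []
    for h in kept:
        last = report[-1] if report else None
        if last is not None and last.dimension == h.dimension and h.start_s <= last.end_s:
            # Merge: extend the previous span and keep the higher confidence.
            last.end_s = max(last.end_s, h.end_s)
            last.confidence = max(last.confidence, h.confidence)
        else:
            # Accept: copy so the caller's objects are not mutated by later merges.
            report.append(Hypothesis(h.dimension, h.start_s, h.end_s, h.confidence))
    return report


candidates = [
    Hypothesis("Robot-Body", 1.0, 3.0, 0.9),
    Hypothesis("Robot-Body", 2.5, 4.0, 0.7),
    Hypothesis("Visual Quality", 0.0, 1.0, 0.2),
]
final = verify(candidates)
print(final)
```

Here the two overlapping Robot-Body hypotheses collapse into one entry spanning 1.0 to 4.0 s, and the low-confidence Visual Quality hypothesis is dropped. A clip whose hypotheses are all rejected yields an empty report, which is the behavior that matters for clean clips.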

The RoboGaze framework. A three-phase pipeline for video-generation diagnosis: (1) extracting task and scene context memories; (2) routing suspicious temporal spans to six dimension-specific specialists to generate glitch hypotheses; and (3) verifying and synthesizing hypotheses into a final structured glitch report.

Taxonomy

Six dimensions. Thirty failure types. One interpretable map.

Failures are organized along six coarse dimensions, each refined into fine-grained types — 30 in total — spanning the full execution stack of a manipulation rollout. Specialists reason at the fine-grained type level; headline results aggregate at the dimension level.

Task Progress

Subtask completion and execution order.

Instruction Consistency

Alignment with the commanded task.

Object–Scene

Object existence, identity, and interactions.

Robot-Body

Robot kinematics and self-consistency.

Physical Plausibility

Contacts, collisions, and dynamics.

Visual Quality

Rendering, artifacts, and temporal coherence.
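One way to encode the six dimensions and roll type-level findings up to the dimension level is sketched below. The dimension names and descriptions are taken from the list above. The 30 fine-grained type names are not listed on this page, so the type strings in the example are placeholders, and `aggregate_by_dimension` is a hypothetical helper.

```python
from collections import Counter
from enum import Enum


class Dimension(Enum):
    TASK_PROGRESS = "Subtask completion and execution order."
    INSTRUCTION_CONSISTENCY = "Alignment with the commanded task."
    OBJECT_SCENE = "Object existence, identity, and interactions."
    ROBOT_BODY = "Robot kinematics and self-consistency."
    PHYSICAL_PLAUSIBILITY = "Contacts, collisions, and dynamics."
    VISUAL_QUALITY = "Rendering, artifacts, and temporal coherence."


def aggregate_by_dimension(findings: list[tuple[Dimension, str]]) -> Counter:
    """Count findings per dimension, discarding the fine-grained type."""
    return Counter(dimension for dimension, _fine_type in findings)


# Placeholder type names; the real taxonomy defines 30 specific types.
findings = [
    (Dimension.ROBOT_BODY, "placeholder_type_a"),
    (Dimension.ROBOT_BODY, "placeholder_type_b"),
    (Dimension.VISUAL_QUALITY, "placeholder_type_c"),
]
counts = aggregate_by_dimension(findings)
print(counts[Dimension.ROBOT_BODY], counts[Dimension.VISUAL_QUALITY])  # 2 1
```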

Key Findings

What we learned evaluating eight VLMs on robot videos.

Finding 01

Monolithic VLM judges are brittle — they cry wolf on clean videos

Vanilla and chain-of-thought VLMs recognize suspicious motion but cannot reliably decide whether it constitutes a real, task-grounded failure — so they hallucinate glitches in clean clips. Clean-clip accuracy sits under 25% for most backbones; RoboGaze lifts it past 80%.

Clean-clip accuracy (%) — higher is better. Vanilla vs. Chain-of-Thought vs. RoboGaze, per backbone.
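Clean-clip accuracy can be read as the share of glitch-free clips for which the evaluator reports nothing. The sketch below assumes exactly that definition, which this page does not spell out; `clean_clip_accuracy` is a hypothetical helper and the example reports are invented.

```python
def clean_clip_accuracy(reports: list[list[str]]) -> float:
    """reports[i] lists the glitches an evaluator reported for clean clip i.

    A clean clip counts as correct only when its report is empty.
    """
    if not reports:
        raise ValueError("need at least one clean clip")
    correct = sum(1 for report in reports if not report)
    return correct / len(reports)


# Four clean clips: a cry-wolf judge flags three of them.
cry_wolf = [["gripper jitter"], ["object flicker"], ["odd motion"], []]
# A verified evaluator rejects those weak hypotheses.
verified = [[], [], [], []]

print(clean_clip_accuracy(cry_wolf))  # 0.25
print(clean_clip_accuracy(verified))  # 1.0
```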

Finding 02

RoboGaze improves diagnostic agreement across every model family

Wrapping the same backbone with RoboGaze consistently improves semantic agreement with human diagnoses — for both proprietary and open-source models. The improvement comes from evaluator structure, not backbone scale: even the weakest open model wrapped by RoboGaze beats the strongest proprietary model prompted directly.

Description-F₁ — agreement with human diagnostic descriptions, per backbone.

Per-dimension Description-F₁ for Gemini 3.1 Pro — Human ceiling vs. Vanilla vs. RoboGaze.

Per-dimension breakdown. Description-F₁ for Gemini 3.1 Pro across the six dimensions. The largest gains appear on Task Progress, Instruction Consistency, Object–Scene, and Robot-Body — where diagnosis depends on relational and causal reasoning.

Finding 03

The critic verifier is the single most important component

Every structural component contributes, but ablating the critic verifier causes by far the largest degradation. The verifier's job is not to propose more glitches — it is to reject weak ones before they enter the report, checking each hypothesis against visual evidence, task context, and temporal consistency. This is what drives RoboGaze's clean-clip reliability.

Ablation on Gemma4-31B — removing each component (Description-F₁). Removing the verifier hurts most.

Experiments & Results

Catching glitches in the act, across three datasets.

RoboGazeBench — 382 clips, three splits

Split	Clips	Domain	View	Duration
GR1-Sim	154	Simulated	Single	8–9 s
GR1-Real	100	Real-initialized	Single	5–6 s
DROID-MV	128	Real	Multi-view	17–18 s

Detected glitches on dataset examples

Each clip is a generated robot video, and the panel below it is the structured report RoboGaze produces: the dimension, fine-grained type, temporal span, and severity of every failure it finds. The timeline marks when each glitch occurs.

Main results on RoboGazeBench (averaged across the three splits)

V, CoT, and RG denote vanilla prompting, chain-of-thought prompting, and the same backbone wrapped by RoboGaze (ours). Values in parentheses are the absolute gain of RoboGaze over the stronger of the two baselines. Higher is better; Clean is clean-clip accuracy, reported as a fraction.
Family	Model	Desc. F₁ (V)	Desc. F₁ (CoT)	Desc. F₁ (RG)	mIoU (V)	mIoU (CoT)	mIoU (RG)	F₁×IoU (V)	F₁×IoU (CoT)	F₁×IoU (RG)	Clean (V)	Clean (CoT)	Clean (RG)
Prop.	Gemini 3.1 Pro	25.4	32.4	67.9 (+35.5)	.39	.43	.64 (+.21)	9.8	13.9	43.7 (+29.8)	.13	.21	.92 (+.71)
Prop.	GPT-5.5	23.9	30.5	65.1 (+34.6)	.37	.41	.63 (+.22)	8.8	12.5	41.1 (+28.6)	.09	.18	.90 (+.72)
Prop.	Gemini 3.1 Flash	20.2	26.1	57.8 (+31.7)	.35	.39	.61 (+.22)	7.0	10.2	35.0 (+24.8)	.08	.16	.87 (+.71)
Prop.	Claude Sonnet 4.6	21.4	27.6	60.3 (+32.7)	.36	.40	.61 (+.21)	7.7	11.0	36.8 (+25.8)	.11	.20	.89 (+.69)
Open	Gemma4-31B	17.6	23.4	56.6 (+33.2)	.33	.37	.66 (+.29)	5.7	8.7	34.5 (+25.8)	.03	.12	.86 (+.74)
Open	Qwen3.6-35B	16.7	22.3	53.1 (+30.8)	.32	.36	.60 (+.24)	5.4	8.2	31.6 (+23.4)	.16	.23	.84 (+.61)
Open	LLaVA-OV-2-8B	14.2	19.5	47.9 (+28.4)	.30	.33	.56 (+.23)	4.3	6.5	26.9 (+20.4)	.18	.25	.82 (+.57)
Open	InternVL3.5-38B	15.5	21.0	50.1 (+29.1)	.31	.35	.58 (+.23)	4.8	7.4	28.9 (+21.5)	.04	.13	.81 (+.68)

Human ceiling: Desc. F₁ 75.5, mIoU 0.71, F₁×IoU 47.1, Clean 0.94.
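For readers unfamiliar with the mIoU column, the sketch below shows the standard intersection-over-union of two time intervals, assuming mIoU is the mean of this quantity over matched predicted and annotated glitch spans. How RoboGaze matches predictions to annotations, and how F₁×IoU combines the two scores, is not specified on this page, so neither is modeled here. The function name `temporal_iou` is ours.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two time intervals given as (start_s, end_s)."""
    (p_start, p_end), (g_start, g_end) = pred, gt
    if p_end < p_start or g_end < g_start:
        raise ValueError("interval end must not precede its start")
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


# Predicted span 2-6 s vs. annotated span 4-8 s: 2 s overlap over a 6 s union.
partial = temporal_iou((2.0, 6.0), (4.0, 8.0))
print(round(partial, 3))  # 0.333

# Disjoint spans score 0; identical spans score 1.
disjoint = temporal_iou((0.0, 1.0), (2.0, 3.0))
exact = temporal_iou((1.0, 3.0), (1.0, 3.0))
```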

Qualitative Analysis

Where verification changes the verdict.

Representative cases: (a) the verifier rescues a true glitch a vanilla judge mislabels, (b) a vanilla false positive is rejected, and (c) a failure both methods detect but localize differently.

Qualitative comparison of RoboGaze and baseline judges

Vanilla baselines vs. RoboGaze. By contesting specialist hypotheses before they enter the report, RoboGaze rejects false positives and recovers true glitches with sharper temporal localization.

Citation

BibTeX

@article{nguyen2026robogaze,
  title   = {RoboGaze: Evaluating Robot World Models via
             Structured Vision-Language Analysis},
  author  = {Nguyen, Minh-Loi and Diep, Nghiem Tuong and Nguyen, Hung Khang and
             Le, Minh and Le Thien, Doanh and Tran, Hoang H. and Le, Dung Duy and
             Duong, Vu and Sonntag, Daniel and Le, An Thai and Nguyen, Duy M. H. and
             Vien, Ngo Anh and Nhiem, Tran Van},
  journal = {arXiv preprint},
  year    = {2026}
}