WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

Overview

WorldArena is a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with sixteen metrics across six sub-dimensions; embodied task functionality, which evaluates world models as synthetic data engines, policy evaluators, and action planners; and human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. This work provides a framework for tracking progress toward truly functional world models in embodied AI.

Comparison with Existing Benchmarks

Benchmark	Video Quality						Embodied Tasks			Human
Benchmark	Visual Quality	Motion Quality	Content Consist.	Physics Adher.	Controllability	3D Acc.	Data Engine	Policy Eval.	Action Planner	Human
WorldModelBench	✕	✕	✕	✓	✓	✕	✕	✕	✕	✓
WorldSimBench	✓	✓	✓	✕	✓	✕	✕	✕	✓	✓
WorldScore	✓	✓	✓	✕	✓	✓	✕	✕	✕	✓
4DWorldBench	✓	✓	✓	✓	✓	✓	✕	✕	✕	✓
EWMBench	✕	✓	✓	✓	✓	✕	✕	✕	✕	✓
WorldEval	✕	✓	✓	✕	✕	✕	✕	✓	✕	✕
World-in-World	✓	✓	✕	✕	✓	✕	✕	✕	✓	✕
WoW-World-Eval	✓	✓	✓	✓	✓	✓	✕	✕	✓	✓
WorldArena	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓

Table 1: Comprehensive comparison of WorldArena with existing world model benchmarks.

Evaluation Framework

WorldArena provides a comprehensive evaluation framework that integrates both video quality and functional utility.

Video Quality Evaluation

WorldArena measures multi-faceted video quality with 16 numerical metrics across 6 key sub-dimensions, namely visual quality, motion quality, content consistency, physics adherence, 3D accuracy, and controllability:

Figure 1: Six dimensions of perceptual evaluation with 16 specific metrics.

Visual Quality

Image Quality
Aesthetic Quality
JEPA Similarity

Motion Quality

Dynamic Degree
Flow Score
Motion Smoothness

Content Consistency

Subject Consistency
Background Consistency
Photometric Consistency

Physics Adherence

Interaction Quality
Trajectory Accuracy

3D Accuracy

Depth Accuracy
Perspectivity

Controllability

Instruction Following
Semantic Alignment
Action Following

EWMScore is computed as the arithmetic mean of the 16 normalized base metrics, each linearly scaled to the range [0, 100]. It provides a unified evaluation measure, where higher scores indicate stronger overall performance.
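The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's released code: the per-metric bounds passed to `normalize`, the clipping to [0, 100], and the function names are assumptions, since the text only states that each metric is linearly scaled and then averaged.

```python
from statistics import fmean

# The 16 base metrics listed above, grouped by sub-dimension.
METRICS = (
    "Image Quality", "Aesthetic Quality", "JEPA Similarity",
    "Dynamic Degree", "Flow Score", "Motion Smoothness",
    "Subject Consistency", "Background Consistency", "Photometric Consistency",
    "Interaction Quality", "Trajectory Accuracy",
    "Depth Accuracy", "Perspectivity",
    "Instruction Following", "Semantic Alignment", "Action Following",
)


def normalize(value, lo, hi):
    """Linearly scale a raw value to [0, 100].

    lo and hi are assumed per-metric bounds (hypothetical; the source does
    not state them). Values outside the bounds are clipped.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (value - lo) / (hi - lo) * 100.0
    return min(100.0, max(0.0, scaled))


def ewm_score(normalized):
    """Arithmetic mean of the 16 normalized metrics (dict: name -> score)."""
    missing = [name for name in METRICS if name not in normalized]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    return fmean(normalized[name] for name in METRICS)


# Illustrative input only: every metric at the midpoint of its range.
example_scores = {name: normalize(0.5, 0.0, 1.0) for name in METRICS}
example_ewm = ewm_score(example_scores)
```

Because the score is an unweighted mean, every metric contributes equally regardless of which sub-dimension it belongs to, so sub-dimensions with three metrics carry more total weight than those with two.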

Embodied Task Functionality

WorldArena evaluates model performance in three core downstream tasks: Data Engine, Policy Evaluator, and Action Planner.

Figure 2: Embodied Task Functionality Evaluation Framework.

Embodied Data Engine

World models can generate future observations based on external instructions, enabling synthetic data generation to supplement training data for downstream embodied policy models and alleviate the scarcity of real-world data. In this part, we treat world models as embodied data synthesis engines and evaluate them by measuring the gain they provide to policy models. We employ a two-phase training procedure. In the first phase, we fine-tune the world model on the RoboTwin 2.0 dataset and generate synthetic videos conditioned on the first frame and external instructions. In the second phase, we freeze the world model’s weights and integrate an inverse dynamics model (IDM) to extract actions from video features. This process produces paired video-action sequences. We then evaluate the impact of the synthetic data generated by the world model by training a baseline π0.5 policy model with varying amounts of it. The performance gain of the policy model reflects the world model’s capability to enhance policy learning.
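As a rough illustration of the final measurement, the sketch below computes the gain in task success rate for policies trained with different amounts of synthetic data. The function names, the use of success rate as the performance measure for this task, and all outcome values are assumptions made for the example; they are not results from the benchmark.

```python
def success_rate(outcomes):
    """Fraction of successful episodes in a list of booleans."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no episodes to evaluate")
    return sum(outcomes) / len(outcomes)


def gains_by_amount(baseline_outcomes, augmented_outcomes):
    """Gain over the baseline policy for each amount of synthetic data.

    augmented_outcomes maps a (hypothetical) number of synthetic episodes
    to the evaluation outcomes of the policy trained with that amount.
    """
    base = success_rate(baseline_outcomes)
    return {
        amount: success_rate(outcomes) - base
        for amount, outcomes in augmented_outcomes.items()
    }


# Made-up outcomes for illustration only.
baseline = [True, False, False, False]          # 25% success
augmented = {
    100: [True, True, False, False],            # 50% success
    500: [True, True, True, False],             # 75% success
}
gains = gains_by_amount(baseline, augmented)
```

A positive gain that grows with the amount of synthetic data would suggest the generated video-action pairs are useful training signal; a flat or negative gain would suggest the opposite.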

Embodied Policy Evaluator

We assess the capability of world models as environment proxies for evaluating policy performance. We train a series of π0.5 policy models with varying capabilities using the RoboTwin 2.0 dataset. Each policy is evaluated by interacting with an action-controllable world model, which generates observation videos through a rollout that continues until it is more than 20% longer, in frames, than the corresponding ground-truth video. A VLM then judges whether the embodied task was executed successfully. The success rate obtained from the world model's evaluation is compared to that obtained from the RoboTwin simulator. A high correlation between the two suggests that the world model simulates the environment's dynamics faithfully, while a low correlation indicates a mismatch in how it simulates environmental transitions.
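The comparison between the two sets of success rates can be sketched as follows. The text does not say which correlation coefficient is used, so Pearson correlation is an assumption here, and the success rates are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# One entry per policy checkpoint (made-up values).
world_model_success = [0.1, 0.5, 0.9]   # judged by the VLM on world-model rollouts
simulator_success = [0.2, 0.6, 1.0]     # measured in the simulator
r = pearson(world_model_success, simulator_success)
```

If only the ranking of policies matters, a rank-based coefficient such as Spearman's would be a reasonable alternative to the assumption made above.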

Embodied Action Planner

By predicting future state transitions, world models can function as the action-planning "brain" of an embodied agent. In this part, we investigate the ability of world models to execute embodied tasks in a closed-loop manner. Similar to the data synthesis engine setup, we pair the world model with an inverse dynamics model, where the world model takes textual instructions and the initial frame as input and outputs the corresponding action sequence for future operations. This sequence is then executed in the RoboTwin simulator, and the task success rate is measured to evaluate the world model's performance in closed-loop action execution.
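The closed-loop setup can be pictured with the toy sketch below. Every interface in it is hypothetical: `world_model`, `idm`, and `step` stand in for the real world model, inverse dynamics model, and simulator, and the re-planning schedule (`max_rounds`) is an assumption rather than the benchmark's actual protocol.

```python
def run_episode(instruction, first_frame, world_model, idm, step, max_rounds=10):
    """Plan with the world model, act in the simulator, and re-plan.

    world_model(frame, instruction) -> list of predicted future frames
    idm(frames)                     -> list of actions linking consecutive frames
    step(action)                    -> (new_frame, success) from the simulator
    Returns True if the task succeeds within max_rounds planning rounds.
    """
    frame = first_frame
    for _ in range(max_rounds):
        predicted = world_model(frame, instruction)
        actions = idm([frame] + predicted)
        for action in actions:
            frame, success = step(action)
            if success:
                return True
    return False


# Toy stand-ins: the "frame" is a 1-D position and the goal is to reach 3.
sim_state = {"pos": 0}


def toy_world_model(frame, instruction):
    return [frame + 1, frame + 2]          # imagines two steps to the right


def toy_idm(frames):
    return [b - a for a, b in zip(frames, frames[1:])]


def toy_step(action):
    sim_state["pos"] += action
    return sim_state["pos"], sim_state["pos"] >= 3


succeeded = run_episode("move right", 0, toy_world_model, toy_idm, toy_step)
```

Averaging the boolean result over many episodes gives the task success rate that the evaluation reports.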

Leaderboard

Visualization Examples

Good Case 1

Good Case 2