WorldArena

A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

Yu Shang1,§,*, Zhuohang Li1,*, Yiding Ma1,*, Weikang Su1,*, Xin Jin, Ziyou Wang1,‡, Lei Jin1, Xin Zhang1, Yinzhou Tang1, Haisheng Su2, Chen Gao1,
Wei Wu1, Xihui Liu3, Dhruv Shah4, Zhaoxiang Zhang5, Zhibo Chen6, Jun Zhu1, Yonghong Tian7, Tat-Seng Chua8, Wenwu Zhu1, Yong Li1,†

1Tsinghua    2SJTU    3HKU    4Princeton    5CAS    6USTC    7PKU    8NUS

* Equal contribution    ‡ Equal contribution    § Project lead    † Corresponding author

Overview

WorldArena is a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with sixteen metrics across six sub-dimensions; embodied task functionality, which evaluates world models as synthetic data engines, policy evaluators, and action planners; and human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. This work provides a framework for tracking progress toward truly functional world models in embodied AI.

Comparison with Existing Benchmarks

Benchmarks are compared on the following criteria:

  • Video Quality: Visual Quality, Motion Quality, Content Consistency, Physics Adherence, Controllability, 3D Accuracy
  • Embodied Tasks: Data Engine, Policy Evaluator, Action Planner
  • Human Evaluation

Benchmarks compared: WorldModelBench, WorldSimBench, WorldScore, 4DWorldBench, EWMBench, WorldEval, World-in-World, WoW-World-Eval, WorldArena

Table 1: Comprehensive comparison of WorldArena with existing world model benchmarks.

Evaluation Framework

WorldArena provides a comprehensive evaluation framework that integrates both video quality and functional utility.

Video Quality Evaluation

WorldArena measures multi-faceted video quality with 16 numerical metrics across 6 key sub-dimensions, namely visual quality, motion quality, content consistency, physics adherence, 3D accuracy, and controllability:

Figure 1: Six dimensions of perceptual evaluation with 16 specific metrics.

Visual Quality

  • Image Quality
  • Aesthetic Quality
  • JEPA Similarity

Motion Quality

  • Dynamic Degree
  • Flow Score
  • Motion Smoothness

Content Consistency

  • Subject Consistency
  • Background Consistency
  • Photometric Consistency

Physics Adherence

  • Interaction Quality
  • Trajectory Accuracy

3D Accuracy

  • Depth Accuracy
  • Perspectivity

Controllability

  • Instruction Following
  • Semantic Alignment
  • Action Following

EWMScore is computed as the arithmetic mean of the 16 normalized base metrics, each linearly scaled to the range [0, 100]. It provides a unified evaluation measure, where higher scores indicate stronger overall performance.
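As a minimal sketch of the aggregation described above, the snippet below scales raw metric values linearly to [0, 100] and averages them. The function names, the per-metric bounds `lo` and `hi`, the clamping of out-of-range values, and the `higher_is_better` flag are assumptions made for illustration; the benchmark's actual normalization bounds are not stated here.

```python
from statistics import fmean


def scale_to_100(value, lo, hi, higher_is_better=True):
    """Linearly map value from [lo, hi] to [0, 100].

    Assumption: out-of-range inputs are clamped, and metrics where lower
    is better are flipped so that 100 always means "best".
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    frac = (value - lo) / (hi - lo)
    frac = min(1.0, max(0.0, frac))
    if not higher_is_better:
        frac = 1.0 - frac
    return 100.0 * frac


def ewm_score(normalized_metrics):
    """Arithmetic mean of metrics that are already scaled to [0, 100]."""
    values = list(normalized_metrics.values())
    if not values:
        raise ValueError("no metrics given")
    return fmean(values)


# Toy usage with two hypothetical metrics (the benchmark uses 16).
toy_metrics = {
    "image_quality": scale_to_100(0.5, 0.0, 1.0),
    "trajectory_error": scale_to_100(0.25, 0.0, 1.0, higher_is_better=False),
}
toy_score = ewm_score(toy_metrics)
```

Because every metric is on the same scale before averaging, each of the base metrics contributes equally to the final index.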

Embodied Task Functionality

WorldArena evaluates model performance in three core downstream tasks: Data Engine, Policy Evaluator, and Action Planner.

Figure 2: Embodied Task Functionality Evaluation Framework.

Embodied Data Engine

World models can generate future observations from external instructions, which makes them a source of synthetic training data for downstream embodied policy models and a way to alleviate the scarcity of real-world data. In this part, we treat world models as embodied data synthesis engines and evaluate them by the gain they provide to policy models. We employ a two-phase training procedure. In the first phase, we fine-tune the world model on the RoboTwin 2.0 dataset and generate synthetic videos conditioned on the first frame and external instructions. In the second phase, we freeze the world model's weights and integrate an inverse dynamics model (IDM) to extract actions from video features. This process produces paired video-action sequences. We then evaluate the impact of world model–generated synthetic data by training a baseline π0.5 policy model with varying amounts of synthetic data. The performance gain of the policy model reflects the world model's capability to enhance policy learning.
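The pairing and gain measurement described above can be sketched roughly as follows. This is an illustrative toy, not the benchmark's pipeline: `pair_frames_with_actions` and `policy_gain` are hypothetical names, the IDM is assumed to be a callable over two consecutive frames (the text only says it extracts actions from video features), and the gain is assumed to be an absolute difference in success rate.

```python
def pair_frames_with_actions(frames, idm):
    """Pair each frame with the action the IDM infers for the step to the next frame.

    Assumption: idm(current_frame, next_frame) returns one action.
    """
    return [(cur, idm(cur, nxt)) for cur, nxt in zip(frames, frames[1:])]


def policy_gain(baseline_success, success_by_synthetic_amount):
    """Success-rate gain over the baseline for each amount of synthetic data."""
    return {
        amount: rate - baseline_success
        for amount, rate in success_by_synthetic_amount.items()
    }


# Toy usage: frames are integers and the stand-in "IDM" is their difference.
toy_frames = [0, 1, 3, 6]
toy_pairs = pair_frames_with_actions(toy_frames, lambda cur, nxt: nxt - cur)

# Hypothetical success rates: baseline vs. policies trained with 100 / 500 synthetic episodes.
toy_gain = policy_gain(0.5, {100: 0.75, 500: 0.5})
```

A positive entry in `toy_gain` would indicate that adding that amount of synthetic data helped the policy; a zero or negative entry would indicate no benefit.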

Embodied Policy Evaluator

We assess the capability of world models as environment proxies for evaluating policy performance. We train a series of π0.5 policy models with varying capabilities on the RoboTwin 2.0 dataset. Each policy is evaluated by interacting with an action-controllable world model, which generates observation videos through a rollout that continues until its length exceeds that of the corresponding ground-truth video by 20% in frames. Task success is judged by a VLM, which determines whether the embodied task was executed successfully. The success rate obtained from the world model is then compared with that obtained from the RoboTwin simulator. A high correlation between the two suggests effective simulation of real-world dynamics, while a low correlation indicates a mismatch in environmental transition simulation.
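A small sketch of the two quantities involved here: the rollout stopping rule and the agreement between the two sets of success rates. The function names are hypothetical, and the Pearson coefficient is an assumption, since the text does not say which correlation measure is used.

```python
import math


def rollout_should_stop(generated_frames, ground_truth_frames, margin_percent=20):
    """True once the rollout exceeds the ground-truth length by margin_percent.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return generated_frames * 100 > ground_truth_frames * (100 + margin_percent)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy usage: hypothetical success rates of three policies of increasing strength.
world_model_rates = [0.2, 0.4, 0.6]
simulator_rates = [0.1, 0.3, 0.5]
toy_corr = pearson(world_model_rates, simulator_rates)
```

In this toy case the world model ranks and spaces the policies exactly as the simulator does, so the coefficient is close to 1. A rank-based measure such as Spearman would be a reasonable alternative when only the ordering of policies matters.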

Embodied Action Planner

By predicting future state transitions, world models can function as the action-planning "brain" of an embodied agent. In this part, we investigate the ability of world models to execute embodied tasks in a closed-loop manner. Similar to the data synthesis engine setup, we pair the world model with an inverse dynamics model, where the world model takes textual instructions and the initial frame as input and outputs the corresponding action sequence for future operations. This sequence is then executed in the RoboTwin simulator, and the task success rate is measured to evaluate the world model's performance in closed-loop action execution.
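One way to picture this closed-loop setup is the toy loop below. Everything in it is a stand-in: `run_episode`, `ToyEnv`, the replanning budget `max_replans`, and the callable interfaces for the world model and the IDM are assumptions for illustration, and the real evaluation executes actions in the RoboTwin simulator rather than a toy environment.

```python
def run_episode(world_model, idm, env, instruction, max_replans=10):
    """Plan with the world model, decode actions with the IDM, execute, and replan.

    Assumptions: world_model(instruction, obs) returns predicted future frames,
    idm(frame_a, frame_b) returns the action between two frames, and
    env.step(action) returns (observation, done).
    """
    obs = env.reset()
    for _ in range(max_replans):
        frames = [obs] + world_model(instruction, obs)
        actions = [idm(a, b) for a, b in zip(frames, frames[1:])]
        for action in actions:
            obs, done = env.step(action)
            if done:
                return True
    return False


class ToyEnv:
    """1-D stand-in for a simulator: the task succeeds when state reaches goal."""

    def __init__(self, goal):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action
        return self.state, self.state >= self.goal


def toy_world_model(instruction, obs):
    """Predict the next two 'frames' by moving one unit per step."""
    return [obs + 1, obs + 2]


def toy_idm(frame_a, frame_b):
    """Recover the action as the difference between consecutive frames."""
    return frame_b - frame_a


# Task success rate over a few toy episodes with different goals.
outcomes = [run_episode(toy_world_model, toy_idm, ToyEnv(g), "reach goal") for g in (3, 5, 50)]
toy_success_rate = sum(outcomes) / len(outcomes)
```

Replanning from the latest observation is what makes the loop closed: prediction errors in one plan can be corrected by the next one instead of accumulating over the whole episode.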

Leaderboard

Visualization Examples

Good Case 1
Good Case 2