Matrix-Game: Interactive World Foundation Model

Yifan Zhang*  Chunli Peng*  Boyang Wang*†  Puyi Wang  Qingcheng Zhu
Zedong Gao  Eric Li  Yang Liu  Yahui Zhou

Skywork AI

Technical Report | GitHub | Benchmark | 🤗 HuggingFace | 🚀 Dataset (coming soon)

Abstract

We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained with a two-stage pipeline: large-scale unlabeled pretraining for environment understanding, followed by action-labeled fine-tuning for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion frames, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements while maintaining high visual quality and temporal coherence. To evaluate performance, we propose GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and 3D consistency. Human evaluations further confirm these findings, highlighting the model’s ability to produce physically grounded and perceptually realistic interactive videos in diverse scenarios.

Model Overview

Matrix-Game adopts an image-to-world generation paradigm, using a single reference image as the primary prior for world understanding and video generation. To enable long-duration video generation, Matrix-Game uses an autoregressive strategy that preserves local temporal consistency across segments, enabling the model to maintain coherent dynamics over extended time horizons.
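The segment-wise autoregressive idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `rollout`, `generate_segment`, and `toy_generator`, the segment length, the number of motion frames, and the choice to keep the reference image fixed across segments are not taken from the Matrix-Game code. The toy generator stands in for the real model so the loop can be run end to end.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # stand-in for an image; a real frame would be a tensor


def rollout(
    reference: Frame,
    actions: Sequence[str],
    generate_segment: Callable[[Frame, List[Frame], Sequence[str]], List[Frame]],
    segment_len: int = 4,        # assumed value, for illustration only
    num_motion_frames: int = 2,  # assumed value, for illustration only
) -> List[Frame]:
    """Generate a long video one segment at a time.

    Each segment is conditioned on the reference image, the last few frames
    generated so far (the motion frames), and that segment's actions.
    """
    video: List[Frame] = []
    for start in range(0, len(actions), segment_len):
        segment_actions = actions[start:start + segment_len]
        motion_frames = video[-num_motion_frames:]  # empty for the first segment
        video.extend(generate_segment(reference, motion_frames, segment_actions))
    return video


def toy_generator(reference, motion_frames, segment_actions):
    """Hypothetical stand-in for the model: adds 1 to the last frame per action."""
    last = motion_frames[-1] if motion_frames else reference
    out = []
    for _ in segment_actions:
        last = [x + 1.0 for x in last]
        out.append(last)
    return out


video = rollout([0.0], ["forward"] * 10, toy_generator)
print(len(video), video[-1])  # 10 [10.0]
```

Because each segment starts from the tail of the previous one, the toy sequence continues smoothly across segment boundaries, which is the local temporal consistency the paragraph describes.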

GameWorld Score Benchmark

To facilitate the evaluation and comparison of Minecraft world models, GameWorld Score is introduced as a unified benchmark that assesses not only the perceptual quality of generated videos but also their controllability and physical plausibility. The benchmark decomposes world model performance into eight distinct dimensions, each capturing a specific aspect of video generation.

Metric Descriptions:

  • Image Quality / Aesthetic Quality: Visual fidelity and perceptual appeal of generated frames
  • Temporal Consistency / Motion Smoothness: Temporal coherence and smoothness between frames
  • Keyboard Accuracy / Mouse Accuracy: Accuracy in following user control signals
  • 3D Consistency: Geometric stability and physical plausibility over time
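As a rough illustration of the controllability metrics, the sketch below scores how often the action observed in a generated video matches the instructed action. It assumes the observed actions have already been recovered per frame by some external estimator; that estimator, the function name `action_accuracy`, and the exact-match scoring rule are assumptions for illustration and may differ from the benchmark's actual definition.

```python
from typing import Sequence


def action_accuracy(instructed: Sequence[str], inferred: Sequence[str]) -> float:
    """Fraction of frames whose inferred action equals the instructed action."""
    if len(instructed) != len(inferred):
        raise ValueError("sequences must have the same length")
    if not instructed:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(instructed, inferred))
    return matches / len(instructed)


# Made-up example sequences, not benchmark data.
instructed = ["forward", "forward", "jump", "left"]
inferred = ["forward", "forward", "jump", "right"]
accuracy = action_accuracy(instructed, inferred)
print(accuracy)  # 0.75
```

The same function could be applied to discretized camera directions (for example "up", "down", "left", "right") to obtain a mouse-accuracy analogue.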

Performance Comparison

1. GameWorld Score Benchmark Comparison

Matrix-Game achieves the best score on six of the seven reported GameWorld Score metrics and ties with Oasis and MineWorld on Motion Smoothness (0.98).

Model      Image Quality ↑  Aesthetic Quality ↑  Temporal Cons. ↑  Motion Smooth. ↑  Keyboard Acc. ↑  Mouse Acc. ↑  3D Cons. ↑
Oasis      0.65             0.48                 0.94              0.98              0.77             0.56          0.56
MineWorld  0.69             0.47                 0.95              0.98              0.86             0.64          0.51
Ours       0.72             0.49                 0.97              0.98              0.95             0.95          0.76
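To make the gaps in the table easier to read, the short script below computes, for each metric, the margin between Matrix-Game ("Ours") and the stronger of the two baselines. The scores are copied from the table above; the variable names and the "margin over best baseline" summary are only an illustrative way of reading the table, not part of the benchmark.

```python
METRICS = [
    "Image Quality", "Aesthetic Quality", "Temporal Cons.", "Motion Smooth.",
    "Keyboard Acc.", "Mouse Acc.", "3D Cons.",
]

# Scores copied from the comparison table (higher is better).
SCORES = {
    "Oasis":     [0.65, 0.48, 0.94, 0.98, 0.77, 0.56, 0.56],
    "MineWorld": [0.69, 0.47, 0.95, 0.98, 0.86, 0.64, 0.51],
    "Ours":      [0.72, 0.49, 0.97, 0.98, 0.95, 0.95, 0.76],
}

margins = {}
for i, metric in enumerate(METRICS):
    best_baseline = max(SCORES["Oasis"][i], SCORES["MineWorld"][i])
    # Round to two decimals to hide floating-point noise in the subtraction.
    margins[metric] = round(SCORES["Ours"][i] - best_baseline, 2)

for metric, margin in margins.items():
    print(f"{metric:<18} +{margin:.2f}")
```

The largest margins fall on Mouse Acc. (+0.31), 3D Cons. (+0.20), and Keyboard Acc. (+0.09), while Motion Smooth. is a three-way tie (+0.00).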

2. Human Evaluation

A double-blind human evaluation demonstrates that Matrix-Game significantly outperforms Oasis and MineWorld in terms of overall quality, controllability, visual quality, and temporal consistency.

Double-blind human evaluation by two independent groups across four key dimensions: Overall Quality, Controllability, Visual Quality, and Temporal Consistency.
Scores represent the percentage of pairwise comparisons in which each method was preferred. Matrix-Game consistently outperforms prior models across all metrics and both groups.
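A preference percentage of this kind can be computed from raw pairwise votes as sketched below. The normalization used here (wins divided by the number of comparisons a method took part in) is an assumption, since the page does not spell out the exact formula; the method names and votes are placeholders, not the study's data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Each comparison is (method_x, method_y, winner); winner must be one of the two.
Comparison = Tuple[str, str, str]


def preference_rates(comparisons: Iterable[Comparison]) -> Dict[str, float]:
    """Percentage of a method's pairwise comparisons in which it was preferred."""
    wins: Counter = Counter()
    appearances: Counter = Counter()
    for x, y, winner in comparisons:
        if winner not in (x, y):
            raise ValueError(f"winner {winner!r} is not one of {x!r}, {y!r}")
        appearances[x] += 1
        appearances[y] += 1
        wins[winner] += 1
    return {m: 100.0 * wins[m] / n for m, n in appearances.items()}


# Placeholder votes for three hypothetical methods.
votes = [
    ("method_a", "method_b", "method_a"),
    ("method_a", "method_c", "method_a"),
    ("method_b", "method_c", "method_c"),
    ("method_a", "method_b", "method_b"),
]
rates = preference_rates(votes)
for method, rate in sorted(rates.items()):
    print(f"{method}: {rate:.1f}%")
```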

Generation across Diverse Minecraft Scenarios

Matrix-Game demonstrates strong generalization across eight diverse Minecraft environments with varying terrain and interaction dynamics.

Generation with Keyboard Control

Matrix-Game can generate high-quality videos that precisely follow simple keyboard instructions—including directional movements (forward, backward, left, right), jump, and attack.

Generation with Mouse Control

Matrix-Game can generate high-quality videos with fine-grained camera viewpoint control, accurately following directional inputs—including upward, downward, leftward, rightward, and diagonal perspective shifts.

Generation with Compound Actions

Matrix-Game is capable of generating high-quality videos that accurately follow complex action instructions.

Generation with Dynamic Actions

Matrix-Game can handle dynamically changing action instructions during a single video generation process.
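One simple way to picture dynamically changing instructions is a schedule of action changes that is expanded into one action per frame. The sketch below is purely illustrative: the `Action` type, the key names, the `(dx, dy)` mouse convention, and `expand_schedule` are hypothetical and do not describe the model's actual input format.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Action:
    keys: FrozenSet[str] = frozenset()       # keys held during the frame
    mouse: Tuple[float, float] = (0.0, 0.0)  # assumed (dx, dy) camera movement


def expand_schedule(
    schedule: Sequence[Tuple[int, Action]], num_frames: int
) -> List[Action]:
    """Expand (start_frame, action) entries into one action per frame.

    The schedule must be sorted by start frame and begin at frame 0; each
    action stays active until the next entry takes over.
    """
    if not schedule or schedule[0][0] != 0:
        raise ValueError("schedule must start at frame 0")
    per_frame: List[Action] = []
    idx = 0
    for frame in range(num_frames):
        while idx + 1 < len(schedule) and schedule[idx + 1][0] <= frame:
            idx += 1
        per_frame.append(schedule[idx][1])
    return per_frame


schedule = [
    (0, Action(keys=frozenset({"forward"}))),
    (3, Action(keys=frozenset({"forward", "jump"}))),  # compound keyboard action
    (5, Action(mouse=(10.0, 0.0))),                    # camera turn only
]
frames = expand_schedule(schedule, 8)
print([sorted(a.keys) for a in frames])
```

A per-frame list like this can express a single action, a compound action (several keys, or keys plus mouse movement), and mid-video changes with the same structure.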

Long Video Generation

Matrix-Game demonstrates strong auto-regressive generation capabilities for producing long videos.

More Game Scenarios beyond Minecraft

Our model demonstrates good potential to generalize to a broader set of game scenarios built with Unreal Engine.

Acknowledgement

We would like to express our gratitude to:

We are grateful to the broader research community for their open exploration and contributions to the field of interactive world generation.

Citation

@article{zhang2025matrixgame,
  title     = {Matrix-Game: Interactive World Foundation Model},
  author    = {Yifan Zhang and Chunli Peng and Boyang Wang and Puyi Wang and Qingcheng Zhu and Zedong Gao and Eric Li and Yang Liu and Yahui Zhou},
  journal   = {arXiv},
  year      = {2025}
}