Yifan Zhang*
Chunli Peng*
Boyang Wang*†
Puyi Wang
Qingcheng Zhu
Zedong Gao
Eric Li
Yang Liu
Yahui Zhou
Skywork AI
Technical Report GitHub Benchmark 🤗HuggingFace 🚀Dataset (Coming soon)
Matrix-Game adopts an image-to-world generation paradigm, using a single reference image as the primary prior for world understanding and video generation. For long-duration video generation, Matrix-Game uses an autoregressive strategy that preserves local temporal consistency across segments, so the model maintains coherent dynamics over extended time horizons.
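The segment-wise autoregressive idea can be illustrated with a minimal sketch. It is not the Matrix-Game implementation: `autoregressive_rollout`, `toy_generator`, and the `context_len` parameter are hypothetical names, frames are stand-in integers rather than image tensors, and the assumption that each segment is conditioned on the last few frames of the previous one is only one plausible reading of "local temporal consistency across segments".

```python
from typing import Callable, List, Sequence

Frame = int  # placeholder frame type; a real model would use image tensors


def autoregressive_rollout(
    reference: Frame,
    actions: Sequence[Sequence[str]],
    generate_segment: Callable[[List[Frame], Sequence[str]], List[Frame]],
    context_len: int = 2,
) -> List[Frame]:
    """Chain segments, conditioning each one on the tail of the video so far."""
    if context_len < 1:
        raise ValueError("context_len must be at least 1")
    video: List[Frame] = []
    context: List[Frame] = [reference]  # the first segment sees only the reference image
    for segment_actions in actions:
        segment = generate_segment(context, segment_actions)
        video.extend(segment)
        # Carry the most recent frames forward; keep the old context if nothing was generated
        context = video[-context_len:] or context
    return video


def toy_generator(context: List[Frame], segment_actions: Sequence[str]) -> List[Frame]:
    """Hypothetical stand-in: emits one frame per action, continuing from the last context frame."""
    last = context[-1]
    return [last + i + 1 for i in range(len(segment_actions))]


frames = autoregressive_rollout(0, [["forward"] * 3, ["jump"] * 2], toy_generator)
print(frames)  # [1, 2, 3, 4, 5]
```

Because the second segment starts from the tail of the first, the toy frame sequence continues without a gap, which is the property the autoregressive strategy is meant to preserve at segment boundaries.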
To facilitate the evaluation and comparison of Minecraft world models, GameWorld Score is introduced as a unified benchmark that assesses not only the perceptual quality of generated videos but also their controllability and physical plausibility. The benchmark decomposes world model performance into eight distinct dimensions, each capturing a specific aspect of video generation.
Metric Descriptions:
Matrix-Game achieves the best score on six of the seven GameWorld Score metrics reported below and ties with Oasis and MineWorld on Motion Smoothness.
Model | Image Quality ↑ | Aesthetic Quality ↑ | Temporal Cons. ↑ | Motion Smooth. ↑ | Keyboard Acc. ↑ | Mouse Acc. ↑ | 3D Cons. ↑ |
---|---|---|---|---|---|---|---|
Oasis | 0.65 | 0.48 | 0.94 | 0.98 | 0.77 | 0.56 | 0.56 |
MineWorld | 0.69 | 0.47 | 0.95 | 0.98 | 0.86 | 0.64 | 0.51 |
Ours | 0.72 | 0.49 | 0.97 | 0.98 | 0.95 | 0.95 | 0.76 |
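As a quick check on the table, the short script below finds the leading model for each metric. The numbers are copied from the table above; the `leaders` helper is a hypothetical name introduced here for illustration and is not part of the GameWorld Score tooling.

```python
from typing import Dict, List

METRICS: List[str] = [
    "Image Quality", "Aesthetic Quality", "Temporal Cons.", "Motion Smooth.",
    "Keyboard Acc.", "Mouse Acc.", "3D Cons.",
]

# Scores copied from the table; higher is better for every metric
SCORES: Dict[str, List[float]] = {
    "Oasis":     [0.65, 0.48, 0.94, 0.98, 0.77, 0.56, 0.56],
    "MineWorld": [0.69, 0.47, 0.95, 0.98, 0.86, 0.64, 0.51],
    "Ours":      [0.72, 0.49, 0.97, 0.98, 0.95, 0.95, 0.76],
}


def leaders(metric: str) -> List[str]:
    """Return every model that attains the top score on `metric` (ties included)."""
    idx = METRICS.index(metric)
    best = max(row[idx] for row in SCORES.values())
    return sorted(name for name, row in SCORES.items() if row[idx] == best)


for metric in METRICS:
    print(f"{metric}: {', '.join(leaders(metric))}")
```

The largest margins for Matrix-Game are on the controllability columns: Mouse Acc. rises from 0.64 (MineWorld) to 0.95, and Keyboard Acc. from 0.86 to 0.95.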
A double-blind human evaluation demonstrates that Matrix-Game significantly outperforms Oasis and MineWorld in terms of overall quality, controllability, visual quality, and temporal consistency.
Double-blind human evaluation by two independent groups across four key dimensions: Overall Quality, Controllability, Visual Quality, and Temporal Consistency.
Scores represent the percentage of pairwise comparisons in which each method was preferred. Matrix-Game consistently outperforms prior models across all metrics and both groups.
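A preference percentage of this kind can be computed as in the sketch below. The exact voting protocol is not described here, so the sketch assumes each comparison shows two methods and records one winner; the `preference_rates` function, the method labels, and the vote data are all hypothetical and are not the study's results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Comparison = Tuple[str, str, str]  # (method_a, method_b, winner)


def preference_rates(comparisons: Iterable[Comparison]) -> Dict[str, float]:
    """Percentage of comparisons each method won, out of those it appeared in."""
    wins: Dict[str, int] = defaultdict(int)
    seen: Dict[str, int] = defaultdict(int)
    for method_a, method_b, winner in comparisons:
        if winner not in (method_a, method_b):
            raise ValueError(f"winner {winner!r} is not one of the compared methods")
        seen[method_a] += 1
        seen[method_b] += 1
        wins[winner] += 1
    return {method: 100.0 * wins[method] / seen[method] for method in seen}


# Hypothetical votes for three anonymized methods (double-blind: raters see only labels)
votes = [
    ("A", "B", "A"), ("A", "B", "A"),
    ("A", "C", "A"), ("A", "C", "C"),
    ("B", "C", "B"), ("B", "C", "C"),
]
rates = preference_rates(votes)
print(rates)  # {'A': 75.0, 'B': 25.0, 'C': 50.0}
```

Running the same computation separately for each rater group and each dimension would yield one preference percentage per method, group, and dimension.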
Matrix-Game demonstrates strong generalization across eight diverse Minecraft environments with varying terrain and interaction dynamics.
Matrix-Game can generate high-quality videos that precisely follow simple keyboard instructions—including directional movements (forward, backward, left, right), jump, and attack.
Matrix-Game can generate high-quality videos with fine-grained camera viewpoint control, accurately following directional inputs—including upward, downward, leftward, rightward, and diagonal perspective shifts.
Matrix-Game can generate high-quality videos that accurately follow complex action instructions.
Matrix-Game can handle dynamically changing action instructions during a single video generation process.
Matrix-Game demonstrates strong autoregressive generation capabilities for producing long videos.
Matrix-Game shows potential to generalize to a broader set of game scenarios built with Unreal Engine.
We would like to express our gratitude to:
We are grateful to the broader research community for their open exploration and contributions to the field of interactive world generation.
@article{zhang2025matrixgame,
title = {Matrix-Game: Interactive World Foundation Model},
author = {Yifan Zhang and Chunli Peng and Boyang Wang and Puyi Wang and Qingcheng Zhu and Zedong Gao and Eric Li and Yang Liu and Yahui Zhou},
journal = {arXiv},
year = {2025}
}