Solaris: Building a Multiplayer Video World Model in Minecraft
A video world model that simulates multiplayer Minecraft in real time.
TL;DR
Solaris is the first video world model that generates consistent first-person observations for multiple players simultaneously in a shared environment. Built on Minecraft, it introduces a custom data pipeline (SolarisEngine), a staged training approach combining a Diffusion Transformer with Self Forcing, and a new memory-efficient variant called Checkpointed Self Forcing. It produces coherent 224-frame multiplayer videos where actions from one player correctly affect another player's perspective.
Problem
Existing action-conditioned video generation models (world models) are limited to single-agent perspectives. They can't capture multi-agent interactions — when one player builds a structure, another player's camera should see it appear. This is fundamental for modeling real-world environments where multiple actors coexist and affect each other. Prior multiplayer approaches like Enigma Multiverse just concatenate frames, which leads to action hallucinations.
Method
- SolarisEngine — Custom multiplayer data collection system built on Mineflayer + official Minecraft clients. Captures synchronized video + action streams across multiple agents. Collected 12.64 million multiplayer frames.
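A minimal sketch of what one synchronized training sample could look like, assuming records are keyed on the shared game tick. Every class and field name here (`PlayerStep`, `MultiplayerTick`, etc.) is a hypothetical illustration; the summary doesn't specify SolarisEngine's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PlayerStep:
    frame: bytes          # encoded first-person RGB frame for this player
    keys: List[str]       # keys held during the tick, e.g. ["w", "space"]
    yaw: float            # camera yaw in degrees
    pitch: float          # camera pitch in degrees

@dataclass
class MultiplayerTick:
    tick: int                  # shared Minecraft game tick: the sync key
    players: List[PlayerStep]  # one entry per agent, same ordering every tick

def is_synchronized(clip: List[MultiplayerTick], num_players: int) -> bool:
    """Sanity check: every tick carries a frame + action pair for every player."""
    return all(len(t.players) == num_players for t in clip)
```

The important property is that all players' frames and actions share a single time axis, so the model can learn cross-player causality rather than having to align the streams itself.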
- Architecture — Adapts a pre-trained video Diffusion Transformer (DiT) with minimal modifications:
  - Expanded action space for multiplayer input
  - Multiplayer self-attention layers for information exchange between player viewpoints (minimal sketch after this list)
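One way the cross-view exchange could work, as a hedged PyTorch sketch: fold time and space into the batch and run attention over the player axis only, so each token attends to its counterparts in the other players' views. The tensor layout and the layer's placement inside the DiT are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class MultiplayerSelfAttention(nn.Module):
    """Attend across player viewpoints at each timestep.

    Assumes latent video tokens shaped (B, P, T, N, D):
    batch, players, time, spatial patches, channels.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, P, T, N, D = x.shape
        # Fold time and space into the batch so the attention sequence is
        # just the player axis: cheap, and it cannot leak across timesteps.
        x = x.permute(0, 2, 3, 1, 4).reshape(B * T * N, P, D)
        out, _ = self.attn(x, x, x)
        return out.reshape(B, T, N, P, D).permute(0, 3, 1, 2, 4)

# 2 clips, 2 players, 8 latent frames, 64 patches, 256 channels:
# y = MultiplayerSelfAttention(256)(torch.randn(2, 2, 8, 64, 256))
```

Restricting the attention sequence to the player axis would keep the added cost tiny relative to the DiT's own spatiotemporal attention, which fits the "minimal modifications" framing.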
- Staged Training Pipeline (minimal sketch after this list):
  - Stage 1: Bidirectional modeling (learning frame distributions)
  - Stage 2: Causal modeling (learning temporal order)
  - Stage 3: Self Forcing (autoregressive generation without the teacher-forcing gap)
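The gist of the three stages, under simplifying assumptions: Stages 1 and 2 differ mainly in the temporal attention mask, and Stage 3 rolls the model out on its own generations so that training conditions match inference. `model.generate_frame` and `loss_fn` are hypothetical stand-ins; the paper's actual objective isn't spelled out in this summary:

```python
import torch

def temporal_mask(T: int, causal: bool) -> torch.Tensor:
    """Stage 1 vs. Stage 2: bidirectional attention sees every frame,
    causal attention only sees the past. True entries are masked out."""
    if not causal:
        return torch.zeros(T, T, dtype=torch.bool)      # Stage 1
    return torch.ones(T, T, dtype=torch.bool).triu(1)   # Stage 2

def self_forcing_step(model, actions, horizon: int, loss_fn):
    """Stage 3: condition each step on self-generated history, not on
    ground-truth frames, closing the teacher-forcing gap."""
    frames, context = [], None
    for t in range(horizon):
        frame, context = model.generate_frame(context, actions[t])
        frames.append(frame)
    loss = loss_fn(torch.stack(frames))
    loss.backward()   # gradients flow back through the whole rollout
    return loss
```

Backpropagating through the whole rollout is exactly what makes Self Forcing memory-hungry, which motivates the next bullet.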
- Checkpointed Self Forcing — Memory-efficient Self Forcing variant that enables a longer-horizon teacher during sliding-window generation. Critical for stable long-horizon autoregressive video.
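One plausible reading, assuming "checkpointed" means gradient checkpointing: recompute each rollout step's activations in the backward pass instead of storing them, trading compute for memory so the horizon can grow. `model.generate_frame` is the same hypothetical interface as above, and the paper's exact mechanism may differ:

```python
import torch
from torch.utils.checkpoint import checkpoint

def checkpointed_rollout(model, actions, horizon: int) -> torch.Tensor:
    """Self-forcing rollout with per-step gradient checkpointing:
    activation memory stays roughly constant in the horizon, at the
    cost of one extra forward pass per step during backward."""
    frames, context = [], None
    for t in range(horizon):
        frame, context = checkpoint(
            model.generate_frame, context, actions[t], use_reentrant=False
        )
        frames.append(frame)
    return torch.stack(frames)
```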
- Evaluation Framework — Five dimensions: movement, memory, grounding, building, and view consistency.
Results
| Task | Solaris | Frame Concat | No Pretrain |
|------|---------|--------------|-------------|
| Grounding | 62.5% | 53.1% | 29.2% |
| Consistency | 71.4% | 49.5% | — |
| Building | High | Low | 0.0% |
| Memory | High | — | 18.8% |
- Generates stable video up to 224 frames (11.2 seconds)
- Single-player pretraining is essential — without it, performance collapses
- Frame concatenation is competitive only on simple movement but hallucinates on no-op actions
- Lower FID scores than baselines across most categories (lower FID means better visual quality)
- System and models are open-sourced
Key Concepts
- World Model — generative model that simulates environment dynamics
- Diffusion Transformer — transformer-based architecture for diffusion-based generation
- Self Forcing — training technique closing the teacher-forcing gap in autoregressive generation
- Multi-Agent Simulation — systems modeling interactions between multiple autonomous agents
My Takeaways
- The staged training pipeline (bidirectional → causal → self forcing) is elegant — each stage builds on the previous one's strengths
- Multiplayer consistency is the hard part: not just generating plausible video, but ensuring causal coherence across viewpoints
- Interesting connection to RE-TRAC-TrajectoryCompression: both papers deal with maintaining coherent state across long horizons, but in very different domains (video vs. search)
- The data pipeline (SolarisEngine) might be as important as the model itself — high-quality synchronized data is the bottleneck
Discussion & Reception
- Broad attention: 58-61 resources tracked on alphaXiv
- Multiple YouTube breakdowns and blog posts
- Coverage in Chinese AI media (ai-bot.cn, QQ News)
- From NYU (Saining Xie's group)
See Also
- RE-TRAC-TrajectoryCompression — related paper on maintaining coherent state across trajectories
- Direct Preference Optimization — another ML technique in the vault