Solaris: Building a Multiplayer Video World Model in Minecraft
A video world model that simulates multiplayer Minecraft in real time.
TL;DR
Solaris is the first video world model that generates consistent first-person observations for multiple players simultaneously in a shared environment. Built on Minecraft, it introduces a custom data pipeline (SolarisEngine), a staged training approach combining a Diffusion Transformer with Self Forcing, and a new memory-efficient variant called Checkpointed Self Forcing. It produces coherent 224-frame multiplayer videos where actions from one player correctly affect another player's perspective.
Problem
Existing action-conditioned video generation models (world models) are limited to single-agent perspectives. They can't capture multi-agent interactions — when one player builds a structure, another player's camera should see it appear. This is fundamental for modeling real-world environments where multiple actors coexist and affect each other. Prior multiplayer approaches like Enigma Multiverse just concatenate frames, which leads to action hallucinations.
Method
- SolarisEngine — Custom multiplayer data collection system built on Mineflayer + official Minecraft clients. Captures synchronized video + action streams across multiple agents. Collected 12.64 million multiplayer frames.
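A minimal sketch of what one synchronized training sample could look like, assuming records are keyed on the shared game tick. Every class and field name here (`PlayerStep`, `MultiplayerTick`, etc.) is a hypothetical illustration; the summary doesn't specify SolarisEngine's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PlayerStep:
    frame: bytes          # encoded first-person RGB frame for this player
    keys: List[str]       # keys held during the tick, e.g. ["w", "space"]
    yaw: float            # camera yaw in degrees
    pitch: float          # camera pitch in degrees

@dataclass
class MultiplayerTick:
    tick: int                  # shared Minecraft game tick: the sync key
    players: List[PlayerStep]  # one entry per agent, same ordering every tick

def is_synchronized(clip: List[MultiplayerTick], num_players: int) -> bool:
    """Sanity check: every tick carries a frame + action pair for every player."""
    return all(len(t.players) == num_players for t in clip)
```

The important property is that all players' frames and actions share a single time axis, so the model can learn cross-player causality rather than having to align the streams itself.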
- Architecture — Adapts a pre-trained video Diffusion Transformer (DiT) with minimal modifications:
  - Expanded action space for multiplayer input
  - Multiplayer self-attention layers for information exchange between player viewpoints (minimal sketch after this list)
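One way the cross-view exchange could work, as a hedged PyTorch sketch: fold time and space into the batch and run attention over the player axis only, so each token attends to its counterparts in the other players' views. The tensor layout and the layer's placement inside the DiT are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class MultiplayerSelfAttention(nn.Module):
    """Attend across player viewpoints at each timestep.

    Assumes latent video tokens shaped (B, P, T, N, D):
    batch, players, time, spatial patches, channels.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, P, T, N, D = x.shape
        # Fold time and space into the batch so the attention sequence is
        # just the player axis: cheap, and it cannot leak across timesteps.
        x = x.permute(0, 2, 3, 1, 4).reshape(B * T * N, P, D)
        out, _ = self.attn(x, x, x)
        return out.reshape(B, T, N, P, D).permute(0, 3, 1, 2, 4)

# 2 clips, 2 players, 8 latent frames, 64 patches, 256 channels:
# y = MultiplayerSelfAttention(256)(torch.randn(2, 2, 8, 64, 256))
```

Restricting the attention sequence to the player axis would keep the added cost tiny relative to the DiT's own spatiotemporal attention, which fits the "minimal modifications" framing.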
- Staged Training Pipeline (minimal sketch after this list):
  - Stage 1: Bidirectional modeling (learning frame distributions)
  - Stage 2: Causal modeling (learning temporal order)
  - Stage 3: Self Forcing (autoregressive generation without the teacher-forcing gap)
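The gist of the three stages, under simplifying assumptions: Stages 1 and 2 differ mainly in the temporal attention mask, and Stage 3 rolls the model out on its own generations so that training conditions match inference. `model.generate_frame` and `loss_fn` are hypothetical stand-ins; the paper's actual objective isn't spelled out in this summary:

```python
import torch

def temporal_mask(T: int, causal: bool) -> torch.Tensor:
    """Stage 1 vs. Stage 2: bidirectional attention sees every frame,
    causal attention only sees the past. True entries are masked out."""
    if not causal:
        return torch.zeros(T, T, dtype=torch.bool)      # Stage 1
    return torch.ones(T, T, dtype=torch.bool).triu(1)   # Stage 2

def self_forcing_step(model, actions, horizon: int, loss_fn):
    """Stage 3: condition each step on self-generated history, not on
    ground-truth frames, closing the teacher-forcing gap."""
    frames, context = [], None
    for t in range(horizon):
        frame, context = model.generate_frame(context, actions[t])
        frames.append(frame)
    loss = loss_fn(torch.stack(frames))
    loss.backward()   # gradients flow back through the whole rollout
    return loss
```

Backpropagating through the whole rollout is exactly what makes Self Forcing memory-hungry, which motivates the next bullet.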
- Checkpointed Self Forcing — Memory-efficient Self Forcing variant that enables a longer-horizon teacher during sliding-window generation. Critical for stable long-horizon autoregressive video.
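One plausible reading, assuming "checkpointed" means gradient checkpointing: recompute each rollout step's activations in the backward pass instead of storing them, trading compute for memory so the horizon can grow. `model.generate_frame` is the same hypothetical interface as above, and the paper's exact mechanism may differ:

```python
import torch
from torch.utils.checkpoint import checkpoint

def checkpointed_rollout(model, actions, horizon: int) -> torch.Tensor:
    """Self-forcing rollout with per-step gradient checkpointing:
    activation memory stays roughly constant in the horizon, at the
    cost of one extra forward pass per step during backward."""
    frames, context = [], None
    for t in range(horizon):
        frame, context = checkpoint(
            model.generate_frame, context, actions[t], use_reentrant=False
        )
        frames.append(frame)
    return torch.stack(frames)
```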
- Evaluation Framework — Five dimensions: movement, memory, grounding, building, and view consistency.
Results
| Task | Solaris | Frame Concat | No Pretrain |
|------|---------|--------------|-------------|
| Grounding | 62.5% | 53.1% | 29.2% |
| Consistency | 71.4% | 49.5% | — |
| Building | High | Low | 0.0% |
| Memory | High | — | 18.8% |
- Generates stable video up to 224 frames (11.2 seconds)
- Single-player pretraining is essential — without it, performance collapses
- Frame concatenation is competitive only on simple movement but hallucinates on no-op actions
- Lower FID scores than baselines across most categories (lower FID means better visual quality)
- System and models are open-sourced
Key Concepts
- World Model — generative model that simulates environment dynamics
- Diffusion Transformer — transformer-based architecture for diffusion-based generation
- Self Forcing — training technique closing the teacher-forcing gap in autoregressive generation
- Multi-Agent Simulation — systems modeling interactions between multiple autonomous agents
My Takeaways
- The staged training pipeline (bidirectional → causal → self forcing) is elegant — each stage builds on the previous one's strengths
- Multiplayer consistency is the hard part: not just generating plausible video, but ensuring causal coherence across viewpoints
- Interesting connection to RE-TRAC-TrajectoryCompression: both papers deal with maintaining coherent state across long horizons, but in very different domains (video vs. search)
- The data pipeline (SolarisEngine) might be as important as the model itself — high-quality synchronized data is the bottleneck
Discussion & Reception
- Broad attention: 58-61 resources tracked on alphaXiv
- Multiple YouTube breakdowns and blog posts
- Coverage in Chinese AI media (ai-bot.cn, QQ News)
- From NYU (Saining Xie's group)
See Also
- RE-TRAC-TrajectoryCompression — related paper on maintaining coherent state across trajectories
- Direct Preference Optimization — another ML technique in the vault