
Solaris: Building a Multiplayer Video World Model in Minecraft

A video world model that simulates multiplayer Minecraft in real time.


TL;DR

Solaris is the first video world model that generates consistent first-person observations for multiple players simultaneously in a shared environment. Built on Minecraft, it introduces a custom data pipeline (SolarisEngine), a staged training approach that combines a Diffusion Transformer with Self Forcing, and a new memory-efficient variant called Checkpointed Self Forcing. It produces coherent 224-frame multiplayer videos in which one player's actions correctly affect what the other player sees.

Problem

Existing action-conditioned video generation models (world models) are limited to single-agent perspectives. They can't capture multi-agent interactions: when one player builds a structure, it should appear in the other player's camera view. This matters for modeling real-world environments, where multiple actors coexist and affect each other. Prior multiplayer approaches like Enigma Multiverse just concatenate the players' frames, which leads to action hallucinations.

Method

  1. SolarisEngine — Custom multiplayer data collection system built on Mineflayer + official Minecraft clients. Captures synchronized video + action streams across multiple agents. Collected 12.64 million multiplayer frames.

  2. Architecture — Adapts a pre-trained video Diffusion Transformer (DiT) with minimal modifications:

    • Expanded action space for multiplayer input
    • Multiplayer self-attention layers for information exchange between player viewpoints
  3. Staged Training Pipeline:

    • Stage 1: Bidirectional modeling (learning frame distributions)
    • Stage 2: Causal modeling (learning temporal order)
    • Stage 3: Self Forcing (autoregressive generation without the teacher-forcing gap)
  4. Checkpointed Self Forcing — Memory-efficient variant that enables a longer-horizon teacher during sliding-window generation. Critical for stable long-horizon autoregressive video.

  5. Evaluation Framework — Five dimensions: movement, memory, grounding, building, view consistency.
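To make items 2 and 3 above more concrete, here is a minimal sketch of the attention patterns they imply. It is an illustration under stated assumptions, not the paper's implementation: it treats each frame as a single token, orders tokens player-major, and assumes the multiplayer layers let every player attend to every other player's frames under the same temporal rule. The names `frame_mask` and `multiplayer_mask` are hypothetical.

```python
from typing import List

Mask = List[List[bool]]  # mask[query][key] is True when attention is allowed


def frame_mask(num_frames: int, causal: bool) -> Mask:
    """Temporal mask for one player's frames.

    causal=False mimics Stage 1 (bidirectional: every frame sees every frame).
    causal=True mimics Stage 2 (a frame sees only itself and earlier frames).
    """
    return [[(not causal) or key <= query for key in range(num_frames)]
            for query in range(num_frames)]


def multiplayer_mask(num_players: int, num_frames: int, causal: bool) -> Mask:
    """Joint mask over all players' frames (assumed layout, not from the paper).

    Tokens are ordered player-major: index = player * num_frames + frame.
    Players exchange information freely; only the temporal rule restricts
    which frames are visible.
    """
    temporal = frame_mask(num_frames, causal)
    size = num_players * num_frames
    return [[temporal[query % num_frames][key % num_frames]
             for key in range(size)]
            for query in range(size)]


if __name__ == "__main__":
    # Two players, three frames each, causal setting.
    for row in multiplayer_mask(2, 3, causal=True):
        print("".join("x" if allowed else "." for allowed in row))
```

In the causal setting, player 0's frame 1 can see player 1's frames 0 and 1 but not frame 2, which is the property that lets one player's earlier action show up in the other player's later view.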

Results

| Task | Solaris | Frame Concat | No Pretrain |
|------|---------|-------------|-------------|
| Grounding | 62.5% | 53.1% | 29.2% |
| Consistency | 71.4% | 49.5% | — |
| Building | High | Low | 0.0% |
| Memory | High | — | 18.8% |

  • Generates stable video up to 224 frames (11.2 seconds)
  • Single-player pretraining is essential — without it, performance collapses
  • Frame concatenation is competitive only on simple movement but hallucinates on no-op actions
  • Lower FID scores than baselines across most categories
  • System and models are open-sourced
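
The 224-frame / 11.2-second figure above implies a frame rate, which in turn gives a rough per-frame time budget for real-time generation. The short calculation below is my own arithmetic from those two numbers; the 50 ms budget is an inference, not a latency figure reported in the paper.

```python
from fractions import Fraction

NUM_FRAMES = 224                 # from the results above
DURATION_S = Fraction("11.2")    # from the results above, kept exact

# Implied playback rate in frames per second.
fps = Fraction(NUM_FRAMES) / DURATION_S

# Time available per frame, in milliseconds, if generation keeps pace
# with playback (an assumption about what "real time" means here).
frame_budget_ms = 1000 / fps

print(f"{fps} fps, {frame_budget_ms} ms per frame")
```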

Key Concepts

  • World Model — generative model that simulates environment dynamics
  • Diffusion Transformer — transformer-based architecture for diffusion-based generation
  • Self Forcing — training technique closing the teacher-forcing gap in autoregressive generation
  • Multi-Agent Simulation — systems modeling interactions between multiple autonomous agents
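
The teacher-forcing gap mentioned under Self Forcing is easiest to see in a toy setting. The sketch below is not the Self Forcing training algorithm; it only shows the mismatch that the technique targets. The "frames" are single numbers, and `model_next` is a hypothetical, deliberately imperfect predictor.

```python
from typing import List


def true_next(x: float) -> float:
    """Ground-truth dynamics of the toy world: the value grows by 1 per step."""
    return x + 1.0


def model_next(x: float) -> float:
    """Hypothetical learned predictor that overshoots slightly on every step."""
    return x + 1.125


def teacher_forced_errors(start: float, steps: int) -> List[float]:
    """Error when each prediction is conditioned on the ground-truth history."""
    errors = []
    truth = start
    for _ in range(steps):
        pred = model_next(truth)      # context comes from the real data
        truth = true_next(truth)
        errors.append(abs(pred - truth))
    return errors


def self_rollout_errors(start: float, steps: int) -> List[float]:
    """Error when each prediction is conditioned on the model's own output."""
    errors = []
    truth = start
    pred = start
    for _ in range(steps):
        pred = model_next(pred)       # context comes from the model itself
        truth = true_next(truth)
        errors.append(abs(pred - truth))
    return errors


if __name__ == "__main__":
    print("teacher forced:", teacher_forced_errors(0.0, 4))
    print("self rollout:  ", self_rollout_errors(0.0, 4))
```

Under teacher forcing the per-step error stays constant, while in the self-conditioned rollout it grows with every step. Training on the model's own rollouts, as Self Forcing does, exposes the model to the second situation during training rather than only at inference time.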

My Takeaways

  • The staged training pipeline (bidirectional → causal → self forcing) is elegant — each stage builds on the previous one's strengths
  • Multiplayer consistency is the hard part: not just generating plausible video, but ensuring causal coherence across viewpoints
  • Interesting connection to RE-TRAC-TrajectoryCompression: both papers deal with maintaining coherent state across long horizons, but in very different domains (video vs. search)
  • The data pipeline (SolarisEngine) might be as important as the model itself — high-quality synchronized data is the bottleneck
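
To illustrate what "synchronized" means for the last point, here is a small sketch of aligning per-player recordings by game tick. It is a guess at the general shape of the problem, not SolarisEngine's code: the `Sample` record, the `synchronize` function, and the use of tick numbers as the alignment key are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sample:
    frame: str   # placeholder for image data; a real pipeline would store pixels
    action: str  # e.g. "forward", "place_block", "noop"


def synchronize(streams: Dict[str, Dict[int, Sample]]) -> List[Dict[str, Sample]]:
    """Keep only the ticks recorded for every player, in tick order.

    `streams` maps a player name to that player's {tick: Sample} recording.
    Ticks missing from any player are dropped so each output step holds one
    sample per player.
    """
    if not streams:
        return []
    shared_ticks = set.intersection(*(set(s) for s in streams.values()))
    return [{player: stream[tick] for player, stream in streams.items()}
            for tick in sorted(shared_ticks)]


if __name__ == "__main__":
    recordings = {
        "alice": {0: Sample("a0", "noop"), 1: Sample("a1", "forward"),
                  2: Sample("a2", "place_block")},
        "bob": {1: Sample("b1", "noop"), 2: Sample("b2", "noop"),
                3: Sample("b3", "forward")},
    }
    for step in synchronize(recordings):
        print(step)
```

Dropping unmatched ticks is the simplest policy; a real collector might instead interpolate or re-request missing frames.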

Discussion & Reception

  • Broad attention: 58-61 resources tracked on alphaXiv
  • Multiple YouTube breakdowns and blog posts
  • Coverage in Chinese AI media (ai-bot.cn, QQ News)
  • From NYU (Saining Xie's group)

See Also

  • RE-TRAC-TrajectoryCompression — related paper on maintaining coherent state across trajectories
  • Direct Preference Optimization — another ML technique in the vault
