[HF Papers] minWM: New Open-Source Framework Builds Real-Time Interactive Video World Models

minWM provides a full-stack pipeline for turning video diffusion models into controllable, real-time interactive world models — now open-source on GitHub.

Robson Pereira · May 30, 2026 · 4 min read

minWM framework architecture for real-time interactive video world models.

Researchers from **Shengshu AI** have released **minWM**, a full-stack open-source framework for building real-time interactive video world models. The paper, posted to arXiv (2605.30263) and featured on Hugging Face Daily Papers on May 30, 2026, addresses a hard open problem: turning video generation models into world models that respond to user input in real time.

What minWM does

Recent video diffusion foundation models (like Wan2.1, HY1.5, and Sora-class models) can generate stunning video, but they are not interactive world models. An interactive world model needs to be:

**Controllable** — respond to camera motion and user inputs
**Causal** — generate frame-by-frame rather than all at once
**Low-latency** — fast enough for real-time interaction
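To make the low-latency requirement concrete, here is a rough budget calculation. It is only a sketch: the frame rate, per-step cost, and overhead below are hypothetical numbers chosen for illustration, not measurements from the minWM paper. In practice, video models often emit several frames per chunk, so the budget would apply per chunk rather than per frame.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available to produce one frame at the target frame rate."""
    return 1000.0 / target_fps


def max_denoising_steps(target_fps: float, step_ms: float, overhead_ms: float = 0.0) -> int:
    """Largest number of denoising steps per frame that still fits the budget."""
    usable = frame_budget_ms(target_fps) - overhead_ms
    if usable <= 0:
        return 0
    return int(usable // step_ms)


if __name__ == "__main__":
    # Hypothetical figures: 16 fps target, 12 ms per network evaluation,
    # 5 ms of fixed overhead (decoding, I/O) per frame.
    print(frame_budget_ms(16))                  # 62.5
    print(max_denoising_steps(16, 12.0, 5.0))   # 4
```

Under these assumed numbers, only about four network evaluations fit into each frame, which is why a sampler that needs dozens of steps cannot be used interactively without distillation.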

minWM provides an end-to-end pipeline that converts existing bidirectional text-to-video and image-to-video foundation models into camera-controllable, few-step autoregressive world models. The key technical contributions include:

**Causal Forcing / Causal Forcing++** pipeline for autoregressive training
**Causal ODE and causal consistency distillation** for reducing inference steps
**Asymmetric DMD** (distribution matching distillation) for preserving quality in few-step generation
Support for both cross-attention and MMDiT-style architectures
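The inference-time shape of a camera-controllable, few-step autoregressive model can be sketched with a toy example. This is not minWM's code or API, and it does not implement Causal Forcing or DMD, which are training procedures. `toy_denoiser`, `generate_frame`, and `rollout` are hypothetical names, and the "model" is a trivial stand-in that shifts the previous frame by the camera action.

```python
import random
from typing import List, Sequence

Frame = List[float]  # stand-in for one latent frame


def toy_denoiser(noisy: Frame, context: Sequence[Frame], camera: Sequence[float], t: float) -> Frame:
    """Hypothetical causal denoiser: sees only past frames and the current action.

    A real model would use `noisy` and `t`; this toy just predicts
    "previous frame shifted by the camera action".
    """
    prev = context[-1] if context else [0.0] * len(noisy)
    return [p + c for p, c in zip(prev, camera)]


def generate_frame(context, camera, num_steps=4, dim=3, rng=None) -> Frame:
    """Few-step sampling: start from noise and move toward the prediction."""
    rng = rng or random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i in range(num_steps):
        t = 1.0 - i / num_steps              # current noise level
        t_next = 1.0 - (i + 1) / num_steps   # reaches 0.0 on the last step
        pred = toy_denoiser(x, context, camera, t)
        ratio = t_next / t
        x = [ratio * xi + (1.0 - ratio) * pi for xi, pi in zip(x, pred)]
    return x


def rollout(camera_actions, num_steps=4, dim=3) -> List[Frame]:
    """Autoregressive loop: each new frame depends only on earlier frames."""
    context: List[Frame] = []
    rng = random.Random(0)
    for camera in camera_actions:  # actions arrive one at a time, as from a user
        frame = generate_frame(context, camera, num_steps, dim, rng)
        context.append(frame)      # the new frame becomes history for the next
    return context


if __name__ == "__main__":
    frames = rollout([[1, 0, 0], [1, 0, 0], [0, 2, 0]])
    print(frames[-1])  # [2.0, 2.0, 0.0]
```

The point of the sketch is the control flow: each frame is produced from past frames plus the latest action in a small, fixed number of steps, so a user's input can change what comes next. A bidirectional model, by contrast, would need the whole action sequence up front.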

The framework ships with runnable scripts, pre-trained checkpoints, documentation, and inference code.

Why this matters for self-hosted AI

Interactive world models represent a step change in what local AI can do. Instead of generating a video clip from a prompt (one-shot, non-interactive), a world model lets you explore a generated environment — change the camera angle, introduce new elements, and see causally consistent responses in real time.

For self-hosted AI builders, the implications span gaming, simulation, robotics training data generation, and interactive content creation. The minWM framework is designed to be architecture-extensible, meaning it can wrap different video foundation models — including ones you can run locally on consumer hardware.

The paper also provides practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements, making it easier for teams to adapt the framework to their own hardware constraints.

Open-source release

minWM is available on GitHub under a permissive license. The framework instantiates its pipeline on two representative backbones — **Wan2.1-T2V-1.3B** and **HY1.5-TI2V-8B** — covering both small (1.3B parameter) and medium (8B parameter) model sizes that can run on consumer GPUs.
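As a rough check on what "consumer GPU" means for these two sizes, the memory needed just to hold the weights can be estimated from the parameter counts in the model names. This is back-of-the-envelope arithmetic, assuming 16-bit weights. It ignores activations, attention caches, the text encoder, and the VAE, all of which add to the real footprint.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone, in GiB (2 bytes per parameter = fp16/bf16)."""
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    # Parameter counts taken from the model names; everything else is an assumption.
    for name, params in [("Wan2.1-T2V-1.3B", 1.3e9), ("HY1.5-TI2V-8B", 8e9)]:
        print(f"{name}: ~{weight_memory_gib(params):.1f} GiB of weights at 16-bit")
    # Wan2.1-T2V-1.3B: ~2.4 GiB of weights at 16-bit
    # HY1.5-TI2V-8B: ~14.9 GiB of weights at 16-bit
```

By this estimate the 1.3B backbone fits comfortably on most recent consumer cards, while the 8B backbone would likely need a high-memory card or quantization once the other components are counted.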

The project also supports adapting existing video world models like HY-WorldPlay to new data distributions and latency targets, making it a practical tool for researchers and engineers building on top of video generation technology.

Source

arXiv: minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
GitHub: shengshu-ai/minWM
Hugging Face Papers: Daily Papers 2026-05-30
