[HF Papers] minWM: New Open-Source Framework Builds Real-Time Interactive Video World Models
minWM provides a full-stack pipeline for turning video diffusion models into controllable, real-time interactive world models — now open-source on GitHub.

Researchers from **Shengshu AI** have released **minWM**, a full-stack open-source framework for building real-time interactive video world models. The project, published on arXiv (2605.30263) and featured on Hugging Face Daily Papers on 2026-05-30, tackles a hard open problem: turning video generation models into world models that run interactively in real time.
What minWM does
Recent video diffusion foundation models (like Wan2.1, HY1.5, and Sora-class models) can generate high-quality video, but they are not interactive world models: they typically denoise a whole clip at once and cannot react to input while generating. An interactive world model needs to be:
- **Controllable** — respond to camera motion and user inputs
- **Causal** — generate frame-by-frame rather than all at once
- **Low-latency** — fast enough for real-time interaction
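To make the "causal" requirement concrete, here is a minimal sketch of a frame-level (block-causal) attention mask, one common way to stop a transformer from looking at future frames. This is an illustration only: the function name `block_causal_mask` and the token layout are assumptions, and the article does not say that minWM builds its masks this way.

```python
from typing import List


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens are assumed to be laid out frame by frame. A token may attend to
    every token in its own frame and in earlier frames, never to later frames.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        mask.append([(k // tokens_per_frame) <= q_frame for k in range(total)])
    return mask


if __name__ == "__main__":
    # 3 frames, 2 tokens each: "x" = attention allowed, "." = blocked.
    for row in block_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("x" if allowed else "." for allowed in row))
    # prints:
    # xx....
    # xx....
    # xxxx..
    # xxxx..
    # xxxxxx
    # xxxxxx
```

A bidirectional model corresponds to a mask that is all "x"; the staircase shape is what lets frames be produced one after another.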
minWM provides an end-to-end pipeline that converts existing bidirectional text-to-video and image-to-video foundation models into camera-controllable, few-step autoregressive world models. The key technical contributions include:
- **Causal Forcing / Causal Forcing++** pipeline for autoregressive training
- **Causal ODE and causal consistency distillation** for reducing inference steps
- **Asymmetric DMD** for few-step generation quality
- Support for both cross-attention and MMDiT-style architectures
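Why the number of inference steps matters can be shown with simple arithmetic. The sketch below checks whether a sampler fits a real-time budget. All numbers (40 ms per step, 16 fps, 4-frame chunks) and all function names are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / target_fps


def chunk_latency_ms(num_steps: int, ms_per_step: float, overhead_ms: float = 0.0) -> float:
    """Time to generate one chunk: one model call per denoising step, plus fixed overhead."""
    return num_steps * ms_per_step + overhead_ms


def meets_realtime(num_steps: int, ms_per_step: float, frames_per_chunk: int,
                   target_fps: float, overhead_ms: float = 0.0) -> bool:
    """True if a chunk is generated no slower than it is played back."""
    budget = frames_per_chunk * frame_budget_ms(target_fps)
    return chunk_latency_ms(num_steps, ms_per_step, overhead_ms) <= budget


if __name__ == "__main__":
    # Hypothetical: each denoising step costs 40 ms, chunks hold 4 frames, target is 16 fps.
    print(meets_realtime(num_steps=50, ms_per_step=40.0, frames_per_chunk=4, target_fps=16))  # False
    print(meets_realtime(num_steps=4, ms_per_step=40.0, frames_per_chunk=4, target_fps=16))   # True
```

Under these assumed numbers the playback budget per chunk is 250 ms, so a 50-step sampler (2000 ms) misses it by a wide margin while a 4-step sampler (160 ms) fits.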
The framework ships with runnable scripts, pre-trained checkpoints, documentation, and inference code.
Why this matters for self-hosted AI
Interactive world models represent a step change in what local AI can do. Instead of generating a video clip from a prompt (one-shot, non-interactive), a world model lets you explore a generated environment — change the camera angle, introduce new elements, and see causally consistent responses in real time.
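The interaction loop described above can be sketched as a toy autoregressive rollout: each new frame is generated from the frames so far plus the latest camera action, using only a few denoising steps. Everything here is hypothetical, including `toy_denoiser`, `generate_frame`, `rollout`, and the timestep schedule. It is not the minWM API, and the "model" is a stand-in that simply shifts the previous frame by the action.

```python
import random
from typing import Callable, List, Sequence

Latent = List[float]
Denoiser = Callable[[Latent, Sequence[Latent], Sequence[float], float], Latent]


def toy_denoiser(noisy: Latent, history: Sequence[Latent],
                 action: Sequence[float], t: float) -> Latent:
    """Hypothetical stand-in for a causal video model: predicts a clean frame latent."""
    prev = history[-1] if history else [0.0] * len(noisy)
    # Toy world: the next frame is the previous frame shifted by the camera action.
    return [p + a for p, a in zip(prev, action)]


def generate_frame(denoise: Denoiser, history: Sequence[Latent], action: Sequence[float],
                   dim: int, rng: random.Random,
                   timesteps: Sequence[float] = (1.0, 0.75, 0.5, 0.25)) -> Latent:
    """Few-step sampling: predict a clean frame, then re-noise it to the next, lower noise level."""
    latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i, t in enumerate(timesteps):
        clean = denoise(latent, history, action, t)
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else 0.0
        latent = [(1.0 - t_next) * c + t_next * rng.gauss(0.0, 1.0) for c in clean]
    return latent


def rollout(denoise: Denoiser, actions: Sequence[Sequence[float]],
            dim: int, window: int = 8, seed: int = 0) -> List[Latent]:
    """Generate one frame per action, conditioning only on past frames (causal)."""
    rng = random.Random(seed)
    history: List[Latent] = []
    for action in actions:  # in a live system, each action would arrive from the user
        frame = generate_frame(denoise, history[-window:], action, dim, rng)
        history.append(frame)
    return history


if __name__ == "__main__":
    camera_moves = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # right, right, up
    for frame in rollout(toy_denoiser, camera_moves, dim=2):
        print(frame)
```

Because each frame depends only on earlier frames and the current action, a user can change the action at any step and the following frames respond to it. A one-shot clip generator has no such entry point.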
For self-hosted AI builders, the implications span gaming, simulation, robotics training data generation, and interactive content creation. The minWM framework is designed to be architecture-extensible, meaning it can wrap different video foundation models — including ones you can run locally on consumer hardware.
The paper also provides practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements, making it easier for teams to adapt the framework to their own hardware constraints.
Open-source release
minWM is available on GitHub under a permissive license. The framework instantiates its pipeline on two representative backbones — **Wan2.1-T2V-1.3B** and **HY1.5-TI2V-8B** — covering both small (1.3B parameter) and medium (8B parameter) model sizes that can run on consumer GPUs.
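As a rough sanity check on those model sizes, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The helper below is a back-of-the-envelope sketch: the function name and the choice of 16-bit weights are assumptions, and it ignores activations, cached context, and any text encoder or VAE, so real usage will be higher.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in [("Wan2.1-T2V-1.3B", 1.3e9), ("HY1.5-TI2V-8B", 8e9)]:
        print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights in bf16")
    # prints:
    # Wan2.1-T2V-1.3B: ~2.6 GB of weights in bf16
    # HY1.5-TI2V-8B: ~16.0 GB of weights in bf16
```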
The project also supports adapting existing video world models like HY-WorldPlay to new data distributions and latency targets, making it a practical tool for researchers and engineers building on top of video generation technology.
Sources
- arXiv: minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
- GitHub: shengshu-ai/minWM
- Hugging Face Papers: Daily Papers 2026-05-30

