[HF Papers] minWM: New Open-Source Framework Builds Real-Time Interactive Video World Models

minWM provides a full-stack pipeline for turning video diffusion models into controllable, real-time interactive world models — now open-source on GitHub.

Robson Pereira | May 30, 2026 | 4 min read
minWM framework architecture for real-time interactive video world models.

Researchers from **Shengshu AI** have released **minWM**, a full-stack open-source framework for building real-time interactive video world models. The project, published on arXiv today (arXiv:2605.30263) and featured on Hugging Face Daily Papers, addresses a hard open problem: turning video generation models into world models that run interactively in real time.

What minWM does

Recent video diffusion foundation models (like Wan2.1, HY1.5, and Sora-class models) can generate stunning video, but they are not interactive world models. An interactive world model needs to be:

  • **Controllable** — respond to camera motion and user inputs
  • **Causal** — generate frame-by-frame rather than all at once
  • **Low-latency** — fast enough for real-time interaction
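To make these three properties concrete, here is a minimal sketch of an interactive rollout loop. It is not minWM code: `toy_next_frame`, `rollout`, and `frame_budget_ms` are hypothetical names, and the "model" is a toy function that shifts the previous frame by the action. The point is only the shape of the loop: one action in, one frame out, with no access to future frames.

```python
from typing import Callable, List, Sequence

Frame = List[float]        # stand-in for an image or latent frame
Action = Sequence[float]   # stand-in for a camera motion command


def toy_next_frame(history: List[Frame], action: Action) -> Frame:
    """Hypothetical causal model: shifts the most recent frame by the action."""
    return [value + action[0] for value in history[-1]]


def rollout(
    first_frame: Frame,
    actions: Sequence[Action],
    step_fn: Callable[[List[Frame], Action], Frame] = toy_next_frame,
    context: int = 4,
) -> List[Frame]:
    """Generate one frame per action, conditioning only on past frames."""
    frames = [first_frame]
    for action in actions:
        window = frames[-context:]   # causal: no future frames are visible
        frames.append(step_fn(window, action))
    return frames


def frame_budget_ms(fps: float) -> float:
    """Time available to produce each frame at a target frame rate."""
    return 1000.0 / fps


frames = rollout([0.0, 1.0], [(1.0,), (1.0,), (-2.0,)])
print(frames)
print(frame_budget_ms(20))
```

In a real system `step_fn` would be the distilled video model, and its per-frame cost would have to fit inside the budget returned by `frame_budget_ms` for the target frame rate.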

minWM provides an end-to-end pipeline that converts existing bidirectional text-to-video and image-to-video foundation models into camera-controllable, few-step autoregressive world models. The key technical contributions include:

  • **Causal Forcing / Causal Forcing++** pipeline for autoregressive training
  • **Causal ODE and causal consistency distillation** for reducing inference steps
  • **Asymmetric DMD** for few-step generation quality
  • Support for both cross-attention and MMDiT-style architectures
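One common way to make a bidirectional video transformer causal is a block-causal attention mask, in which every token may attend to tokens in its own frame and in earlier frames, but never to later ones. The sketch below illustrates that general idea only; the article does not say how minWM implements causality, and `block_causal_mask` is a hypothetical helper, not part of the framework.

```python
from typing import List


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Tokens are laid out frame by frame. A token sees its whole own frame
    plus every earlier frame, and nothing from later frames.
    """
    total = num_frames * tokens_per_frame
    return [
        [(j // tokens_per_frame) <= (i // tokens_per_frame) for j in range(total)]
        for i in range(total)
    ]


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    # Print 1 for "may attend" and 0 for "masked out".
    print("".join("1" if allowed else "0" for allowed in row))
```

With this layout the first frame's rows allow only the first two columns, while the last frame's rows allow every column, which is what lets frames be generated one after another.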

The framework ships with runnable scripts, pre-trained checkpoints, documentation, and inference code.

Why this matters for self-hosted AI

Interactive world models represent a step change in what local AI can do. Instead of generating a video clip from a prompt (one-shot, non-interactive), a world model lets you explore a generated environment — change the camera angle, introduce new elements, and see causally consistent responses in real time.

For self-hosted AI builders, the implications span gaming, simulation, robotics training data generation, and interactive content creation. The minWM framework is designed to be architecture-extensible, meaning it can wrap different video foundation models — including ones you can run locally on consumer hardware.

The paper also provides practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements, making it easier for teams to adapt the framework to their own hardware constraints.

Open-source release

minWM is available on GitHub under a permissive license. The framework instantiates its pipeline on two representative backbones — **Wan2.1-T2V-1.3B** and **HY1.5-TI2V-8B** — covering both small (1.3B parameter) and medium (8B parameter) model sizes that can run on consumer GPUs.
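As a rough sanity check on the consumer-GPU claim, the snippet below estimates the memory needed just to hold each backbone's weights. It assumes 16-bit weights (2 bytes per parameter) and takes the parameter counts from the nominal model names; activations, cached context, and any text or image encoders are not counted, so real usage will be higher. `weight_memory_gb` is a hypothetical helper written for this illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts, read off the model names (an assumption).
backbones = {
    "Wan2.1-T2V-1.3B": 1.3e9,
    "HY1.5-TI2V-8B": 8e9,
}

for name, params in backbones.items():
    print(f"{name}: about {weight_memory_gb(params):.1f} GB of weights at 16-bit")
```

Under these assumptions the smaller backbone needs about 2.6 GB for weights and the larger about 16 GB, so the 8B model is more likely to require a high-memory card or reduced precision.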

The project also supports adapting existing video world models like HY-WorldPlay to new data distributions and latency targets, making it a practical tool for researchers and engineers building on top of video generation technology.
