[News] Liquid AI Releases LFM2.5-8B-A1B: On-Device MoE with 128K Context

Liquid AI's new LFM2.5-8B-A1B packs 8B total parameters (1B active) with a 128K context window, trained on 38 trillion tokens, and runs on llama.cpp, MLX, vLLM and SGLang from day one.

Robson Pereira | May 31, 2026 | 4 min read

**Liquid AI** has released **LFM2.5-8B-A1B**, a new hybrid Mixture-of-Experts model designed specifically for on-device deployment. The model is now available on Hugging Face under the LFM 1.0 license and supports all major local inference engines from day one.

Key specifications

| Spec | Value |
|------|-------|
| Total parameters | 8B |
| Active parameters | 1B (MoE architecture) |
| Context window | 128,000 tokens (up from 32K in LFM2) |
| Training data | 38 trillion tokens (up from 12T) |
| Vocabulary | Expanded 65K → 130K BPE tokenizer |
| Pipeline | text-generation |
| Languages | EN, AR, ZH, FR, DE, JA, KO, ES, PT |

What's new in LFM2.5

This release builds on the LFM2-8B-A1B from October 2025 with three major upgrades:

**Scaled pretraining.** Liquid AI roughly tripled the training corpus, from 12 trillion to 38 trillion tokens, giving the model broader knowledge coverage while keeping the compact 1B-active architecture.

**128K context window.** The model now handles documents up to 128K tokens, enabled through a two-phase extension: a 2T-token mid-training phase focused on reasoning, math, and tool-use at 32K, followed by extension to 128K via targeted continued training.

**Reasoning-only inference.** Unlike its predecessor, LFM2.5-8B-A1B produces an explicit chain of thought before every final answer. This addresses the compute-bound nature of MoE architectures — the reasoning trace compensates for the small active parameter count.
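Because every response starts with a reasoning trace, applications that only want the final answer need to separate the two. The sketch below shows one way to do that in Python. It assumes the trace is wrapped in `<think>...</think>` tags, which this article does not confirm; check the model card or chat template for the actual delimiters. The helper name `split_reasoning` is hypothetical.

```python
import re

# ASSUMPTION: the reasoning trace is delimited by <think>...</think>.
# The article does not specify the delimiters; adjust to the real chat template.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw completion string."""
    match = THINK_PATTERN.search(text)
    if match is None:
        # No trace found: treat the whole completion as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched trace is kept as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print(answer)
```

Keeping the trace rather than discarding it can help when debugging looping or unexpectedly long outputs.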

Performance highlights

  • **Hallucination reduction:** A targeted RL stage using an avg@k-based reward lowered hallucination rates while maintaining accuracy
  • **Agentic benchmarks:** Competitive with much larger models on Tau2-Telecom and other agentic harnesses
  • **Instruction following:** Matches bigger dense and MoE models on instruction-following tasks
  • **Doom loop mitigation:** A preference optimisation stage reduces long reasoning trace looping
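For readers unfamiliar with the avg@k metric mentioned in the hallucination bullet, it is usually read as the mean score across k sampled responses to the same prompt. The Python sketch below illustrates only that metric. How Liquid AI grades each response and turns the metric into an RL reward is not described here, so the grading values and the function name `avg_at_k` are hypothetical.

```python
def avg_at_k(scores: list[float]) -> float:
    """Mean score over k sampled responses to one prompt."""
    if not scores:
        raise ValueError("need at least one sampled response")
    return sum(scores) / len(scores)


# Hypothetical grading of k = 4 samples:
# 1.0 = factually correct, 0.0 = hallucinated.
samples = [1.0, 0.0, 1.0, 1.0]
print(avg_at_k(samples))  # 0.75
```

Averaging over several samples rewards a model for being consistently correct, not for getting one lucky sample right.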

Self-hosted deployment

LFM2.5-8B-A1B has day-one support for every major local inference engine:

  • **llama.cpp** and **GGUF** — ideal for CPU and mixed CPU/GPU setups
  • **MLX** — optimised for Apple Silicon
  • **vLLM** and **SGLang** — for GPU-heavy production deployments

With only 1B active parameters, this model is an excellent candidate for running on consumer hardware. See Best Hardware for Self-Hosted AI for sizing guidance, and GGUF Quantisation Guide for compression options.
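As a rough sizing aid, the Python sketch below estimates how much memory the weights alone would need at a few bit widths. It assumes all 8B parameters stay resident in memory, as is typical for MoE inference without expert offloading, even though only about 1B are used per token. The bit widths are illustrative round numbers; real quantised formats carry extra per-block metadata, and the KV cache and runtime overhead come on top.

```python
TOTAL_PARAMS = 8e9  # total parameter count from the spec table


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores KV cache, activations, and quantisation metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative bit widths, not specific quantisation formats.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = weight_memory_gb(TOTAL_PARAMS, bits)
    print(f"{label}: ~{size:.1f} GB of weights")
```

Under these assumptions the weights take about 16 GB at 16-bit, 8 GB at 8-bit, and 4 GB at 4-bit. The small active parameter count mainly reduces compute per token, not the memory needed to hold the model.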

For comparison with other compact models, see our coverage of Phi-4 and Gemma 3.
