Models
[News] Liquid AI Releases LFM2.5-8B-A1B: On-Device MoE with 128K Context
Liquid AI's new LFM2.5-8B-A1B packs 8B total parameters (1B active) with a 128K context window, trained on 38 trillion tokens, and runs on llama.cpp, MLX, vLLM and SGLang from day one.

[News] Liquid AI Releases LFM2.5-8B-A1B: On-Device MoE with 128K Context
**Liquid AI** has released **LFM2.5-8B-A1B**, a new hybrid Mixture-of-Experts model designed specifically for on-device deployment. The model is now available on Hugging Face under the LFM 1.0 license and supports all major local inference engines from day one.
Key specifications
| Spec | Value |
|------|-------|
| Total parameters | 8B |
| Active parameters | 1B (MoE architecture) |
| Context window | 128,000 tokens (up from 32K in LFM2) |
| Training data | 38 trillion tokens (up from 12T) |
| Vocabulary | Expanded 65K → 130K BPE tokenizer |
| Pipeline | text-generation |
| Languages | EN, AR, ZH, FR, DE, JA, KO, ES, PT |
What's new in LFM2.5
This release builds on the LFM2-8B-A1B from October 2025 with three major upgrades:
**Scaled pretraining.** Liquid AI tripled the training corpus from 12 trillion to 38 trillion tokens, giving the model significantly broader knowledge coverage while keeping the compact 1B-active architecture.
**128K context window.** The model now handles documents up to 128K tokens, enabled through a two-phase extension: a 2T-token mid-training phase focused on reasoning, math, and tool-use at 32K, followed by extension to 128K via targeted continued training.
**Reasoning-only inference.** Unlike its predecessor, LFM2.5-8B-A1B produces an explicit chain of thought before every final answer. This addresses the compute-bound nature of MoE architectures — the reasoning trace compensates for the small active parameter count.
Performance highlights
- **Hallucination reduction:** A targeted RL stage using avg@k-based reward achieved significantly lower hallucination rates while maintaining accuracy
- **Agentic benchmarks:** Competitive with much larger models on Tau2-Telecom and other agentic harnesses
- **Instruction following:** Matches bigger dense and MoE models on instruction-following tasks
- **Doom loop mitigation:** A preference optimisation stage reduces long reasoning trace looping
Self-hosted deployment
LFM2.5-8B-A1B has day-one support for every major local inference engine:
- **llama.cpp** and **GGUF** — ideal for CPU and mixed CPU/GPU setups
- **MLX** — optimised for Apple Silicon
- **vLLM** and **SGLang** — for GPU-heavy production deployments
With only 1B active parameters, this model is an excellent candidate for running on consumer hardware. See Best Hardware for Self-Hosted AI for sizing guidance, and GGUF Quantisation Guide for compression options.
For comparison with other compact models, see our coverage of Phi-4 and Gemma 3.
**Sources:**


