Models

How to Run Liquid AI LFM2.5-8B-A1B Locally: A New MoE Model for Consumer Hardware

Liquid AI's LFM2.5-8B-A1B is an 8B-parameter MoE model with only 1B active parameters, trained on 38T tokens with 128K context — and it runs on consumer hardware via llama.cpp and GGUF.

Robson PereiraMay 30, 202610 min read
Liquid AI LFM2.5-8B-A1B MoE model architecture running on local hardware.

How to Run Liquid AI LFM2.5-8B-A1B Locally: A New MoE Model for Consumer Hardware

**Liquid AI** released **LFM2.5-8B-A1B** on 28 May 2026 — a mixture-of-experts model that packs 8 billion total parameters but activates only 1 billion per token, trained on 38 trillion tokens with a 128,000-token context window. It is the fastest model in its size class on both CPU and GPU inference and ships with day-one GGUF support for llama.cpp, MLX, vLLM, and SGLang.

For the self-hosted AI community, this is significant: it matches or beats models 3–5× its size on instruction following and agentic benchmarks while running on a laptop with under 6 GB of memory.

What makes LFM2.5-8B-A1B different

The model architecture combines MoE (Mixture of Experts), GQA (Grouped Query Attention), and gated short convolution blocks — but the headline feature is the **1B active parameter count**. Only one billion parameters activate per token, which means inference stays fast even on consumer hardware.

Key improvements over the previous LFM2-8B-A1B:

  • **Context window:** expanded from 32K to 128K tokens
  • **Vocabulary:** doubled from 65K to 128K tokens for better non-Latin script support
  • **Reasoning:** now a reasoning-only model with explicit chain-of-thought before answers
  • **Training data:** scaled from 12T to 38T tokens with large-scale reinforcement learning
  • **Anti-hallucination:** targeted RL stage using an AA-Omniscience-based reward to reduce hallucinations — a known weakness in edge models

The model comes in two variants: **LFM2.5-8B-A1B-Base** (the pretrained foundation) and **LFM2.5-8B-A1B** (post-trained for chat, tool calling, and instruction following).

Hardware requirements

The active-parameter sparsity means you can run this on surprisingly modest hardware. Liquid AI reports:

  • **Laptop-class chips:** 253 tokens/second on an Apple M5 Max, 146 tok/s on an AMD Ryzen AI Max+ 395, both under 6 GB memory
  • **Desktop GPU (single H100):** 6,112 tok/s at batch size 256 output throughput
  • **CPU-only:** functional on modern laptops with llama.cpp

For comparison, previous models of this capability class typically required 16–32 GB VRAM and a dedicated GPU. If you have already read Best Hardware for Self-Hosted AI, you will recognise that this changes the hardware calculus significantly.

How to run it with Ollama and llama.cpp

The day-one GGUF support means you can run LFM2.5-8B-A1B with the same tools you already use for other local models.

Step 1: Install or update Ollama

If you do not have Ollama yet, follow How to Run Llama 3 Locally with Ollama for the base setup. Then update to the latest version:

```bash

ollama pull hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF

```

Ollama automatically detects and uses the correct GGUF quantization for your hardware.

Step 2: Run the model

```bash

ollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF

```

Test it with a tool-calling prompt to see the model's reasoning mode in action:

```bash

ollama run hf.co/LiquidAI/LFM2.5-8B-A1B-GGUF "List the files in the current directory, read config.json, and summarise its contents"

```

Step 3: Manual llama.cpp setup

For direct control over inference parameters:

```bash

Clone or update llama.cpp

git clone https://github.com/ggerganov/llama.cpp

cd llama.cpp

make -j

Download the GGUF

wget https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF/resolve/main/LFM2.5-8B-A1B-Q4_K_M.gguf

Run with 128K context

./llama-cli -m LFM2.5-8B-A1B-Q4_K_M.gguf -c 131072 -p "Your prompt here"

```

Performance comparison

On the AA-Omniscience Index (which rewards correct answers and penalises hallucinations), LFM2.5-8B-A1B achieves a **non-hallucination rate** that competes with models like Granite-4.0-H-Tiny, Qwen3-30B-A3B-Thinking, and Gemma-4-26B-A4B-IT — despite using 3–26× fewer active parameters.

On agentic benchmarks, it is particularly strong on **Tau2-Telecom** and general instruction-following tasks, matching bigger dense and MoE models. For a broader comparison of local inference engines that support GGUF, read Ollama vs vLLM vs llama.cpp.

Tool calling and agentic workflows

LFM2.5-8B-A1B was explicitly designed for tool calling. Liquid AI's open-source desktop agent demo runs the model with 67 tools across 13 MCP servers entirely on a single laptop — no cloud, no API keys, no data leaving the machine. The company reports that tool selection is noticeably faster and more reliable compared to the previous generation.

This makes LFM2.5-8B-A1B a strong candidate for local agent workflows. If you are building AI agents locally, see OpenMonoAgent.ai: Set Up a Local-First Coding Agent for a practical walkthrough. For a comparison of different coding agent tools, read Claude Code vs Codex vs Kimi Code.

Conclusion

Liquid AI LFM2.5-8B-A1B changes what is possible on consumer hardware. A model that activates only 1B parameters per token, trained on 38T tokens with 128K context, and competitive with models 3–5× its size — all running on a laptop under 6 GB of memory — represents a genuine step forward for self-hosted AI.

The day-one GGUF support means the local model ecosystem already supports it. Pull the model, test it on your own tasks, and see whether the MoE architecture delivers the speed and quality Liquid AI promises.

Source

Related articles