Models

Liquid AI LFM2.5-8B-A1B: The 1B-Active-Parameter MoE Model That Runs on Consumer Hardware

Liquid AI released LFM2.5-8B-A1B, a Mixture-of-Experts model with only 1B active parameters per token, a 128K context window, and day-one support for llama.cpp — making it one of the most efficient models for local inference.

Robson PereiraMay 30, 20269 min read
Liquid AI LFM2.5-8B-A1B benchmark results on consumer hardware

Liquid AI LFM2.5-8B-A1B: The 1B-Active-Parameter MoE Model That Runs on Consumer Hardware

Liquid AI has released **LFM2.5-8B-A1B**, the latest version of their Mixture-of-Experts (MoE) model designed for on-device AI. With only 1 billion active parameters per token out of 8 billion total, a 128K context window, and day-one support for llama.cpp, MLX, vLLM, and SGLang, this is one of the most practical local AI model releases this year.

The model hit the front page of Hacker News within hours of its announcement, and for good reason: it demonstrates that efficient MoE architectures can deliver frontier-competitive performance on consumer hardware.

What makes LFM2.5-8B-A1B different

The headline number is the **1B active parameters**. In a traditional dense model, all parameters are used for every token. In this MoE (Mixture-of-Experts) architecture, the model has specialised "expert" sub-networks, and only a subset is activated per token. This means you get the knowledge capacity of an 8B-parameter model with the compute cost of a 1B-parameter model.

Key improvements over the previous LFM2-8B-A1B release:

  • **128K context window** (up from 32K), enabling processing of much longer documents.
  • **38T tokens of pretraining** (up from 12T), giving the model broader knowledge.
  • **128K vocabulary** (up from 65K), with dramatically better tokenisation for non-Latin scripts including Hindi, Thai, Vietnamese, and Arabic.
  • **Reasoning capability.** The model now produces an explicit chain of thought before its final answer, making it competitive on complex reasoning tasks.
  • **Reinforcement learning for reduced hallucination.** A targeted RL stage using an average-based reward function significantly reduces hallucination rates while maintaining accuracy.

Local performance that matters

The performance figures reported by Liquid AI are striking for a model that runs on consumer hardware:

  • **253 tokens/second** on an Apple M5 Max chip
  • **146 tokens/second** on an AMD Ryzen AI Max+ 395
  • Under **6 GB of memory usage** during inference

For context, that is faster than most dense 7B-8B models on the same hardware, despite having comparable or better output quality.

Day-one inference ecosystem support

LFM2.5-8B-A1B ships with immediate support across the entire open-source inference ecosystem:

  • **llama.cpp / llama-server** — the go-to runtime for CPU and GPU inference on consumer hardware
  • **MLX** — Apple Silicon-optimised inference
  • **vLLM** — GPU-accelerated serving for production throughput
  • **SGLang** — structured generation and reasoning support
  • **Liquid Apollo** — Liquid's own Edge AI Platform for iOS and Android deployment

This means you can start running the model within minutes on hardware you already own. No special setup, no proprietary runtime, no cloud dependency.

How to run it locally with llama.cpp

Because LFM2.5-8B-A1B ships with GGUF support on day one, running it locally is straightforward if you already have llama.cpp set up.

Using llama-server

```bash

Download the GGUF file from Hugging Face (search for "LFM2.5-8B-A1B-GGUF")

wget https://huggingface.co/liquid-ai/LFM2.5-8B-A1B-GGUF/resolve/main/lfm2.5-8b-a1b.Q4_K_M.gguf

Run with llama-server

llama-server -m lfm2.5-8b-a1b.Q4_K_M.gguf --ctx-size 32768 --n-gpu-layers 99 --port 8080

```

Using Ollama

If Ollama adds the model to its library (follow the Ollama model library), you will be able to pull and run it with:

```bash

ollama pull liquid-lfm2.5

ollama run liquid-lfm2.5

```

You can also create a custom Modelfile pointing to the GGUF file.

Using vLLM for production serving

For higher throughput or multi-user setups:

```bash

vllm serve liquid-ai/LFM2.5-8B-A1B --tensor-parallel-size 1 --max-model-len 32768

```

How it compares to other local-friendly models

The LFM2.5-8B-A1B occupies a unique position in the local model landscape:

  • **Vs Llama 3.1 8B (dense):** Similar total parameter count, but the MoE architecture means the Liquid model uses roughly 1/8th the compute per token for similar quality.
  • **Vs Qwen 2.5 7B (dense):** The 128K context window and reasoning capability give the Liquid model an edge on long-context and complex tasks.
  • **Vs Phi-4 (dense):** Phi-4 is strong for its size, but the Liquid model's MoE architecture scales better to larger knowledge capacity at inference time.
  • **Vs DeepSeek-V2-Lite (MoE):** Similar MoE philosophy, but LFM2.5 is optimised specifically for on-device deployment with aggressive memory optimisation.

For guidance on fitting models to your hardware, see How to Choose the Right Local Model Size.

Why MoE matters for self-hosted AI

MoE architectures are not new — Mixtral 8x22B and DeepSeek-V2 have demonstrated them at scale. What makes LFM2.5-8B-A1B different is that it brings MoE efficiency to the **consumer hardware tier**:

1. **Lower VRAM requirements.** With only 1B active parameters, the model fits in less than 6 GB of memory — well within reach of most consumer GPUs and even high-end laptops.

2. **Faster inference.** Fewer active parameters means fewer matrix operations per token, translating directly to higher tokens-per-second on the same hardware.

3. **Better multitasking.** The memory savings leave headroom for other services such as vector databases, chat UIs, and workflow automation running alongside the model.

To benchmark this model against your current setup, see How to Benchmark Local AI Performance Properly.

Practical use cases

  • **Long-document Q&A.** The 128K context window makes this model ideal for querying over manuals, research papers, or legal documents without needing a separate RAG pipeline. Combine it with Open WebUI for a chat interface — see Open WebUI Setup for Local Documents.
  • **Agentic tool calling.** Liquid AI specifically benchmarked this model on agentic tasks (Tau2-Bench, BFCL), where it competes with models much larger than itself. This makes it a strong candidate for the model backend behind tools like n8n workflows — see Build Your Own AI Assistant with n8n.
  • **Always-on chat assistant.** At under 6 GB memory usage, this model can run continuously alongside other services on a homelab server without dominating available VRAM.

What to watch out for

The model is very new, and community tooling (quantised variants, LoRA adapters, fine-tuning guides) will take a few weeks to mature. The GGUF support is solid, but expect more quantisation options and Ollama integration to appear over the coming days.

If you are comparing it against other local models, run your own benchmark prompts rather than relying solely on published numbers — your specific workload may favour a different architecture.

Conclusion

LFM2.5-8B-A1B is one of the most compelling local AI model releases this year. Its MoE architecture delivers frontier-competitive performance at consumer-hardware memory and compute budgets, and its day-one support across llama.cpp, MLX, vLLM, and SGLang means you can start using it immediately. For self-hosted AI practitioners, this is a model worth downloading and testing.

FAQ

How much VRAM do I need?

Around 6 GB for the Q4_K_M quantisation. This fits on most consumer GPUs with 8 GB or more.

Can it run on CPU only?

Yes. llama.cpp supports CPU-only inference, and the MoE architecture means fewer active parameters per token, making CPU inference more practical than with dense models of similar size.

Is the model open source?

Liquid AI publishes the model weights under a commercial license. Check the model card on Hugging Face for specific licensing terms.

Does it support function calling?

Yes. LFM2.5-8B-A1B was explicitly trained and benchmarked on tool-use and agentic tasks, making it suitable for workflow automation.

Related articles