
GGUF Quantisation Guide: Choosing the Right Format for Your Local LLM

Understand GGUF quantisation levels, choose Q2 through Q8 for your hardware, and balance quality against VRAM usage for every local model.

Robson Pereira | May 31, 2026 | 10 min read



GGUF is the dominant format for quantised local LLMs, and for good reason. It stores weights that have been compressed from full precision down to sizes that fit on consumer hardware, while preserving as much model quality as possible. Choosing the right quantisation level is one of the most important decisions you make when deploying a local model.

What GGUF does

GGUF is a file format designed by the llama.cpp project to store quantised model weights along with metadata. It replaces the earlier GGML format and adds support for more quantisation types, better metadata handling, and forward compatibility.

When you quantise a model to GGUF, you reduce the precision of each weight from 16-bit floating point to a lower bit width. A Q4 quantisation uses nominally 4 bits per weight, so the weights take roughly one-quarter of the memory of the FP16 model. In practice the figure is slightly higher, because each block of weights also stores a scale factor.
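To make the idea concrete, here is a minimal sketch of block-wise quantisation in plain Python. It is a simplified symmetric scheme written for illustration, not the exact algorithm llama.cpp uses for any GGUF type; the function names, the sample block, and the 32-weight block with a 16-bit scale are assumptions chosen for the example.

```python
def quantise_block(weights, bits=4):
    """Map one block of floats to signed integer codes plus a shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak else 1.0            # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantise_block(scale, codes):
    """Recover approximate floats from the codes and the block scale."""
    return [scale * c for c in codes]


def effective_bits(block_size=32, bits=4, scale_bits=16):
    """Bits per weight once the per-block scale is counted (illustrative)."""
    return (block_size * bits + scale_bits) / block_size


block = [0.12, -0.53, 0.08, 0.91, -0.33, 0.0, 0.47, -0.76]
scale, codes = quantise_block(block, bits=4)
restored = dequantise_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print(f"scale={scale:.4f}  max error={max_error:.4f}")
print("effective bits per weight:", effective_bits())
```

The rounding error per weight is bounded by half the block scale, which is why lower bit widths (a coarser grid) lose more detail. The per-block scale is also why a nominal 4-bit format ends up slightly above 4 bits per weight.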

For the foundational knowledge, read Quantisation Levels Explained for Real-World Local AI.

Quantisation levels explained

Q2 (2 bits per weight)

The most aggressive compression. Quality loss is noticeable for most tasks. Only use Q2 when you need to fit a model that would otherwise be impossible on your hardware. Suitable for experimentation or very lightweight tasks.

Q3 (3 bits per weight)

Moderate quality loss. Usable for simple chat, classification, and extraction tasks where reasoning quality is not critical. At a nominal 3 bits per weight, it needs roughly a quarter less VRAM than Q4.

Q4 (4 bits per weight) — recommended baseline

The sweet spot for most self-hosted deployments. Quality loss is minimal for most tasks, and the memory savings make larger models accessible. Q4 variants are the most widely available and best tested quantisation level.

Q5 (5 bits per weight)

Better quality than Q4 with a moderate increase in memory usage. Use Q5 when you have the VRAM headroom and want to preserve more of the original model's capability, especially for reasoning and coding tasks.

Q6 (6 bits per weight)

Quality is very close to the original FP16 model. The memory savings are smaller, but if you have the VRAM, Q6 is a safe choice for production workloads where quality consistency matters.

Q8 (8 bits per weight)

Nearly indistinguishable from the original model. The memory savings are modest — roughly half the FP16 size — but the quality preservation is excellent. Use Q8 when VRAM is not your primary constraint and you want maximum fidelity.

How VRAM translates to model size

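A rough formula: memory for the weights is approximately the parameter count multiplied by the bits per weight, divided by 8 to convert bits to bytes. At 16 bits per weight, a 7B model needs roughly 14 GB of memory. A Q4 version of the same model needs roughly 4-5 GB, and a Q8 version roughly 8-9 GB.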

For a 13B model: full precision needs ~26 GB, Q4 needs ~8-9 GB, Q8 needs ~14 GB.
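The arithmetic behind these figures can be sketched in a few lines. The function name is hypothetical, and the 4.5 and 8.5 bits-per-weight values are assumptions that account for per-block scale factors; actual GGUF variants differ slightly.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


# Assumed effective bits per weight for each level (illustrative values).
LEVELS = [("FP16", 16), ("Q8", 8.5), ("Q4", 4.5)]

for params in (7, 13):
    for name, bits in LEVELS:
        size = weight_memory_gb(params, bits)
        print(f"{params}B {name}: ~{size:.1f} GB")
```

This is a weights-only estimate, so it lands a little below the ranges quoted above. The context (KV cache) and runtime buffers need memory on top, which is why it pays to leave headroom.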

This is why quantisation is transformative for self-hosted AI. It can make a model that would require a datacentre GPU run on a consumer card.

For hardware sizing guidance, see How Much RAM Do You Need for Local LLMs?.

Choosing the right level for your workload

Interactive chat

Q4 or Q5 is the right range for interactive use. The quality is high enough that most users cannot tell the difference from the full model, and the smaller weights keep token generation fast.

Batch processing

Q6 or Q8 makes sense for batch processing where quality matters more than latency. The larger file size is less of a concern when you are not constrained by real-time interaction.

Classification and extraction

Q3 or Q4 is sufficient for simple classification and structured extraction tasks. These workloads are less sensitive to subtle quality differences.

Coding and reasoning

Q5 or Q6 is recommended for coding and reasoning tasks. These areas are more sensitive to quantisation noise, and the extra quality headroom makes a measurable difference.
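The recommendations in this section can be combined with a memory check. The sketch below is one possible way to do that, not a definitive rule: the workload keys, the nominal bits per weight, and the 2 GB default headroom for context are all assumptions made for the example.

```python
# Recommended ranges from the sections above, ordered lowest to highest.
RECOMMENDED = {
    "chat": ["Q4", "Q5"],
    "batch": ["Q6", "Q8"],
    "classification": ["Q3", "Q4"],
    "coding": ["Q5", "Q6"],
}

# Nominal bits per weight; real GGUF variants use slightly more (assumption).
BITS = {"Q2": 2, "Q3": 3, "Q4": 4, "Q5": 5, "Q6": 6, "Q8": 8}


def pick_level(workload, params_billions, vram_gb, headroom_gb=2.0):
    """Return the highest recommended level whose weights fit, else None."""
    budget = vram_gb - headroom_gb          # memory left after reserving context
    for level in reversed(RECOMMENDED[workload]):
        weights_gb = params_billions * BITS[level] / 8
        if weights_gb <= budget:
            return level
    return None                             # nothing in the range fits


print(pick_level("coding", 7, 8))    # 7B model on an 8 GB card
print(pick_level("batch", 13, 12))   # 13B model on a 12 GB card
print(pick_level("chat", 13, 8))     # 13B model on an 8 GB card
```

A result of None means no level in the recommended range fits; the options are then a smaller model, a lower level outside the range, or offloading part of the model to system RAM.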

Practical tips

Always test the quantised model against a set of representative prompts before committing to it
Use the same quantisation level across models when comparing performance
The best quantisation is the highest one that fits comfortably in your VRAM with room for context
GGUF models from reputable sources like TheBloke on Hugging Face are widely used and a reasonable starting point, but still run your own tests
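The first tip can be sketched as a small comparison harness. Everything here is illustrative: the two model functions are stand-ins for whatever runtime you use to call a reference model and a quantised one, and exact-match scoring is an assumption that suits classification or extraction better than free-form chat.

```python
def agreement_rate(prompts, reference_model, quantised_model):
    """Fraction of prompts where both models give the same normalised answer."""
    if not prompts:
        raise ValueError("need at least one prompt")
    matches = 0
    for prompt in prompts:
        expected = reference_model(prompt).strip().lower()
        actual = quantised_model(prompt).strip().lower()
        matches += expected == actual
    return matches / len(prompts)


# Stand-in models for demonstration only; replace with real inference calls.
def fake_reference(prompt):
    answers = {"2+2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


def fake_quantised(prompt):
    answers = {"2+2?": "4", "Capital of France?": "paris "}
    return answers.get(prompt, "unsure")


prompts = ["2+2?", "Capital of France?", "Largest planet?"]
rate = agreement_rate(prompts, fake_reference, fake_quantised)
print(f"agreement with reference: {rate:.0%}")
```

Running the same prompt set against each candidate level gives a like-for-like number to weigh against the memory saved.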

Conclusion

GGUF quantisation makes local LLMs practical on consumer hardware. Choosing the right level is a trade-off between quality and memory, but for most self-hosted deployments, Q4 or Q5 delivers excellent results without overcommitting resources.

FAQ

Can I convert any model to GGUF?

Most open-weight models can be converted to GGUF using the llama.cpp conversion scripts. Some architectures have specific requirements.
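As a rough sketch, conversion is usually two steps: convert the original checkpoint to a 16-bit GGUF file, then quantise that file. The helper below only builds the two command lines and does not run them. The script name convert_hf_to_gguf.py, the llama-quantize binary, the build/bin location, and the Q4_K_M type reflect recent llama.cpp checkouts, but these names have changed between versions, so treat them as assumptions and check your own checkout. The directory names are hypothetical.

```python
import shlex
from pathlib import Path


def build_commands(model_dir, out_dir, quant_type="Q4_K_M", llama_cpp_dir="llama.cpp"):
    """Return (convert_cmd, quantise_cmd) as argument lists; nothing is executed."""
    repo = Path(llama_cpp_dir)
    f16_file = Path(out_dir) / "model-f16.gguf"
    quant_file = Path(out_dir) / f"model-{quant_type}.gguf"

    # Step 1: original checkpoint -> 16-bit GGUF (script name is an assumption).
    convert_cmd = [
        "python", str(repo / "convert_hf_to_gguf.py"), str(model_dir),
        "--outfile", str(f16_file), "--outtype", "f16",
    ]
    # Step 2: 16-bit GGUF -> quantised GGUF (binary path is an assumption).
    quantise_cmd = [
        str(repo / "build" / "bin" / "llama-quantize"),
        str(f16_file), str(quant_file), quant_type,
    ]
    return convert_cmd, quantise_cmd


convert_cmd, quantise_cmd = build_commands("my-model", "out")
print(shlex.join(convert_cmd))
print(shlex.join(quantise_cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths have been verified.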

Does quantisation affect safety alignment?

Quantisation can sometimes reduce the effectiveness of safety alignment. Test your specific model and quantisation combination.

Which quantisation should beginners use?

Start with Q4. It is the most tested and most widely available level, and it delivers the best balance of quality and memory usage for most setups.

