Fine-Tuning Ollama for Maximum Performance on Low-Memory Hardware
Run local LLMs on machines with 8-16 GB RAM using quantisation, context reduction, layer offloading, and Ollama configuration tricks.

Running local AI on a machine with 8 GB or 16 GB of RAM is entirely possible, but you cannot throw just any model at it. The trick is understanding how Ollama handles memory, which configuration knobs actually help, and where to compromise without making the output useless.
This guide covers practical techniques for getting usable inference on older laptops, mini PCs, and budget homelab hardware.
Understanding Memory Constraints
Local LLM inference uses three memory pools:
| Memory pool | What it holds | Impact on low-RAM systems |
|-------------|---------------|---------------------------|
| **Model weights** | The neural network parameters | Fixed per quantisation level |
| **KV cache** | Attention context during generation | Scales with context length |
| **System overhead** | OS, other services, Ollama runtime | Typically 1-3 GB, depending on what else is running |
On an 8 GB machine, you have approximately 5-6 GB available after the OS and basic services. That means your model needs to fit in roughly 4 GB to leave room for a small context window.
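To put a number on that budget for a particular machine, one option is to read `MemAvailable` from `/proc/meminfo` and subtract a reserve for the KV cache. The sketch below assumes Linux; the function name `model_budget_mib` and the 1024 MiB default reserve are illustrative choices, not Ollama settings.

```bash
# Sketch: estimate how many MiB are left for model weights.
# Assumptions: Linux /proc/meminfo format; model_budget_mib is a hypothetical
# helper; the KV-cache reserve (default 1024 MiB) is a guess to tune per model.
model_budget_mib() {
  local reserve_mib="${1:-1024}" meminfo="${2:-/proc/meminfo}"
  # MemAvailable is reported in kB; convert to MiB and subtract the reserve.
  awk -v reserve="$reserve_mib" \
    '/^MemAvailable:/ { printf "%d\n", $2 / 1024 - reserve }' "$meminfo"
}

# Usage: model_budget_mib          # default 1024 MiB reserve
#        model_budget_mib 512      # smaller context, smaller reserve
```

With 5 GB reported as available and the default reserve, this returns 4096, which matches the roughly 4 GB figure above.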
For hardware recommendations that match these constraints, see Best Hardware for Self-Hosted AI.
Choosing the Right Quantisation
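Quantisation is your most powerful lever: storing each weight in fewer bits shrinks the model roughly in proportion, at some cost in output quality.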
| Quantisation | Bits per weight | 3B model size | 7B model size | 8 GB viability |
|-------------|----------------|---------------|---------------|----------------|
| Q4_K_M | 4.5 bits | ~2 GB | ~4.5 GB | 7B possible with small context |
| Q3_K_M | 3.5 bits | ~1.5 GB | ~3.5 GB | 7B comfortable |
| Q2_K | 2.5 bits | ~1 GB | ~2.5 GB | 7B with room for context |
| IQ2_XXS | 2.1 bits | ~0.9 GB | ~2 GB | 7B fits easily, but expect noticeable quality loss |
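The sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by eight to get bytes. A minimal sketch of that estimate is below. The helper name `est_weights_mb` is hypothetical, and because shell arithmetic is integer-only, bits per weight are passed multiplied by 10.

```bash
# Sketch: lower-bound size of the weights in MB (decimal).
# est_weights_mb is a hypothetical helper. Arguments: parameter count in
# millions, and bits per weight multiplied by 10 (integer maths only).
est_weights_mb() {
  local params_millions="$1" bits_x10="$2"
  # params * bits / 8 gives bytes; the divisor 80 also undoes the x10 scaling.
  echo $(( params_millions * bits_x10 / 80 ))
}

est_weights_mb 7000 45   # 7B at 4.5 bits per weight
est_weights_mb 3000 35   # 3B at 3.5 bits per weight
```

This prints 3937 and 1312, that is, about 3.9 GB and 1.3 GB. The results sit a little below the table's figures because the formula counts weights only, whereas real model files keep some tensors at higher precision and carry metadata. Treat the output as a floor, not an exact file size.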
Practical recommendations by RAM
**8 GB RAM:**
```bash
# Tags are illustrative; confirm the exact names in the Ollama library before pulling
ollama pull qwen3:1.5b-q4_k_m # ~1 GB, fast responses
ollama pull phi-4:3.8b-q3_k_m # ~1.5 GB, good quality
ollama pull gemma-3:4b-q2_k # ~1.5 GB, decent for its size
```
**16 GB RAM:**
```bash
# Tags are illustrative; confirm the exact names in the Ollama library before pulling
ollama pull llama3.2:7b-q4_k_m # ~4.5 GB, solid general purpose
ollama pull mistral:7b-q4_k_m # ~4.5 GB, efficient architecture
ollama pull qwen3:8b-q3_k_m # ~3.5 GB, good reasoning
```
Reducing the Context Window
The KV cache grows linearly with context length. A 7B model with 8192 tokens of context uses roughly 1 GB for the cache alone. Reducing to 2048 tokens frees 750 MB.
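The linear relationship follows from how the cache is laid out: one key vector and one value vector per layer, per KV head, per token. A rough calculation reproduces the figures above. It assumes a 7B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache. Those architecture numbers are assumptions for illustration, so check the model card for the model you actually run.

```bash
# Sketch: KV cache size in MiB for a given context length.
# Assumed architecture (not taken from a specific model card): 32 layers,
# 8 KV heads, head dimension 128, 2 bytes per value (f16 cache).
# kv_cache_mib is a hypothetical helper.
kv_cache_mib() {
  local ctx="$1" layers="${2:-32}" kv_heads="${3:-8}" head_dim="${4:-128}" bytes="${5:-2}"
  # Factor 2 covers keys and values, stored per layer, per KV head, per token.
  echo $(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 ))
}

kv_cache_mib 8192   # prints 1024
kv_cache_mib 2048   # prints 256
```

Models without grouped-query attention have as many KV heads as attention heads, so their cache can be several times larger at the same context length.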
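```bash
# Run with reduced context: set num_ctx as an API option (`ollama run` does not take it as a flag)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2:7b-q4_k_m", "prompt": "Hello", "options": {"num_ctx": 2048}}'
# For even tighter memory
curl http://localhost:11434/api/generate -d '{"model": "llama3.2:7b-q4_k_m", "prompt": "Hello", "options": {"num_ctx": 1024, "num_batch": 1}}'
# In an interactive `ollama run` session, `/set parameter num_ctx 2048` has the same effect
```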
Set this permanently in Modelfiles for models you use frequently:
```dockerfile
FROM llama3.2:7b-q4_k_m
PARAMETER num_ctx 2048
PARAMETER num_batch 1
```
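A Modelfile only takes effect once it is registered under a new model name with `ollama create`. The sketch below generates a Modelfile like the one above and shows the follow-up commands. The helper `write_lowmem_modelfile` and the model name `llama-lowmem` are hypothetical.

```bash
# Sketch: generate a low-memory Modelfile, then register it with Ollama.
# write_lowmem_modelfile and the model name "llama-lowmem" are hypothetical.
write_lowmem_modelfile() {
  local base="$1" ctx="$2" out="$3"
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base" "$ctx" > "$out"
}

# Usage:
#   write_lowmem_modelfile llama3.2:7b-q4_k_m 2048 ./Modelfile
#   ollama create llama-lowmem -f ./Modelfile
#   ollama run llama-lowmem
```

Running the derived model by name then applies the reduced context on every invocation without further options.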
Layer Offloading Strategies
On systems with integrated GPUs or small discrete GPUs, offloading even a few layers to the GPU can speed things up while keeping most memory free for the system.
```bash
# Offload 10 layers to GPU, keep rest in system RAM (num_gpu is an API option, not an `ollama run` flag)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2:7b-q4_k_m", "prompt": "Hello", "options": {"num_gpu": 10}}'
# Check GPU memory before and after (NVIDIA GPUs only)
nvidia-smi --query-gpu=memory.used --format=csv
```
If GPU memory is limited (4 GB or less), offloading too many layers can cause OOM errors. Start with `num_gpu` set to 5 and increase until you hit the limit.
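Instead of guessing the starting value, a rough estimate can be derived from free VRAM and the model size. The sketch below assumes every layer is the same size and holds back a reserve for the KV cache and driver overhead. Both are simplifications, and `layers_that_fit` is a hypothetical helper.

```bash
# Sketch: estimate how many layers fit in free VRAM.
# Assumptions: all layers are the same size, and reserve_mib is held back for
# the KV cache and driver overhead. layers_that_fit is a hypothetical helper.
layers_that_fit() {
  local free_vram_mib="$1" model_mib="$2" n_layers="$3" reserve_mib="${4:-512}"
  local usable=$(( free_vram_mib - reserve_mib ))
  if [ "$usable" -le 0 ]; then echo 0; return; fi
  local fit=$(( usable * n_layers / model_mib ))
  # Never report more layers than the model has.
  if [ "$fit" -gt "$n_layers" ]; then fit="$n_layers"; fi
  echo "$fit"
}

layers_that_fit 2048 4500 32   # 2 GiB free, 4.5 GB model, 32 layers: prints 10
```

Treat the result as a first guess and confirm with actual memory readings, since real layers and buffers are not perfectly uniform.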
Managing Swap Effectively
Linux swap can extend usable memory at a speed cost. Configure it specifically for inference:
```bash
# Check current swap
free -h
# Create an 8 GB swap file for inference workloads
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```
Set swappiness lower so the kernel prefers RAM but still uses swap gracefully:
```bash
sudo sysctl vm.swappiness=10
```
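Both `swapon` and `sysctl` as run above last only until the next reboot. To make them persistent, the swap file needs an entry in `/etc/fstab` and the swappiness value needs a file under `/etc/sysctl.d/`. A small idempotent helper avoids duplicate entries when the setup is re-run. The name `ensure_line` and the file name `99-swappiness.conf` are illustrative.

```bash
# Sketch: append a line to a file only if it is not already present.
# ensure_line is a hypothetical helper; run the usage lines as root.
ensure_line() {
  local line="$1" file="$2"
  # -x matches whole lines, -F treats the pattern as a fixed string.
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (as root):
#   ensure_line '/swapfile none swap sw 0 0' /etc/fstab
#   ensure_line 'vm.swappiness=10' /etc/sysctl.d/99-swappiness.conf
```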
Model Selection Guide for Low-Memory Setups
Not all model families are created equal on constrained hardware:
| Model | Size | RAM needed (Q4_K_M) | Best for |
|-------|------|--------------------|----------|
| Phi-4 | 3.8B | ~3 GB | Reasoning, code, instruction following |
| Gemma 3 | 4B | ~3.5 GB | General chat, summarisation |
| Qwen 3 | 1.5B/8B | ~1.5 GB / ~6 GB | Speed / quality trade-off |
| Llama 3.2 | 1B/7B | ~1 GB / ~5 GB | Balanced general purpose |
| Mistral | 7B | ~4.5 GB | Efficient architecture, good quality |
For a deeper comparison, see Mistral vs Llama vs Qwen.
Daily Workflow Optimisations
Batch processing
When processing multiple documents, batch them into a single prompt instead of making separate requests. The shared instructions are then processed once rather than once per request. Make sure the combined prompt still fits within your context window:
```bash
# Instead of 10 separate calls, combine into one
ollama run qwen3:1.5b-q4_k_m <<EOF
Summarise each of the following items in one sentence each:
1. [item 1]
2. [item 2]
...
EOF
```
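When the items come from files or variables, the numbered prompt can be assembled by a small function and piped to the model. This is a sketch only, and `build_batch_prompt` is a hypothetical helper.

```bash
# Sketch: build one numbered prompt from several items.
# build_batch_prompt is a hypothetical helper; each argument is one item.
build_batch_prompt() {
  local i=1 item
  echo "Summarise each of the following items in one sentence each:"
  for item in "$@"; do
    printf '%d. %s\n' "$i" "$item"
    i=$(( i + 1 ))
  done
}

# Usage:
#   build_batch_prompt "$(cat a.txt)" "$(cat b.txt)" | ollama run qwen3:1.5b-q4_k_m
```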
Session reuse
Keep Ollama running between requests instead of starting fresh each time. The model stays loaded in memory (by default for five minutes after the last request), so follow-up queries skip the load time:
```bash
# Ollama keeps the model warm between queries in an interactive session
ollama run qwen3:1.5b-q4_k_m
# Type queries one after another without restarting
```
Monitoring Memory Usage
```bash
# Real-time memory monitoring during inference
watch -n 2 'free -h | head -3'
# Loaded models, their size, and the CPU/GPU split
watch -n 2 'ollama ps'
```
If swap usage climbs above 2 GB during generation, the model is too large for your hardware. Drop to a smaller quantisation or a smaller model.
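That check can be scripted instead of read off `free` by eye. The sketch below assumes the Linux `/proc/meminfo` format. The helpers `swap_used_mib` and `swap_too_high` are hypothetical, and the 2048 MiB default mirrors the 2 GB rule of thumb above.

```bash
# Sketch: report swap in use and flag it when above a threshold.
# Assumptions: Linux /proc/meminfo format; swap_used_mib and swap_too_high
# are hypothetical helpers.
swap_used_mib() {
  # SwapTotal and SwapFree are reported in kB; the difference is swap in use.
  awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 }
       END { printf "%d\n", (t - f) / 1024 }' "${1:-/proc/meminfo}"
}

swap_too_high() {
  local limit_mib="${1:-2048}" meminfo="${2:-/proc/meminfo}"
  [ "$(swap_used_mib "$meminfo")" -gt "$limit_mib" ]
}

# Usage:
#   swap_too_high && echo "Swap above 2 GB: try a smaller model or quantisation"
```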
Conclusion
Low-memory hardware can run local AI successfully with the right approach. Prioritise small models with aggressive quantisation, keep context windows minimal, and optimise swap configuration. The Phi-4 and Qwen 3 1.5B families are excellent starting points for 8 GB machines, while 16 GB systems can comfortably run 7B models at Q4_K_M with a modest context window.
FAQ
Can I run a 7B model on 8 GB RAM?
Yes, with Q3_K_M quantisation and a context window of 2048 tokens or fewer. Expect slower generation and avoid running other memory-intensive applications simultaneously.
Does CPU-only inference use less RAM than GPU?
Not inherently. CPU-only inference loads the entire model into system RAM. GPU offloading moves part of it to VRAM, which frees system RAM only if the GPU has dedicated memory. An integrated GPU shares system RAM, so offloading to it saves nothing.
Will swap make inference usable?
Swap prevents crashes but slows generation significantly. Aim to keep the model within physical RAM for acceptable token generation speed.