Fine-Tuning Ollama for Maximum Performance on Low-Memory Hardware

Run local LLMs on machines with 8-16 GB RAM using quantisation, context reduction, layer offloading, and Ollama configuration tricks.

Robson Pereira · May 31, 2026 · 11 min read
Ollama terminal showing memory-optimised model running on a mini PC.

Running local AI on a machine with 8 GB or 16 GB of RAM is entirely possible, but you cannot throw just any model at it. The trick is understanding how Ollama handles memory, which configuration knobs actually help, and where to compromise without making output useless.

This guide covers practical techniques for getting usable inference on older laptops, mini PCs, and budget homelab hardware.

Understanding Memory Constraints

Local LLM inference uses three memory pools:

| Memory pool | What it holds | Impact on low-RAM systems |
|-------------|---------------|---------------------------|
| **Model weights** | The neural network parameters | Fixed per quantisation level |
| **KV cache** | Attention context during generation | Scales with context length |
| **System overhead** | OS, other services, Ollama runtime | Varies by 1-3 GB |

On an 8 GB machine, you have approximately 5-6 GB available after the OS and basic services. That means your model needs to fit in roughly 4 GB to leave room for a small context window.
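
The budget arithmetic above can be checked against a real machine. The sketch below is only a rough illustration, not how Ollama itself accounts for memory: the function names are made up for this example, it assumes a Linux system that reports `MemAvailable` in `/proc/meminfo`, and the KV-cache reserve is a figure you pick yourself.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Ollama) for a rough RAM budget.

# Print MemAvailable in MB. Reads /proc/meminfo unless another file is given.
mem_available_mb() {
  local file="${1:-/proc/meminfo}"
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "$file"
}

# Largest model size (MB) that still leaves the chosen KV-cache reserve free.
# Args: available_mb kv_reserve_mb
model_budget_mb() {
  local avail_mb="$1" kv_reserve_mb="$2"
  local budget=$(( avail_mb - kv_reserve_mb ))
  if (( budget < 0 )); then
    budget=0
  fi
  echo "$budget"
}
```

For example, `model_budget_mb "$(mem_available_mb)" 1536` prints how many MB remain for model weights after setting aside 1.5 GB for context. With 5.5 GB (5632 MB) available, that leaves 4096 MB, which matches the rough 4 GB figure.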

For hardware recommendations that match these constraints, see Best Hardware for Self-Hosted AI.

Choosing the Right Quantisation

Quantisation is your most powerful lever.

| Quantisation | Bits per weight | 3B model size | 7B model size | 8 GB viability |
|-------------|----------------|---------------|---------------|----------------|
| Q4_K_M | 4.5 bits | ~2 GB | ~4.5 GB | 7B possible with small context |
| Q3_K_M | 3.5 bits | ~1.5 GB | ~3.5 GB | 7B comfortable |
| Q2_K | 2.5 bits | ~1 GB | ~2.5 GB | 7B with room for context |
| IQ2_XXS | 2.1 bits | ~0.9 GB | ~2 GB | Very tight but works |
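
As a sanity check on the table, a weights-only size can be estimated as parameters times bits per weight, divided by eight. The helper below is a hypothetical sketch that takes the bits-per-weight values from the table as given. Real model files tend to come out somewhat larger than this estimate, as the table's own figures suggest, so treat the result as a floor rather than an exact download size.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rough weights-only size in MB (decimal).
# size_mb = params_millions * bits_per_weight / 8
# Args: params_millions bits_per_weight (decimals allowed, e.g. 4.5)
weights_size_mb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%d\n", p * b / 8 }'
}
```

`weights_size_mb 7000 4.5` gives 3937 MB for a 7B model at 4.5 bits, and `weights_size_mb 3000 4.5` gives 1687 MB for a 3B model.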

Practical recommendations by RAM

**8 GB RAM:**

```bash
ollama pull qwen3:1.5b-q4_k_m   # ~1 GB, fast responses
ollama pull phi-4:3.8b-q3_k_m   # ~1.5 GB, good quality
ollama pull gemma-3:4b-q2_k     # ~1.5 GB, decent for its size
```

**16 GB RAM:**

```bash
ollama pull llama3.2:7b-q4_k_m  # ~4.5 GB, solid general purpose
ollama pull mistral:7b-q4_k_m   # ~4.5 GB, efficient architecture
ollama pull qwen3:8b-q3_k_m     # ~3.5 GB, good reasoning
```

Reducing the Context Window

The KV cache grows linearly with context length. For a 7B-class model that uses grouped-query attention and a 16-bit cache, 8192 tokens of context cost roughly 1 GB for the cache alone. Reducing to 2048 tokens frees about 750 MB.
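
The 1 GB figure can be reproduced with a short calculation. The sketch below assumes a particular architecture that the article does not specify: 32 layers, 8 key/value heads of dimension 128, and 2 bytes per cached value. Models with different layer counts, full multi-head attention, or a quantised cache will give different numbers. The function name is made up for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: KV cache size in bytes.
# bytes = 2 (keys and values) * layers * kv_heads * head_dim * tokens * bytes_per_value
# Args: layers kv_heads head_dim tokens [bytes_per_value, default 2]
kv_cache_bytes() {
  local layers="$1" kv_heads="$2" head_dim="$3" tokens="$4" bytes_per_value="${5:-2}"
  echo $(( 2 * layers * kv_heads * head_dim * tokens * bytes_per_value ))
}

# Same figure in MiB for easier reading.
kv_cache_mib() {
  echo $(( $(kv_cache_bytes "$@") / 1024 / 1024 ))
}
```

Under those assumptions, `kv_cache_mib 32 8 128 8192` gives 1024 MiB and `kv_cache_mib 32 8 128 2048` gives 256 MiB, a saving of 768 MiB.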

```bash
# Run with reduced context: set it at the >>> prompt once the model loads
ollama run llama3.2:7b-q4_k_m
# >>> /set parameter num_ctx 2048

# For even tighter memory
# >>> /set parameter num_ctx 1024
# >>> /set parameter num_batch 1
```

Set this permanently in Modelfiles for models you use frequently:

```dockerfile
FROM llama3.2:7b-q4_k_m
PARAMETER num_ctx 2048
PARAMETER num_batch 1
```
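
A Modelfile only takes effect once a model has been built from it with `ollama create`. The sketch below shows one possible way to script that step. The helper function and the model name `llama-lowmem` are made up for this example, and the base tag is simply the one used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a Modelfile that pins num_ctx for a base model.
# Args: base_model num_ctx output_path
write_modelfile() {
  local base="$1" num_ctx="$2" out="$3"
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base" "$num_ctx" > "$out"
}

# Usage (run manually; "llama-lowmem" is an arbitrary name):
#   write_modelfile llama3.2:7b-q4_k_m 2048 ./Modelfile
#   ollama create llama-lowmem -f ./Modelfile
#   ollama run llama-lowmem
```

After the `ollama create` step, running `llama-lowmem` starts with the reduced context every time, with nothing to set per session.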

Layer Offloading Strategies

On systems with integrated GPUs or small discrete GPUs, offloading even a few layers to the GPU can speed things up while keeping most memory free for the system.

```bash
# Offload 10 layers to GPU, keep rest in system RAM (set at the >>> prompt)
ollama run llama3.2:7b-q4_k_m
# >>> /set parameter num_gpu 10

# Check GPU memory before and after
nvidia-smi --query-gpu=memory.used --format=csv
```

If GPU memory is limited (4 GB or less), offloading too many layers can cause out-of-memory errors. Start with `num_gpu` set to 5 and increase it until loading fails or VRAM is nearly full, then back off by a layer or two.
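
A rough starting point for the layer count can be estimated instead of guessed. The sketch below makes simplifying assumptions: all layers are the same size, and a fixed reserve covers the KV cache and scratch buffers. Neither is exactly true in practice, so the result is a first guess to refine by trial. The function name and the 512 MB default reserve are made up for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate how many layers fit in free VRAM.
# Assumes equally sized layers and a fixed reserve for cache and buffers.
# Args: model_mb n_layers vram_free_mb [reserve_mb, default 512]
gpu_layers_estimate() {
  local model_mb="$1" n_layers="$2" vram_free_mb="$3" reserve_mb="${4:-512}"
  local usable=$(( vram_free_mb - reserve_mb ))
  if (( usable <= 0 )); then
    echo 0
    return
  fi
  local layers=$(( usable * n_layers / model_mb ))
  if (( layers > n_layers )); then
    layers=$n_layers
  fi
  echo "$layers"
}
```

For a 4500 MB model with 32 layers and 2048 MB of free VRAM, `gpu_layers_estimate 4500 32 2048` suggests 10 layers. On NVIDIA hardware, the free figure can be read with `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`.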

Managing Swap Effectively

Linux swap can extend usable memory at a speed cost. Configure it specifically for inference:

```bash
# Check current swap
free -h

# Increase swap to 8 GB for inference workloads
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```

Set swappiness lower so the kernel prefers RAM but still uses swap gracefully:

```bash
sudo sysctl vm.swappiness=10
```
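
Both `swapon` and `sysctl` as used above last only until the next reboot. One common way to make them persistent is an `/etc/fstab` entry and a drop-in file under `/etc/sysctl.d/`. The sketch below only prints the lines so they can be reviewed before being written. The function names and the `99-swappiness.conf` file name are made up for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: print persistent config lines for review.

# fstab entry for a swap file. Args: swapfile_path
swap_fstab_entry() {
  printf '%s none swap sw 0 0\n' "$1"
}

# sysctl drop-in content. Args: swappiness_value
swappiness_conf() {
  printf 'vm.swappiness=%s\n' "$1"
}

# Usage (run manually after checking the output):
#   swap_fstab_entry /swapfile | sudo tee -a /etc/fstab
#   swappiness_conf 10 | sudo tee /etc/sysctl.d/99-swappiness.conf
```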

Model Selection Guide for Low-Memory Setups

Not all model families are created equal on constrained hardware:

| Model | Size | RAM needed (Q4_K_M) | Best for |
|-------|------|--------------------|----------|
| Phi-4 | 3.8B | ~3 GB | Reasoning, code, instruction following |
| Gemma 3 | 4B | ~3.5 GB | General chat, summarisation |
| Qwen 3 | 1.5B/8B | ~1.5 GB / ~6 GB | Speed / quality trade-off |
| Llama 3.2 | 1B/7B | ~1 GB / ~5 GB | Balanced general purpose |
| Mistral | 7B | ~4.5 GB | Efficient architecture, good quality |

For a deeper comparison, see Mistral vs Llama vs Qwen.

Daily Workflow Optimisations

Batch processing

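When processing multiple documents, batch them into a single prompt instead of making separate requests. The shared instructions are processed once and the whole batch runs against one KV cache, provided the combined prompt still fits in the context window: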

```bash
# Instead of 10 separate calls, combine into one
ollama run qwen3:1.5b-q4_k_m <<EOF
Summarise each of the following items in one sentence each:
1. [item 1]
2. [item 2]
...
EOF
```
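
When the items live in files, the combined prompt can be assembled by a small script rather than by hand. The sketch below is one possible approach. The function name is made up for this example and the instruction line is the one from the heredoc above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one prompt that numbers each file's contents
# under a shared instruction. Args: one or more file paths
build_batch_prompt() {
  local i=1 f
  echo "Summarise each of the following items in one sentence each:"
  for f in "$@"; do
    printf '\n%d. %s\n' "$i" "$(cat "$f")"
    i=$(( i + 1 ))
  done
}

# Usage (the notes/ directory is a placeholder):
#   build_batch_prompt notes/*.txt | ollama run qwen3:1.5b-q4_k_m
```

The trade-off is context size. Ten long documents may not fit in a 2048-token window, so smaller batches can work better on tight memory.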

Session reuse

Keep Ollama running between requests instead of starting fresh each time. The model stays loaded in memory for a keep-alive period (five minutes by default) after the last request:

```bash
# Ollama keeps the model warm between interactive sessions
ollama run qwen3:1.5b-q4_k_m
# Type queries one after another without restarting
```
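
For scripted use, the keep-alive period can be extended through the `keep_alive` field of the HTTP API, so that the model is not unloaded between widely spaced requests. The sketch below only builds the request body. The function name is made up for this example, and it assumes the server is listening on the default `localhost:11434`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: JSON body for a preload request that keeps the model resident.
# Args: model keep_alive (a duration string with a unit, e.g. 30m or 24h)
preload_payload() {
  printf '{"model": "%s", "keep_alive": "%s"}\n' "$1" "$2"
}

# Usage (assumes the default local server address):
#   curl -s http://localhost:11434/api/generate \
#     -d "$(preload_payload qwen3:1.5b-q4_k_m 30m)"
```

A longer keep-alive holds the RAM for longer too, so on an 8 GB machine it is worth choosing a duration that matches how often the model is actually used.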

Monitoring Memory Usage

```bash
# Real-time memory monitoring during inference
watch -n 2 'free -h | head -3'

# Loaded models, their size, and CPU/GPU split
watch -n 2 'ollama ps'
```

If swap usage climbs above 2 GB during generation, the model is too large for your hardware. Drop to a smaller quantisation or a smaller model.
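
The 2 GB rule of thumb can be turned into a scripted check. The sketch below is a rough illustration with made-up function names. It assumes a Linux `/proc/meminfo`, and it measures swap use across the whole system, so pages swapped out by other processes count too.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a swap-usage check.

# Print swap in use (MB). Reads /proc/meminfo unless another file is given.
swap_used_mb() {
  local file="${1:-/proc/meminfo}"
  awk '/^SwapTotal:/ { t = $2 }
       /^SwapFree:/  { f = $2 }
       END { printf "%d\n", (t - f) / 1024 }' "$file"
}

# Exit status 0 when swap use exceeds the limit.
# Args: used_mb [limit_mb, default 2048]
swap_over_limit() {
  local used_mb="$1" limit_mb="${2:-2048}"
  (( used_mb > limit_mb ))
}

# Usage:
#   if swap_over_limit "$(swap_used_mb)"; then
#     echo "Swap above 2 GB: try a smaller model or quantisation"
#   fi
```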

Conclusion

Low-memory hardware can run local AI successfully with the right approach. Prioritise small models with aggressive quantisation, keep context windows minimal, and optimise swap configuration. The Phi-4 and Qwen 3 1.5B families are excellent starting points for 8 GB machines, while 16 GB systems can comfortably run 7B models at Q4_K_M with a modest context window.

FAQ

Can I run a 7B model on 8 GB RAM?

Yes, with Q3_K_M quantisation and a context window of 2048 tokens or fewer. Expect slower generation and avoid running other memory-intensive applications simultaneously.

Does CPU-only inference use less RAM than GPU?

No. CPU-only inference loads the entire model into system RAM. GPU offloading moves part of it to VRAM, which frees system RAM only if the GPU has dedicated memory; an integrated GPU shares the same pool.

Will swap make inference usable?

Swap prevents crashes but slows generation significantly. Aim to keep the model within physical RAM for acceptable token generation speed.
