
How to Choose the Right Local Model Size

Match model size to your hardware, latency target, and task before you chase benchmark hype.

Robson Pereira · May 27, 2026 · 8 min read


Model size is one of the easiest things to overthink. In practice, the right choice is usually the model that fits your hardware comfortably and still answers your real prompts well enough to be useful.

Start from the task, not the leaderboard

If you want quick drafting, classification, or summarisation, a smaller model is often enough. If you need deeper reasoning or broader context, step up one size at a time and measure what each step costs in latency and memory.

See "How to Run Llama 3 Locally with Ollama" for how a practical runtime changes the experience.

Fit the model to the machine

The best model on paper is useless if it causes swap storms or forces constant offloading. Match the size to your GPU VRAM or available system memory, then leave enough headroom for the rest of the stack.
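A rough fit check can be sketched in a few lines of Python. The weight footprint is approximately the parameter count times the bits per weight, divided by eight. The 20% overhead for context cache and runtime buffers and the 2 GB headroom are illustrative assumptions, not measured values, so treat the result as a first filter and confirm against real memory usage on your machine.

```python
def estimate_model_memory_gb(
    params_billions: float,
    bits_per_weight: float,
    overhead_fraction: float = 0.2,  # assumed allowance for cache and buffers
) -> float:
    """Rough memory needed to load a model, in GB (1 GB = 1e9 bytes)."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_fraction) / 1e9


def fits(
    params_billions: float,
    bits_per_weight: float,
    available_gb: float,
    headroom_gb: float = 2.0,  # assumed reserve for the rest of the stack
) -> bool:
    """True if the estimated footprint plus headroom fits in available memory."""
    needed = estimate_model_memory_gb(params_billions, bits_per_weight)
    return needed + headroom_gb <= available_gb


if __name__ == "__main__":
    # An 8B-parameter model at 4 bits per weight versus 16 bits, on 8 GB of memory.
    print(fits(8, 4, available_gb=8))
    print(fits(8, 16, available_gb=8))
```

If the check fails, the usual options are a smaller model or a lower-precision quantisation of the same model, both of which reduce the first term in the estimate.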

Keep a small benchmark set

Five or six prompts that reflect your own work are enough to expose whether a model is too slow, too vague, or too expensive to run all day.
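A small harness along these lines is enough to time a prompt set. This is a minimal sketch: `generate` is a hypothetical callable standing in for whatever runtime you use (it takes a prompt string and returns the answer string), and the sample prompts are placeholders for your own. Latency is measured automatically; whether an answer is too vague still needs a human read.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def run_benchmark(
    generate: Callable[[str], str], prompts: List[str]
) -> List[Dict[str, object]]:
    """Run each prompt once and record the answer and wall-clock time."""
    results = []
    for prompt in prompts:
        start = time.perf_counter()
        answer = generate(prompt)
        elapsed = time.perf_counter() - start
        results.append({"prompt": prompt, "answer": answer, "seconds": elapsed})
    return results


def summarise(results: List[Dict[str, object]]) -> Dict[str, float]:
    """Collapse per-prompt timings into a few headline numbers."""
    times = [float(r["seconds"]) for r in results]
    return {
        "count": len(times),
        "mean_seconds": mean(times),
        "max_seconds": max(times),
    }


if __name__ == "__main__":
    # Placeholder model: replace with a call into your local runtime.
    def fake_generate(prompt: str) -> str:
        return prompt.upper()

    my_prompts = [
        "Summarise this changelog in two sentences.",
        "Classify this support ticket as bug, question, or feature request.",
    ]
    report = run_benchmark(fake_generate, my_prompts)
    print(summarise(report))
```

Running the same prompt list against two model sizes gives a like-for-like comparison of speed, and the saved answers can be read side by side for quality.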

Use one model per job where possible

Chat, coding, embeddings, and image analysis all have different sweet spots. A smaller specialist can outperform a larger general model when the task is narrow and repetitive.
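One way to keep a model per job is a plain lookup table with a general fallback. The sketch below is illustrative only: every model name is a hypothetical placeholder, not a real model identifier, so substitute whatever you have installed.

```python
# Hypothetical placeholder names; replace with the models you actually run.
MODEL_FOR_TASK = {
    "chat": "small-chat-model",
    "coding": "code-specialist-model",
    "embeddings": "embedding-model",
    "image_analysis": "vision-model",
}
DEFAULT_MODEL = "general-model"


def pick_model(task: str) -> str:
    """Return the model assigned to a task, or the general fallback."""
    return MODEL_FOR_TASK.get(task.strip().lower(), DEFAULT_MODEL)


if __name__ == "__main__":
    print(pick_model("coding"))
    print(pick_model("translation"))  # no specialist listed, so the fallback is used
```

Keeping the mapping in one place makes it easy to swap a specialist for a different size after re-running your prompt set.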

Pair your selection with "Open WebUI vs AnythingLLM" if you want to test how different interfaces affect model use.

Conclusion

Choose the model that works on your hardware, for your workload, with your tolerance for latency. The right size is usually smaller than your instincts first suggest.
