Tutorials

10 Essential Ollama Tips for Power Users

Advanced Ollama workflows for parallel models, custom modelfiles, environment tuning, and integration with external tools.

Robson PereiraMay 31, 20269 min read
Ollama terminal interface with multiple running models.

10 Essential Ollama Tips for Power Users

If you have already installed Ollama and pulled a model, you know the basics. These tips will help you get more from the runtime — parallel models, custom configurations, and deeper integration with your local AI stack.

1. Run multiple models simultaneously

Ollama can keep multiple models loaded at the same time by adjusting the environment variable `OLLAMA_NUM_PARALLEL`. Set it to 2 or more and the server will handle concurrent requests across different model slots.

For hardware planning, read Best Hardware for Self-Hosted AI before increasing parallelism on a tight machine.

2. Create custom modelfiles

Modelfiles let you customise a model without re-downloading weights. You can set system prompts, adjust temperature and top-p, add template instructions, and define stop sequences. Save your modelfiles in a version-controlled directory so configurations are repeatable.

3. Use the API for everything

Ollama's REST API at `http://localhost:11434/api` gives you endpoints for chat, completion, embedding, model listing, and pulling. This is how you connect Ollama to external interfaces, automation tools, and custom scripts.

For interface pairing, compare Open WebUI vs AnythingLLM.

4. Tune context length per request

You can set `num_ctx` in each API call or modelfile, overriding the default context window. Use shorter contexts for speed and longer contexts for document-heavy tasks. The right context length depends on your GPU memory and the model's maximum.

5. Pin specific model versions

Ollama automatically tags model versions. Use `ollama pull llama3.2:3b-instruct-q4_K_M` or similar precise tags so your pinned versions stay stable until you explicitly update them.

6. Set up environment variables

`OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_KEEP_ALIVE`, and `OLLAMA_MAX_LOADED_MODELS` tune the server behaviour. Keep `OLLAMA_KEEP_ALIVE` short if memory is tight, or extend it for frequently used models.

7. Use embedding endpoints for RAG

Ollama supports embedding generation through `/api/embed`. This removes the need for a separate embedding service when building a local RAG pipeline. For pipeline design, see Build a Local RAG Pipeline That Actually Answers Questions.

8. Watch logs during troubleshooting

`journalctl -u ollama` or `ollama serve` in verbose mode reveals loading times, memory allocation, and error messages. Logs are often the fastest path to diagnosing slow responses or model load failures.

9. Offload layers to GPU

Set `OLLAMA_GPU_OVERHEAD` and use modelfile parameters like `gpu_layers` to control how many model layers run on GPU. This matters when you have limited VRAM and need to balance speed and memory.

10. Script regular model pruning

Models accumulate over time. Write a weekly script to list installed models and remove those you no longer use. Keep a tidy model cache to save disk space and reduce confusion.

Conclusion

Ollama is simple by design, but the runtime supports considerable depth. These tips turn basic usage into a reliable, repeatable local AI foundation that integrates well with the rest of your stack.

FAQ

Can Ollama run in Docker?

Yes. Docker deployment is common, especially alongside Open WebUI and other containers.

How do I migrate models to another machine?

Copy the `~/.ollama/models` directory, or re-pull the same tags on the new machine.

Does Ollama support GPU acceleration on Windows?

Yes, with NVIDIA GPUs via CUDA. ROCm support is available for AMD GPUs on Linux.

Related articles