Ollama Power User Tips: Advanced Usage for Local Model Management
Go beyond ollama pull and ollama run with advanced features: custom Modelfiles, parallel requests, API usage, model management, and automation scripts.

If you have been using Ollama to pull and run models from the command line, you are only using the surface layer. Ollama's real power shows up when you start writing custom Modelfiles, running parallel inference, scripting the API, and managing multiple models efficiently.
Start with the basics in How to Run Llama 3 Locally with Ollama if you have not set it up yet.
Custom Modelfiles
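A Modelfile lets you customise how a model behaves without modifying the original weights. It is a plain text file that names the base model and can set the system prompt, adjust sampling parameters such as temperature, and override the prompt template:
```dockerfile
# Base model; its weights are reused unchanged.
FROM llama3
SYSTEM "You are a technical writing assistant. Write clearly and concisely. Use UK English."
# Lower temperature makes output more predictable.
PARAMETER temperature 0.4
PARAMETER top_p 0.9
# TEMPLATE is optional: llama3 already ships with its own prompt template.
# An override must follow the base model's chat format and include {{ .System }},
# or the SYSTEM prompt is dropped. The [INST] form suits Mistral-style models, not llama3:
# TEMPLATE "[INST] {{ .System }} {{ .Prompt }} [/INST]"
```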
Build it with `ollama create my-writer -f ./Modelfile` and run it like any other model. Modelfiles are version-controllable, so you can keep custom model configurations in your dotfiles or project repos.
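If you maintain several variants, generating the Modelfile from a script can keep them consistent. The following is a minimal sketch: `render_modelfile` and its arguments are hypothetical names, not part of Ollama, and it only emits the FROM, SYSTEM, and PARAMETER instructions.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Return Modelfile text with FROM, SYSTEM and PARAMETER lines.

    Hypothetical helper for illustration; not part of Ollama.
    """
    # Triple quotes let the system prompt contain ordinary double quotes.
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    # One PARAMETER line per setting, sorted so the output is stable in diffs.
    for name, value in sorted(params.items()):
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="llama3",
    system="You are a technical writing assistant. Use UK English.",
    params={"temperature": 0.4, "top_p": 0.9},
)
print(text)
```

Write the result to a file and pass it to `ollama create` with `-f` as before.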
Parallel requests
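Depending on its version and the memory available, Ollama may handle only one request per model at a time and queue the rest. For concurrent requests, such as multiple users or a script sending batches, set the OLLAMA_NUM_PARALLEL environment variable before starting the server: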
```bash
export OLLAMA_NUM_PARALLEL=4
ollama serve
```
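This is essential when you use Ollama as a backend for Open WebUI with multiple team members. Without parallel requests, one long response blocks everyone else. If Ollama runs as a systemd service, set the variable in the service's environment instead (for example with `systemctl edit ollama.service`), because an `export` in your shell does not reach the service.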
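To see the effect from the client side, a script can keep several requests in flight at once. This sketch assumes the default endpoint on localhost:11434 and the `llama3` model; `generate` and `run_batch` are illustrative names, and only the Python standard library is used.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint


def generate(prompt: str, model: str = "llama3") -> str:
    """Send one non-streaming request and return the generated text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


def run_batch(prompts, send=generate, workers=4):
    """Run `send` over all prompts with up to `workers` requests in flight.

    Results come back in the same order as `prompts`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))


# Example (needs a running server):
# answers = run_batch(["Define latency.", "Define throughput."])
```

Keeping `workers` at or below OLLAMA_NUM_PARALLEL is a reasonable starting point, since extra requests are likely to wait in the server's queue rather than run sooner.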
API automation
Ollama's API is straightforward and well-documented. You can call it from scripts, cron jobs, or workflow tools like n8n:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Summarise this text in three sentences",
  "stream": false
}'
```
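The curl call sets `"stream": false` to get a single JSON object. With streaming left on, the server returns newline-delimited JSON chunks instead, each carrying a fragment in `response` and a final chunk with `"done": true`. A minimal Python sketch of reading that stream follows; `iter_tokens` and `stream_generate` are illustrative names, and the endpoint and model are the same assumptions as in the curl example.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default endpoint


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chunks."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final chunk carries "done": true


def stream_generate(prompt: str, model: str = "llama3") -> Iterator[str]:
    """Open a streaming request and yield fragments as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_tokens(resp)  # the response object iterates line by line


# Example (needs a running server):
# for piece in stream_generate("Summarise this text in three sentences"):
#     print(piece, end="", flush=True)
```

Streaming suits interactive tools where partial output matters; for cron jobs and batch scripts the non-streaming form is usually simpler.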
For application integration, see Build Your Own AI Assistant with n8n.
Model management tricks
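Delete unused models to free disk space with `ollama rm modelname`, list all downloaded models with `ollama list`, and see which models are currently loaded in memory with `ollama ps`.
Ollama keeps a model loaded in memory for a while after its last request (five minutes by default). If you switch between models frequently, increase OLLAMA_KEEP_ALIVE (for example `30m`, or `-1` to keep models loaded indefinitely) to avoid reload delays, or decrease it to free VRAM faster. The same setting can be passed per request as `keep_alive` in the API body.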
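When disk space runs low, it helps to know which models are largest before deleting anything. The sketch below reads the model list from the `/api/tags` endpoint and sorts it by size; `largest_models` and `fetch_tags` are illustrative names, and the code assumes each entry carries `name` and `size` (in bytes) fields.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # lists locally stored models


def largest_models(tags: dict, top: int = 3) -> list:
    """Return (name, size in GB) for the biggest models, largest first."""
    models = sorted(tags.get("models", []), key=lambda m: m["size"], reverse=True)
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in models[:top]]


def fetch_tags(url: str = TAGS_URL) -> dict:
    """Fetch the model list from a running Ollama server."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Example (needs a running server):
# for name, gb in largest_models(fetch_tags()):
#     print(f"{name}: {gb} GB")
```

The names it prints can be passed straight to `ollama rm`.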
GPU and memory tuning
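Set OLLAMA_GPU_OVERHEAD to reserve VRAM (in bytes, per GPU) for the operating system and other applications. Ollama normally decides how many layers to offload to the GPU on its own; to override that for a model, set the `num_gpu` parameter, either as `PARAMETER num_gpu` in a Modelfile or under `options` in an API request. On multi-GPU NVIDIA systems, CUDA_VISIBLE_DEVICES limits which GPUs Ollama can use. Use OLLAMA_FLASH_ATTENTION=1 to reduce memory usage for long contexts.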
Monitor VRAM during inference with `nvidia-smi` or `ollama ps` to find the right balance.
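For repeated measurements, `nvidia-smi` can emit machine-readable CSV that a script can log while you change settings. This is a rough sketch for NVIDIA GPUs only; `parse_vram` and `read_vram` are illustrative names, and the parser assumes every field is a plain integer, which may not hold on devices that report `[N/A]`.

```python
import subprocess

# With "nounits", memory values are reported in MiB.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_vram(csv_text: str) -> list:
    """Parse nvidia-smi CSV rows into dicts with usage in MiB and percent."""
    gpus = []
    for row in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in row.split(","))
        gpus.append({"gpu": index, "used_mib": used, "total_mib": total,
                     "percent": round(100 * used / total, 1)})
    return gpus


def read_vram() -> list:
    """Run nvidia-smi once and return per-GPU memory usage."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_vram(out.stdout)


# Example (needs an NVIDIA GPU and driver):
# print(read_vram())
```

Sampling before and during a long generation shows how much headroom a given combination of context length and parallel setting leaves.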
Conclusion
Ollama is deceptively simple at first glance. The Modelfile system, parallel request support, and scripting API turn it into a serious local inference platform. Invest time in learning these features: they make the difference between Ollama as a toy and Ollama as infrastructure.
FAQ
Can Modelfiles change the model weights?
No. Modelfiles only change system prompts, parameters, and templates. The underlying weights remain unchanged.
Does parallel inference require more VRAM?
Yes. Ollama allocates context memory for each parallel slot, so VRAM use grows with OLLAMA_NUM_PARALLEL and with context length. Test with conservative settings and monitor memory.
How do I make Ollama start on boot?
Use your operating system's service manager. On Linux, run `sudo systemctl enable ollama` if your installation set up a systemd service.


