Ollama Power User Tips: Advanced Usage for Local Model Management

Go beyond ollama pull and ollama run with advanced features: custom Modelfiles, parallel requests, API usage, model management, and automation scripts.

Robson Pereira · May 31, 2026 · 8 min read

If you have only been using Ollama to pull and run models from the command line, you are only scratching the surface. Ollama's real power shows up when you start writing custom Modelfiles, running parallel inference, scripting the API, and managing multiple models efficiently.

Start with the basics in How to Run Llama 3 Locally with Ollama if you have not set it up yet.

Custom Modelfiles

A Modelfile lets you customise how a model behaves without modifying the original weights. Create a text file that names the base model with FROM, sets the system prompt, adjusts sampling parameters such as temperature, and optionally overrides the prompt template:

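```dockerfile
# Base model to build on (its weights are reused, not modified)
FROM llama3

# System prompt applied to every conversation
SYSTEM "You are a technical writing assistant. Write clearly and concisely. Use UK English."

# Sampling parameters: a lower temperature gives more deterministic output
PARAMETER temperature 0.4
PARAMETER top_p 0.9

# Optional: override the prompt template. llama3 already ships with its own
# template, so keep this line only if you need a different format.
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
```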

Build it with `ollama create my-writer -f ./Modelfile` and run it like any other model. Modelfiles are version-controllable, so you can keep custom model configurations in your dotfiles or project repos.
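Keeping the Modelfile in a repo pairs well with a small build script. The following is a minimal sketch, not an official workflow: the function names `write_modelfile` and `build_model` and the target directory are hypothetical, and the build step is skipped when the `ollama` CLI is not installed.

```bash
# Write the Modelfile from this section into DIR and print its path.
write_modelfile() {
  dir="$1"
  mkdir -p "$dir"
  cat > "$dir/Modelfile" <<'EOF'
FROM llama3
SYSTEM "You are a technical writing assistant. Write clearly and concisely. Use UK English."
PARAMETER temperature 0.4
PARAMETER top_p 0.9
EOF
  printf '%s\n' "$dir/Modelfile"
}

# Build model NAME from MODELFILE, but only if the ollama CLI is available.
build_model() {
  if command -v ollama >/dev/null 2>&1; then
    ollama create "$1" -f "$2"
  else
    echo "ollama not found; skipping build of $1" >&2
  fi
}

# Example:
#   build_model my-writer "$(write_modelfile ./my-writer)"
#   ollama run my-writer
```

Because the Modelfile is generated from the script, a change to the system prompt or parameters shows up as an ordinary diff in version control.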

Parallel requests

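How many requests Ollama serves at once for a loaded model depends on the version and on available memory, and it may handle them one at a time. To allow concurrent requests (multiple users, or a script sending batches), set the OLLAMA_NUM_PARALLEL environment variable before starting the server: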

```bash
# Allow up to four requests per model to be processed at the same time
export OLLAMA_NUM_PARALLEL=4
ollama serve
```

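This matters when you use Ollama as a backend for Open WebUI with multiple team members. Without parallel requests, one long response can block everyone else. If Ollama runs as a system service rather than from your shell, set the variable in the service configuration instead.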
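To see the effect of parallel slots, you can send several prompts at once from a script. The sketch below uses shell background jobs; `ask` and `fan_out` are hypothetical helper names, the output directory is arbitrary, and `ask` assumes prompts without double quotes, backslashes, or newlines.

```bash
# Base URL of the local Ollama server (default port used in this article).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Send one non-streaming prompt to llama3 and print the raw JSON reply.
# Assumption: the prompt contains no double quotes, backslashes, or newlines.
ask() {
  curl -s "$OLLAMA_URL/api/generate" \
    -d "{\"model\": \"llama3\", \"prompt\": \"$1\", \"stream\": false}"
}

# Run HANDLER once per prompt as background jobs, then wait for all of them.
# Output for the Nth prompt is written to OUTDIR/N.out.
fan_out() {
  handler="$1"
  outdir="$2"
  shift 2
  mkdir -p "$outdir"
  i=0
  for prompt in "$@"; do
    i=$((i + 1))
    "$handler" "$prompt" > "$outdir/$i.out" &
  done
  wait
}

# Example (needs a running server):
#   fan_out ask ./replies "Define latency" "Define throughput" "Define jitter"
```

Timing the example with OLLAMA_NUM_PARALLEL set to 1 and then to a higher value is a simple way to check whether the setting helps on your hardware.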

API automation

Ollama's HTTP API is straightforward and well-documented, and it listens on port 11434 by default. You can call it from scripts, cron jobs, or workflow tools like n8n:

```bash
# Request a single, non-streamed completion from the local server
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Summarise this text in three sentences",
  "stream": false
}'
```

For application integration, see Build Your Own AI Assistant with n8n.
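When the prompt comes from a variable or a file rather than a literal, quoting becomes the main source of bugs. The sketch below builds the JSON body with a small escaping helper and pipes it to curl. The function names are hypothetical, and the escaping is deliberately simple: it handles backslashes and double quotes, flattens newlines to spaces, and assumes no other control characters.

```bash
# Base URL of the local Ollama server (default port used in this article).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Escape backslashes and double quotes so text can sit inside a JSON string.
# Newlines are flattened to spaces; other control characters are not handled.
json_escape() {
  printf '%s' "$1" | tr '\n' ' ' | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a non-streaming /api/generate request body for MODEL and PROMPT.
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' \
    "$1" "$(json_escape "$2")"
}

# Send the request; curl reads the body from stdin via "-d @-".
generate() {
  build_payload "$1" "$2" | curl -s "$OLLAMA_URL/api/generate" -d @-
}

# Example (needs a running server):
#   generate llama3 "Summarise this text in three sentences: $(cat notes.txt)"
```

The generated text is in the `response` field of the returned JSON; if `jq` is installed, piping the output through `jq -r '.response'` extracts it.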

Model management tricks

Three commands cover day-to-day housekeeping: `ollama list` shows all downloaded models, `ollama rm modelname` deletes one to free disk space, and `ollama ps` shows which models are currently loaded in memory.

Ollama keeps recently used models loaded in memory, for five minutes by default. If you switch between models frequently, increase OLLAMA_KEEP_ALIVE (for example, to `30m`) to keep them loaded longer, or decrease it to free VRAM faster.
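The same setting can also be applied per request through the API's `keep_alive` field, which is handy for preloading a model before a batch job or unloading it afterwards. This is a minimal sketch; the helper names `keep_alive_payload` and `set_keep_alive` are hypothetical.

```bash
# Base URL of the local Ollama server (default port used in this article).
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Print a request body that loads MODEL without a prompt and sets keep_alive.
# KEEP is a duration such as 30m, or a plain number such as 0 (unload now).
keep_alive_payload() {
  model="$1"
  keep="$2"
  case "$keep" in
    *[!0-9-]*) keep="\"$keep\"" ;;  # durations like 30m go out as JSON strings
  esac
  printf '{"model": "%s", "keep_alive": %s}' "$model" "$keep"
}

# Send the request to the running server.
set_keep_alive() {
  keep_alive_payload "$1" "$2" | curl -s "$OLLAMA_URL/api/generate" -d @-
}

# Examples (need a running server):
#   set_keep_alive llama3 30m   # preload and keep for 30 minutes
#   set_keep_alive llama3 0     # unload immediately to free VRAM
```

Running `ollama ps` after each call should show the model appear with the new expiry or disappear from the list.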

GPU and memory tuning

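Set OLLAMA_GPU_OVERHEAD (a value in bytes) to reserve VRAM on each GPU for the operating system and other applications. The number of layers offloaded to the GPU is set per model with the `num_gpu` parameter, in a Modelfile or in the API `options`, rather than with an OLLAMA_GPU_LAYERS variable. On multi-GPU NVIDIA systems, CUDA_VISIBLE_DEVICES limits which GPUs Ollama uses. Set OLLAMA_FLASH_ATTENTION=1 to reduce memory usage for long contexts.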

Monitor VRAM during inference with `nvidia-smi` or `ollama ps` to find the right balance.
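For NVIDIA GPUs, a short polling loop makes it easier to watch memory while a request is running. This sketch assumes `nvidia-smi` is on the PATH; `vram_percent` and `watch_vram` are hypothetical helper names.

```bash
# Convert one "used, total" line (values in MiB) into a whole-number percentage.
vram_percent() {
  printf '%s\n' "$1" | awk -F', *' '$2 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}

# Print VRAM usage for each GPU every INTERVAL seconds until interrupted.
watch_vram() {
  interval="${1:-2}"
  while true; do
    nvidia-smi --query-gpu=memory.used,memory.total \
               --format=csv,noheader,nounits |
    while IFS= read -r line; do
      echo "$(date +%T) ${line} MiB -> $(vram_percent "$line")%"
    done
    sleep "$interval"
  done
}

# Example: watch_vram 1   (stop with Ctrl+C)
```

Run it in a second terminal while you send requests, then adjust the parallelism and keep-alive settings based on the peak you observe.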

Conclusion

Ollama is deceptively simple at first glance. The Modelfile system, parallel request support, and scripting API turn it into a serious local inference platform. Invest time in learning these features: they make the difference between Ollama as a toy and Ollama as infrastructure.

FAQ

Can Modelfiles change the model weights?

No. Modelfiles only change system prompts, parameters, and templates. The underlying weights remain unchanged.

Does parallel inference require more VRAM?

Yes. Each parallel request slot needs its own context memory, so VRAM use grows with the OLLAMA_NUM_PARALLEL value. Test with conservative settings and monitor memory.

How do I make Ollama start on boot?

Use your operating system's service manager. On Linux, if your installation set up a systemd service, run `sudo systemctl enable ollama`.
