Tutorials

DeepSeek V4 Pro: Run DeepSeek's Latest Reasoning Model on Your Own Hardware

DeepSeek V4 Pro is here with improved reasoning and massive context. Set it up locally with Ollama for private, state-of-the-art AI inference.

Robson PereiraMay 31, 20269 min read
DeepSeek V4 Pro reasoning model running on a local Ollama server.

DeepSeek V4 Pro: Run DeepSeek's Latest Reasoning Model on Your Own Hardware

DeepSeek V4 Pro has arrived on Hugging Face with over 5.9 million downloads and 4,400+ stars, making it one of the fastest-growing model releases of 2026. Building on the architecture of DeepSeek V3, V4 Pro brings improved reasoning depth, longer native context, and better adherence to system prompts — all within the same MoE (Mixture of Experts) parameter efficiency that made DeepSeek a favourite in the self-hosted community.

This guide covers how to run DeepSeek V4 Pro locally with Ollama, the hardware you need, multi-GPU setup, and prompt strategies that make the most of its reasoning capabilities.

What Makes DeepSeek V4 Pro Different

DeepSeek V4 Pro is a Mixture of Experts model with a total parameter count that rivals frontier models while keeping active parameters per token relatively low. That means it delivers strong reasoning performance without requiring the full VRAM budget of a dense model of comparable capability.

Key improvements over DeepSeek V3:

  • **Enhanced reasoning depth** — better performance on multi-step logic, math, and code reasoning tasks
  • **Longer native context** — handle significantly larger documents and conversations without external chunking
  • **Improved instruction following** — stricter adherence to system prompts and formatting instructions
  • **Better tool-use reasoning** — more reliable function calling for agent workflows

If you are new to MoE models and how they manage VRAM differently from dense architectures, our Best Hardware for Self-Hosted AI guide explains the practical implications.

Hardware Requirements

DeepSeek V4 Pro uses the same MoE architecture as V3, meaning it activates only a fraction of its total parameters for each token. This makes it more memory-efficient than a dense model of similar total size.

| Configuration | VRAM Required | Notes |

|--------------|---------------|-------|

| FP16 (full precision) | ~280 GB | Multi-GPU or datacentre hardware |

| Q8_0 (8-bit) | ~140 GB | Server-grade GPU setup |

| Q4_K_M (4-bit) | ~70 GB | Dual RTX 3090/4090 (48 GB total) with offloading |

| Q4_K_M (partial offload) | ~48 GB | Single RTX 3090/4090 + 64 GB system RAM |

| IQ2_XXS (2-bit) | ~24 GB | Single consumer GPU, quality trade-off |

Practical self-hosted configurations

**Dual RTX 3090/4090 (48 GB VRAM):** Run Q4_K_M with most layers on GPU and some CPU offloading. Expect 5–15 tokens per second.

**Single RTX 4090 (24 GB VRAM):** Run IQ2_XXS for reasonable quality, or Q4_K_M with heavy CPU offloading at slower speeds.

**Apple M3 Ultra (128 GB+ unified memory):** Excellent performance with Q4_K_M using Metal acceleration, often faster than GPU+CPU offloading on discrete cards.

**Dedicated server (A6000/A100):** Run Q8_0 or FP16 for maximum quality at high throughput.

Setting Up DeepSeek V4 Pro with Ollama

Ollama supports DeepSeek V4 Pro directly. The Ollama project now lists DeepSeek among its headline-supported model families, alongside Kimi-K2.5, GLM-5, and GPT-OSS.

Step 1: Pull the model

```bash

ollama pull deepseek-v4-pro

```

Ollama selects a sensible default quantisation for your hardware. To choose a specific quantisation:

```bash

ollama pull deepseek-v4-pro:q4_k_m

```

Step 2: Run the model

```bash

ollama run deepseek-v4-pro

```

Test with a reasoning prompt:

```

Calculate how many R's are in the word "strawberry". Think step by step.

```

Step 3: Multi-GPU configuration

For multi-GPU setups, set the GPU devices before starting Ollama:

```bash

export CUDA_VISIBLE_DEVICES=0,1

ollama run deepseek-v4-pro:q4_k_m

```

Ollama distributes layers across visible GPUs automatically. Use `nvidia-smi` to verify balanced VRAM usage.

Prompt Strategies for DeepSeek V4 Pro

DeepSeek reasoning models respond well to structured prompting.

Chain-of-thought prompting

DeepSeek V4 Pro excels at step-by-step reasoning. Explicitly request reasoning steps for complex tasks:

```

Analyse this code for potential race conditions. First, identify all shared state. Second, trace each goroutine's access pattern. Third, propose fixes.

```

System prompt design

DeepSeek V4 Pro follows system prompts more closely than V3. Use this to set behaviour consistently:

```

You are a senior DevOps engineer. Be concise, provide commands that work, and explain trade-offs.

```

Long-context usage

With its longer native context window, DeepSeek V4 Pro can process entire codebases or documents without summarisation. Feed the content directly:

```

Read the following codebase structure and suggest an architecture refactor: [paste content]

```

Tool-use prompts

For agent workflows, DeepSeek V4 Pro's improved function calling means you can rely on structured tool definitions. For patterns and patterns, see how n8n connects to local models for AI assistants.

Using DeepSeek V4 Pro with a Chat Interface

Open WebUI

Connect Open WebUI to your Ollama instance. DeepSeek V4 Pro appears in the model selector automatically. For RAG workflows, see How to Add Local Documents to Open WebUI with RAG.

AnythingLLM

Add DeepSeek V4 Pro as a custom model endpoint in AnythingLLM for document-grounded conversations with long-context reasoning.

API access

DeepSeek V4 Pro exposes an OpenAI-compatible API through Ollama:

```bash

curl http://localhost:11434/api/generate -d '{

"model": "deepseek-v4-pro",

"prompt": "Explain quantum computing in simple terms.",

"stream": false

}'

```

Comparing DeepSeek V4 Pro to Other Models

DeepSeek V4 Pro fills a specific niche in the self-hosted model landscape:

| Model | Active params | Best for | Hardware tier |

|-------|--------------|----------|---------------|

| DeepSeek V4 Pro | ~37B active | Reasoning, code, long context | High-end (24 GB+) |

| GPT-OSS-20B | 20B dense | General purpose, writing | Mid-range (12 GB+) |

| Kimi-K2.5 | ~20B dense | Research, analysis | Mid-range (16 GB+) |

| Llama 3.1 70B | 70B dense | Balanced capability | High-end (48 GB+) |

| Qwen 3 8B | 8B dense | Speed, simple tasks | Any machine |

Troubleshooting

Model fails to load

If Ollama reports an out-of-memory error, switch to a smaller quantisation. The Q4_K_M variant requires ~70 GB VRAM for full GPU offloading.

Slow generation on single GPU

Reduce the context window and batch size:

```bash

ollama run deepseek-v4-pro:q4_k_m --num-ctx 4096 --num-batch 1

```

Poor quality responses

Ensure you are not using the IQ2_XXS quantisation for critical work. The quality drop is noticeable for reasoning tasks. Q4_K_M or higher is recommended.

Metal (Apple Silicon) acceleration

On macOS, verify Metal is enabled:

```bash

export OLLAMA_METAL=1

ollama run deepseek-v4-pro:q4_k_m

```

Conclusion

DeepSeek V4 Pro is one of the strongest reasoning models available for self-hosted deployment. Its MoE architecture makes it more accessible than dense models of comparable total size, and Ollama's integration means setup takes minutes rather than hours. Start with Q4_K_M on the most capable GPU you have, and adjust quantisation based on your throughput and quality requirements.

FAQ

Is DeepSeek V4 Pro better than DeepSeek V3?

Yes — V4 Pro shows measurable improvements in reasoning benchmarks, instruction following, and context utilisation. It is a meaningful upgrade rather than a minor point release.

Can I run DeepSeek V4 Pro on a single GPU?

The Q4_K_M variant needs GPU offloading and system RAM on a single 24 GB card. The IQ2_XXS variant fits entirely in 24 GB VRAM but with quality degradation.

Does Ollama support DeepSeek V4 Pro natively?

Yes. Ollama includes DeepSeek V4 Pro in its model catalogue and handles quantisation, layer distribution, and API exposure automatically.

What context length does DeepSeek V4 Pro support?

DeepSeek V4 Pro supports a significantly longer native context than V3, comfortably handling entire codebases or long documents in a single pass.

**Sources:**

Related articles