
Run OpenAI GPT-OSS 120B Locally: Set Up OpenAI's First Open-Source Model

OpenAI open-sourced GPT-OSS 120B and 20B. Here is how to run them locally with Ollama and what hardware you need.

Robson Pereira · May 31, 2026 · 10 min read
OpenAI GPT-OSS model running locally with Ollama on a self-hosted server.


OpenAI has released **GPT-OSS**, its first open-weight language models since GPT-2, and the self-hosted AI community has responded with a surge of interest: the 120B parameter variant has already passed 4.7 million downloads on Hugging Face. For anyone running local AI, this is a watershed moment. A frontier-grade model from the lab that defined modern LLMs is now available to run on your own hardware.

This guide covers what GPT-OSS is, how to set it up locally with Ollama, the hardware you need for each variant, and how to get usable performance from quantised versions.

What Is GPT-OSS and Why Does It Matter?

GPT-OSS comes in two sizes: **GPT-OSS-120B** (the full flagship) and **GPT-OSS-20B** (a more accessible variant). Both are decoder-only mixture-of-experts transformer models released under the permissive Apache 2.0 licence, which allows commercial use.

The significance is hard to overstate. OpenAI already offers its GPT models through its API, but having the actual model weights available locally means:

  • **Complete data privacy** — every prompt and response stays on your machine
  • **No API costs** — pay for hardware once, run unlimited queries
  • **Full control** — fine-tune, quantise, or modify the model as you see fit
  • **Offline operation** — no internet connection required after download

For a broader comparison of private versus cloud AI, see Private AI vs Cloud AI, which covers the cost and control trade-offs in detail.

Hardware Requirements

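The 120B variant is demanding but feasible on high-end hardware with quantisation. OpenAI ships both checkpoints with their mixture-of-experts weights already quantised to MXFP4 (roughly 4 bits per weight) and states that the 120B model fits on a single 80 GB GPU and the 20B model runs within 16 GB of memory. The FP16 figures below describe the weights upcast to 16-bit, so treat them as an upper bound rather than the typical case.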

GPT-OSS-120B (unquantised)

  • **VRAM:** ~240 GB (FP16), which requires multiple datacentre GPUs (for example, several 80 GB A100/H100 cards)
  • **System RAM:** 128 GB minimum
  • **Storage:** 240 GB free

GPT-OSS-20B

  • **VRAM:** ~40 GB (FP16) — fits on dual RTX 3090/4090 or a single A6000
  • **System RAM:** 64 GB
  • **Storage:** 40 GB free

| Quantisation | 120B VRAM | 20B VRAM | Quality trade-off |
|--------------|-----------|----------|-------------------|
| Q4_K_M (4-bit) | ~68 GB | ~12 GB | Minimal quality loss |
| Q5_K_M (5-bit) | ~84 GB | ~15 GB | Near-lossless |
| Q8_0 (8-bit) | ~130 GB | ~22 GB | Virtually identical to FP16 |
| IQ2_XXS (2-bit) | ~36 GB | ~7 GB | Noticeable quality drop, useful for limited hardware |
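
The FP16 figures above follow from simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. The sketch below is a back-of-the-envelope helper, not a measurement. `estimate_weights_gb` is a hypothetical name, it takes the parameter count in billions, and it ignores the KV cache, activations, and the per-block metadata that pushes real quantised files (such as Q4_K_M) above the flat 4-bit figure.

```bash
# Rough weight-only memory estimate in GB (integer arithmetic).
# $1 = parameters in billions, $2 = bits per weight.
# Hypothetical helper for illustration; excludes KV cache and runtime overhead.
estimate_weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

estimate_weights_gb 120 16   # 120B at 16-bit
estimate_weights_gb 20 16    # 20B at 16-bit
estimate_weights_gb 120 4    # 120B at a flat 4 bits per weight
```

The first two calls print 240 and 40, matching the FP16 figures. The third prints 60, which is a floor under the ~68 GB listed for Q4_K_M rather than a prediction of it.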

For hardware planning, read Best Hardware for Self-Hosted AI to size your build correctly.

Installing GPT-OSS with Ollama

Ollama now includes GPT-OSS in its default model catalogue — the description on the Ollama GitHub page explicitly lists "gpt-oss" alongside DeepSeek, Qwen, and Gemma. This makes installation straightforward.

Step 1: Install Ollama

If you do not already have Ollama, see How to Run Llama 3 Locally with Ollama for a complete setup walkthrough.

Step 2: Pull the GPT-OSS model

For the 20B variant (recommended starting point):

```bash
ollama pull gpt-oss:20b
```

For the full 120B model (requires significant VRAM):

```bash
ollama pull gpt-oss:120b
```

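Ollama downloads whichever build the tag points to; it does not choose a quantisation based on your hardware. The `gpt-oss` tags in the Ollama library ship the MXFP4 weights as released by OpenAI. To try a different quantisation level, you can pull a community GGUF build from Hugging Face by naming the quantisation after the colon (replace the placeholders with a real repository):

```bash
ollama pull hf.co/{username}/{repository}:Q4_K_M
```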

Step 3: Run the model

```bash
ollama run gpt-oss:20b
```

Test it with a simple prompt:

```
Write a Python function that validates an email address using regex.
```
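
For scripting, the prompt can also be passed as an argument so that `ollama run` answers once and exits instead of opening an interactive session. A minimal sketch, assuming the `gpt-oss:20b` tag from the Ollama library. The `ask` function and the `OLLAMA_BIN` override are hypothetical names added here so the wrapper can be pointed at a different binary.

```bash
# One-shot prompt: run the model non-interactively and print the reply.
# `ask` and OLLAMA_BIN are illustrative names, not part of Ollama.
ask() {
  ask_model="$1"
  shift
  "${OLLAMA_BIN:-ollama}" run "$ask_model" "$*"
}

# Usage (needs a running Ollama install with the model pulled):
#   ask gpt-oss:20b "Write a Python function that validates an email address using regex."
```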

Performance Tuning for Self-Hosted Setups

Getting good performance from GPT-OSS requires some tuning, especially for the 120B variant.

Batch size and context length

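Reduce the context window if you are VRAM-constrained. Set the context length through the `OLLAMA_CONTEXT_LENGTH` environment variable when starting the server, or with `/set parameter num_ctx 4096` inside an interactive `ollama run` session:

```bash
OLLAMA_CONTEXT_LENGTH=4096 ollama serve
```

For the 120B on limited hardware, try a smaller context and batch size, passed as `options` in an API request:

```bash
curl http://localhost:11434/api/generate -d '{"model": "gpt-oss:120b", "prompt": "Hello!", "options": {"num_ctx": 2048, "num_batch": 1}}'
```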
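
Another way to make a context setting stick is to bake it into a derived model with a Modelfile, using the `FROM` and `PARAMETER num_ctx` instructions. The sketch below only writes the file. `write_modelfile`, the output path, and the derived model name `gpt-oss-20b-4k` are hypothetical choices for illustration.

```bash
# Write a Modelfile that derives a fixed-context variant of a base model.
# $1 = base model tag, $2 = context length in tokens, $3 = output path.
write_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

# Usage (the last two commands need Ollama installed):
#   write_modelfile gpt-oss:20b 4096 ./Modelfile
#   ollama create gpt-oss-20b-4k -f ./Modelfile
#   ollama run gpt-oss-20b-4k
```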

GPU layer offloading

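Ensure Ollama uses your GPU by checking the server logs at startup, or run `ollama ps` while a model is loaded to see how it is split between CPU and GPU. On Apple Silicon, Ollama uses Metal automatically with no extra configuration. On multi-GPU NVIDIA systems, you can control which devices Ollama sees by setting `CUDA_VISIBLE_DEVICES` in the environment of the Ollama server:

```bash
# CUDA devices: expose only GPUs 0 and 1 to Ollama
export CUDA_VISIBLE_DEVICES=0,1
```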

Model parallelism

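For the 120B variant across multiple GPUs, Ollama automatically splits the model's layers across the GPUs it detects when the model does not fit on one. This is layer-wise splitting rather than tensor parallelism, so extra GPUs add memory capacity more than speed. Monitor VRAM usage with `nvidia-smi` to check that the load is distributed as expected.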
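
To make the `nvidia-smi` check easier to read, its CSV query output can be piped through a small formatter. A sketch, assuming the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`, which emit one `index, used, total` line per GPU with memory in MiB. `vram_usage` is a hypothetical helper name.

```bash
# Summarise per-GPU memory use from nvidia-smi CSV lines on stdin.
# Expected input, one GPU per line: "index, used_mib, total_mib"
vram_usage() {
  awk -F', ' '{ printf "GPU %s: %d%% of %s MiB used\n", $1, ($2 * 100) / $3, $3 }'
}

# Usage on a machine with NVIDIA drivers:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | vram_usage
```

One card near full while another sits almost empty suggests the model is not being spread across GPUs as expected.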

CPU fallback

If the model exceeds VRAM, Ollama offloads layers to system RAM automatically. Performance drops significantly when this happens — consider a smaller quantisation instead.

Using GPT-OSS with Chat Interfaces

GPT-OSS works with any interface that supports Ollama or an OpenAI-compatible endpoint.

Open WebUI

Add the GPT-OSS model to Open WebUI by connecting it to your Ollama server. The model appears automatically in the model picker. To choose between interfaces, see our Open WebUI vs AnythingLLM comparison.

AnythingLLM

AnythingLLM supports Ollama models natively. Point it at your Ollama instance and select GPT-OSS from the model dropdown.

Custom applications

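Ollama exposes an OpenAI-compatible API under `/v1`, so GPT-OSS can be called like any OpenAI chat model:

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```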

Troubleshooting Common Issues

Out of memory errors

The most common problem. Drop to a smaller quantisation (Q4_K_M or IQ2_XXS) or use the 20B variant instead of 120B.

Slow token generation

Check that GPU offloading is working. If tokens arrive slower than 5 per second on a GPU system, verify that Ollama detected your GPU and that layers are being offloaded.
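
To put a number on generation speed, Ollama's native API reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds) in its final response, and `ollama run --verbose` prints timing statistics after each reply. The helper below just does the division. `tokens_per_sec` is a hypothetical name, and the values in the usage comment are made up for illustration.

```bash
# Tokens per second from Ollama's eval_count and eval_duration (nanoseconds).
# $1 = eval_count, $2 = eval_duration in ns.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Usage with illustrative values (100 tokens generated in 2 seconds):
#   tokens_per_sec 100 2000000000
```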

Model download fails

GPT-OSS is a large download (roughly 65 GB for the 120B in Ollama's default build). Use a stable network connection. If a download is interrupted, run `ollama pull` again: Ollama generally resumes from the data it has already fetched rather than starting over.
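
On a flaky connection, a simple retry loop around `ollama pull` can save some babysitting. This is a sketch that assumes a repeated `ollama pull` continues from what is already on disk. `pull_with_retry`, `OLLAMA_BIN`, and `RETRY_DELAY` are hypothetical names introduced here, not Ollama settings.

```bash
# Retry `ollama pull` until it succeeds or the attempts run out.
# $1 = model tag, $2 = max attempts (default 5).
# OLLAMA_BIN and RETRY_DELAY are illustrative overrides, not Ollama settings.
pull_with_retry() {
  pull_attempt=1
  pull_max="${2:-5}"
  while [ "$pull_attempt" -le "$pull_max" ]; do
    "${OLLAMA_BIN:-ollama}" pull "$1" && return 0
    echo "pull failed (attempt $pull_attempt of $pull_max)" >&2
    pull_attempt=$((pull_attempt + 1))
    sleep "${RETRY_DELAY:-10}"
  done
  return 1
}

# Usage:
#   pull_with_retry gpt-oss:120b 10
```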

Hallucinations or poor responses

The quantised 2-bit variant (IQ2_XXS) may produce degraded outputs. Use at least Q4_K_M for production-quality results.

Conclusion

OpenAI's GPT-OSS marks a genuine shift in the open-weight AI landscape. The 20B variant is immediately practical on high-end consumer hardware, and the 120B variant brings frontier capability within reach of well-equipped homelabs. Combined with Ollama's seamless integration, running OpenAI-grade models locally has never been more accessible.

Start with the 20B Q4_K_M variant to get a feel for performance, then scale up as your hardware allows. The era of truly private frontier AI is here.

FAQ

Is GPT-OSS truly open-source?

The model weights are released under the Apache 2.0 licence, which allows commercial use, modification, and redistribution. Because the training data and training code are not published, GPT-OSS is better described as "open-weight" than fully open source, but it is substantially more open than anything OpenAI has released in years.

Can GPT-OSS run on a single consumer GPU?

The 20B variant at Q4_K_M quantisation fits on a single RTX 3090 or 4090 (24 GB VRAM). The 120B variant needs multiple GPUs or a datacentre card.

Does GPT-OSS work with Ollama on Apple Silicon?

Yes. The 20B variant runs well on M2/M3 Ultra machines with 64 GB+ unified memory using Metal acceleration.

How does GPT-OSS compare to Llama 3.1 70B?

Early tests show GPT-OSS-120B is competitive with Llama 3.1 70B on reasoning benchmarks while significantly stronger on coding tasks. The 20B variant trades some capability for accessibility.
