TabbyAPI Quick Start: Deploy an OpenAI-Compatible Local API Server
Download, configure, and run TabbyAPI as a lightweight OpenAI-compatible inference server for local LLMs on Linux and Windows.

TabbyAPI is a lightweight Python inference server, built on FastAPI, that exposes an OpenAI-compatible API for local language models. If you want a minimal, fast backend that follows the OpenAI API format closely, it is worth evaluating alongside the bigger toolkits.
What makes TabbyAPI different
TabbyAPI is built around speed and API compliance. It runs models through the ExLlama backends (ExLlamaV2 and, in recent releases, ExLlamaV3), keeps overhead low by shipping no web interface, and provides the standard `/v1/chat/completions` and `/v1/completions` endpoints without requiring extensions.
For a broader comparison of local servers, see TabbyAPI vs text-generation-webui.
Installation
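Clone the TabbyAPI repository from GitHub (`theroyallab/tabbyAPI`) and run the bundled start script: `start.sh` on Linux or `start.bat` on Windows. On first run, the script sets up a Python virtual environment and installs the dependencies for your GPU platform, so you need a recent Python 3 interpreter and a working CUDA driver stack (or ROCm on Linux) for GPU acceleration. Later runs of the same script start the server directly.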
Configure and run
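Copy the sample configuration (`config_sample.yml`) to `config.yml` and set the model directory, the model to load, and inference parameters such as context length and cache precision. If a model is named in the configuration, TabbyAPI loads it during startup and then begins listening on the configured host and port.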
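As a rough sketch of such a file, the snippet below writes a minimal `config.yml`. The key names follow the `config_sample.yml` shipped in the repository and may differ between versions; the model folder name, the port (8000, to match the examples in this guide), and the context length are placeholder assumptions. Note that it overwrites any existing `config.yml` in the current directory.

```bash
# Write a minimal config.yml into the current directory (ideally the TabbyAPI checkout).
# All values are placeholders: adjust the model folder, port, and context length.
cat > config.yml <<'EOF'
network:
  host: 127.0.0.1
  port: 8000                  # assumed here to match this guide's examples

model:
  model_dir: models           # directory containing one folder per model
  model_name: my-model-exl2   # hypothetical folder name inside model_dir
  max_seq_len: 8192           # maximum context length in tokens
  cache_mode: FP16            # KV cache precision
EOF

echo "Wrote config.yml"
```

Compare the result against the sample file for your version before starting the server, since unknown or renamed keys are the most likely source of startup errors.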
First inference test
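Once the server is running, send a test request with curl. Authentication is enabled by default, so pass the API key that TabbyAPI writes to `api_tokens.yml` on first start:
```bash
# TABBY_API_KEY holds the API key from api_tokens.yml
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TABBY_API_KEY" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
```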
Connect to Open WebUI
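TabbyAPI works directly with Open WebUI as an OpenAI-compatible backend. In Open WebUI's admin settings, add a new OpenAI connection pointing to `http://localhost:8000/v1` and enter your TabbyAPI API key. If Open WebUI runs in a container, `localhost` refers to the container itself, so use an address that reaches the host instead. Models served by TabbyAPI will then appear in the model selector.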
Read Open WebUI Setup for Local Documents for the full interface configuration.
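Before adding the connection, it can help to confirm that the model list endpoint responds, since Open WebUI populates its selector from the same `/v1/models` route. The helper below is a minimal sketch; `TABBY_URL` and `TABBY_API_KEY` are environment variable names assumed for this example, not TabbyAPI settings.

```bash
# List the models the server advertises on the OpenAI-compatible /v1/models route.
# TABBY_URL and TABBY_API_KEY are assumed environment variables for this sketch.
tabby_list_models() {
  curl -sS "${TABBY_URL:-http://localhost:8000}/v1/models" \
    -H "Authorization: Bearer ${TABBY_API_KEY:-}"
}
```

Run `tabby_list_models` after exporting `TABBY_API_KEY`. A JSON response containing a model list suggests the URL and key will also work in Open WebUI; an authentication error points at the key rather than the connection.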
Performance tuning
Batch size and context length
Adjust the batch size and maximum context length in the configuration file to match your GPU memory. Larger batches improve throughput for concurrent requests; longer context windows increase memory usage per request, because the key-value cache grows linearly with the number of tokens held in context.
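To get a feel for how context length translates into memory, the sketch below applies the usual transformer estimate: two tensors (keys and values) per layer, each of size KV heads × head dimension × tokens × bytes per element. The model dimensions are hypothetical, and the figure covers only the cache, not the weights or activations, so treat it as a lower bound rather than a measurement of TabbyAPI itself.

```bash
# Estimate KV cache size in bytes for one sequence:
#   2 (keys and values) * layers * kv_heads * head_dim * tokens * bytes_per_element
kv_cache_bytes() {
  local layers="$1" kv_heads="$2" head_dim="$3" tokens="$4" bytes_per_elem="${5:-2}"
  echo $(( 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem ))
}

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 2-byte (FP16) cache.
for tokens in 4096 8192 32768; do
  bytes=$(kv_cache_bytes 32 8 128 "$tokens")
  echo "$tokens tokens -> $(( bytes / 1024 / 1024 )) MiB"
done
# prints:
#   4096 tokens -> 512 MiB
#   8192 tokens -> 1024 MiB
#   32768 tokens -> 4096 MiB
```

Passing `1` as the fifth argument approximates an 8-bit cache, which halves the estimate and shows why a quantized cache mode is a common way to fit longer contexts.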
Keep multiple models ready
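TabbyAPI can load and unload models at runtime through authenticated API calls, so you can switch models without restarting the server. A common setup keeps a fast model for simple queries and a larger model for complex reasoning tasks side by side in the model directory and loads whichever one the workload needs.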
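The helpers below sketch how switching might look from the shell, using the `/v1/model/load` and `/v1/model/unload` admin routes. The request field name (`model_name`), the `x-admin-key` header, and the `TABBY_URL` and `TABBY_ADMIN_KEY` environment variables reflect my understanding of the current API and are assumptions to verify against the interactive `/docs` page of your running server.

```bash
# Switch models through TabbyAPI's admin routes.
# TABBY_URL and TABBY_ADMIN_KEY are assumed environment variables for this sketch;
# the admin key is written to api_tokens.yml alongside the regular API key.
tabby_unload_model() {
  curl -sS -X POST "${TABBY_URL:-http://localhost:8000}/v1/model/unload" \
    -H "x-admin-key: ${TABBY_ADMIN_KEY:-}"
}

tabby_load_model() {
  local name="$1"   # folder name inside the configured model directory
  curl -sS -X POST "${TABBY_URL:-http://localhost:8000}/v1/model/load" \
    -H "Content-Type: application/json" \
    -H "x-admin-key: ${TABBY_ADMIN_KEY:-}" \
    -d "{\"model_name\": \"${name}\"}"
}
```

For example, `tabby_unload_model && tabby_load_model my-large-model` would swap in a hypothetical `my-large-model` folder. The model name is interpolated into the JSON body without escaping, so this sketch only suits folder names without quotes or backslashes.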
Conclusion
TabbyAPI fills a specific niche: a fast, minimal OpenAI-compatible server for local LLMs. If you value API compliance and low overhead over a web interface and extensions, it is a strong choice for the inference layer of your local AI stack.
FAQ
Is TabbyAPI production-ready?
It is stable enough for personal and small-team use, but the maintainers present it as a hobby project rather than production software, so review the disclaimers in the project README against your specific deployment context.
Does TabbyAPI support function calling?
Yes. OpenAI-compatible function calling is supported when the loaded model and its chat template support a tool-use format.
Can TabbyAPI run on CPU only?
Not in practice. The ExLlama backends that TabbyAPI relies on require a supported GPU, so for CPU-only inference a llama.cpp-based server is the better fit.

