Tutorials
TabbyAPI Setup Guide: A FastAPI Inference Server for Local Models
Install and configure TabbyAPI, a lightweight FastAPI-based inference server for local LLMs with OpenAI-compatible endpoints and tool-calling support.

TabbyAPI Setup Guide: A FastAPI Inference Server for Local Models
TabbyAPI is a lightweight, FastAPI-based inference server designed for local LLMs. It provides OpenAI-compatible endpoints, built-in tool-calling support, and efficient memory management through a modular extension system. If you want a minimal server that exposes local models as an API without the overhead of a full chat interface, TabbyAPI is worth considering.
For a broader comparison of local runners, read Ollama vs vLLM vs llama.cpp: Choosing a Local Inference Engine.
Prerequisites
You need Python 3.10 or later, a compatible GPU with sufficient VRAM, and the CUDA toolkit installed. TabbyAPI works on Linux and Windows through WSL2. A CPU-only mode exists but is practical only for small models and testing.
Check your hardware against Best Hardware for Self-Hosted AI before choosing a model size.
Installation
Clone the TabbyAPI repository and install dependencies in a virtual environment:
```bash
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
The installer pulls in exllamav2 as the backend, so model loading and inference use the same efficient kernel library.
Configuration
TabbyAPI uses a YAML configuration file for model paths, sampling parameters, and server settings. Create a config.yml with your model directory and default generation parameters.
Key settings include context length, GPU split configuration for multi-GPU setups, and cache size. Start with conservative values and increase once you confirm the system is stable.
Loading a model
TabbyAPI loads exllamav2-format models. Download a model in exl2 quantisation from Hugging Face, point the config at its directory, and start the server. The server loads the model into VRAM and exposes it at the /v1/chat/completions endpoint.
Loading a 8B-parameter model at 4-bit quantisation takes about 10–15 seconds on a modern GPU. Larger models or full-precision loads take longer.
API endpoints
TabbyAPI provides the standard OpenAI chat completions endpoint, making it compatible with Open WebUI, AnythingLLM, SillyTavern, and any tool that speaks the OpenAI API.
Tool calling and function calling work out of the box, which is a significant advantage over simpler runners. If you build agentic workflows that need structured outputs, TabbyAPI handles them natively.
Integrating with chat interfaces
Point Open WebUI or AnythingLLM at your TabbyAPI endpoint by setting the base URL and leaving the API key blank (or setting one in the TabbyAPI config). The interface communicates with your local model through the standard API, so you get the full chat UI experience backed by a lightweight server.
Conclusion
TabbyAPI fills a specific niche: a lightweight, OpenAI-compatible inference server with strong tool-calling support. It is not a full chat interface or a model manager — it is a focused backend that does one thing well. Use it when you want programmatic access to local models with minimal overhead.
FAQ
What model formats does TabbyAPI support?
TabbyAPI uses exl2 format models through the exllamav2 backend. Other formats require conversion.
Does TabbyAPI support multi-GPU?
Yes. You can split models across multiple GPUs using the GPU split configuration.
Is TabbyAPI production-ready?
It is stable for personal and small-team use. For production, add authentication, rate limiting, and monitoring.

