Tutorials

TabbyAPI Setup Guide: A FastAPI Inference Server for Local Models

Install and configure TabbyAPI, a lightweight FastAPI-based inference server for local LLMs with OpenAI-compatible endpoints and tool-calling support.

Robson PereiraMay 31, 20269 min read

TabbyAPI inference server terminal output showing model loading and API endpoints.

TabbyAPI Setup Guide: A FastAPI Inference Server for Local Models

TabbyAPI is a lightweight, FastAPI-based inference server designed for local LLMs. It provides OpenAI-compatible endpoints, built-in tool-calling support, and efficient memory management through a modular extension system. If you want a minimal server that exposes local models as an API without the overhead of a full chat interface, TabbyAPI is worth considering.

For a broader comparison of local runners, read Ollama vs vLLM vs llama.cpp: Choosing a Local Inference Engine.

Prerequisites

You need Python 3.10 or later, a compatible GPU with sufficient VRAM, and the CUDA toolkit installed. TabbyAPI works on Linux and Windows through WSL2. A CPU-only mode exists but is practical only for small models and testing.

Check your hardware against Best Hardware for Self-Hosted AI before choosing a model size.

Installation

Clone the TabbyAPI repository and install dependencies in a virtual environment:

```bash

git clone https://github.com/theroyallab/tabbyAPI

cd tabbyAPI

python -m venv venv

source venv/bin/activate

pip install -r requirements.txt

```

The installer pulls in exllamav2 as the backend, so model loading and inference use the same efficient kernel library.

Configuration

TabbyAPI uses a YAML configuration file for model paths, sampling parameters, and server settings. Create a config.yml with your model directory and default generation parameters.

Key settings include context length, GPU split configuration for multi-GPU setups, and cache size. Start with conservative values and increase once you confirm the system is stable.

Loading a model

TabbyAPI loads exllamav2-format models. Download a model in exl2 quantisation from Hugging Face, point the config at its directory, and start the server. The server loads the model into VRAM and exposes it at the /v1/chat/completions endpoint.

Loading a 8B-parameter model at 4-bit quantisation takes about 10–15 seconds on a modern GPU. Larger models or full-precision loads take longer.

API endpoints

TabbyAPI provides the standard OpenAI chat completions endpoint, making it compatible with Open WebUI, AnythingLLM, SillyTavern, and any tool that speaks the OpenAI API.

Tool calling and function calling work out of the box, which is a significant advantage over simpler runners. If you build agentic workflows that need structured outputs, TabbyAPI handles them natively.

Integrating with chat interfaces

Point Open WebUI or AnythingLLM at your TabbyAPI endpoint by setting the base URL and leaving the API key blank (or setting one in the TabbyAPI config). The interface communicates with your local model through the standard API, so you get the full chat UI experience backed by a lightweight server.

Conclusion

TabbyAPI fills a specific niche: a lightweight, OpenAI-compatible inference server with strong tool-calling support. It is not a full chat interface or a model manager — it is a focused backend that does one thing well. Use it when you want programmatic access to local models with minimal overhead.

FAQ

What model formats does TabbyAPI support?

TabbyAPI uses exl2 format models through the exllamav2 backend. Other formats require conversion.

Does TabbyAPI support multi-GPU?

Yes. You can split models across multiple GPUs using the GPU split configuration.

Is TabbyAPI production-ready?

It is stable for personal and small-team use. For production, add authentication, rate limiting, and monitoring.

TabbyAPI Setup Guide: A FastAPI Inference Server for Local Models

TabbyAPI Setup Guide: A FastAPI Inference Server for Local Models

Prerequisites

Installation

Configuration

Loading a model

API endpoints

Integrating with chat interfaces

Conclusion

FAQ

What model formats does TabbyAPI support?

Does TabbyAPI support multi-GPU?

Is TabbyAPI production-ready?

Related articles

How to Add Local Documents to Open WebUI with RAG and Ollama

How to Deploy Open WebUI and Ollama on a Private LAN with Docker Compose

How to Build a Self-Hosted AI Workstation with Docker and Multiple Model Runners