
TabbyAPI Quick Start: Deploy an OpenAI-Compatible Local API Server

Download, configure, and run TabbyAPI as a lightweight OpenAI-compatible inference server for local LLMs on Linux and Windows.

Robson Pereira | May 31, 2026 | 7 min read


TabbyAPI is a lightweight, Python-based inference server (built on FastAPI) that exposes an OpenAI-compatible API for local language models. If you want a minimal, fast backend that follows the OpenAI API format closely, it is worth evaluating alongside the bigger toolkits.

What makes TabbyAPI different

TabbyAPI is built around speed and API compliance. It runs models on the GPU through the ExLlamaV2 backend, keeps memory overhead low, and provides the standard `/v1/chat/completions` and `/v1/completions` endpoints without requiring extensions or additional configuration.

For a broader comparison of local servers, see TabbyAPI vs text-generation-webui.

Installation

Clone the repository from GitHub and run the bundled start script (`start.sh` on Linux, `start.bat` on Windows). The script sets up a Python virtual environment and installs the dependencies for your GPU, so you need a supported Python 3 release and a working CUDA or ROCm setup for GPU acceleration.
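
A sketch of setting up from a source checkout on Linux. The repository URL and the `start.sh` script name reflect the upstream project at the time of writing and should be checked against its README; `install_tabbyapi` is a hypothetical helper name.

```bash
# Upstream repository at the time of writing; verify before use.
TABBY_REPO="https://github.com/theroyallab/tabbyAPI"

# Hypothetical helper: clone the project and hand over to its start script.
# The start script may prompt for your GPU type on the first run.
install_tabbyapi() {
  target_dir="${1:-tabbyAPI}"
  git clone "$TABBY_REPO" "$target_dir" || return 1
  cd "$target_dir" || return 1
  ./start.sh
}
```

Calling `install_tabbyapi "$HOME/tabbyAPI"` clones into that directory and starts the setup. Nothing runs until the function is called.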

Configure and run

Create a `config.yml` in the project directory (the repository ships a sample config you can copy) specifying the model directory, the model to load, and inference parameters such as the maximum context length. TabbyAPI loads the model during startup and begins listening on the configured port.
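
To make the configuration step concrete, here is a minimal sketch that writes a starter config file. The key names (`network.host`, `network.port`, `model.model_dir`, `model.model_name`, `model.max_seq_len`) mirror the project's sample config at the time of writing and should be treated as assumptions to verify; the model folder name is a placeholder, and `write_tabby_config` is a hypothetical helper.

```bash
# Hypothetical helper: write a starter config to the given path (default: config.yml).
write_tabby_config() {
  config_path="${1:-config.yml}"
  cat > "$config_path" <<'EOF'
# Key names follow the upstream sample config at the time of writing; verify them.
network:
  host: 127.0.0.1
  port: 8000                  # port used by the examples in this article
model:
  model_dir: models           # directory that contains your model folders
  model_name: my-model-exl2   # placeholder: folder of the model to load at startup
  max_seq_len: 8192           # maximum context length in tokens
EOF
}
```

Run `write_tabby_config ./config.yml` from the project directory, then edit the placeholder values before starting the server.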

First inference test

Once the server is running, send a test request with curl:

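```bash
# Adjust the port to match your config. TABBY_API_KEY holds the API key the
# server generated; drop that header if authentication is disabled.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TABBY_API_KEY" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
```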
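
Because the model loads during startup, a scripted test can fail simply because the server is not listening yet. The sketch below polls until the port answers. `wait_for_server` is a hypothetical helper, and it treats any HTTP response (even an authentication error) as a sign that the server is up.

```bash
# Hypothetical helper: poll a URL about once per second until it answers
# or the attempts run out.
# Usage: wait_for_server URL [ATTEMPTS]
wait_for_server() {
  url="$1"
  remaining="${2:-60}"
  while [ "$remaining" -gt 0 ]; do
    # Without --fail, curl exits 0 for any HTTP response, including 401.
    if curl --silent --output /dev/null --max-time 2 "$url"; then
      return 0
    fi
    remaining=$((remaining - 1))
    sleep 1
  done
  return 1
}
```

For example, `wait_for_server http://localhost:8000/v1/models 120` returns success once the server responds, after which the test request can be sent safely.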

Connect to Open WebUI

TabbyAPI works directly with Open WebUI as an OpenAI-compatible backend. In Open WebUI's admin settings, add a new OpenAI connection pointing to `http://localhost:8000/v1` (adjust the port to match your TabbyAPI config) and enter your TabbyAPI API key if authentication is enabled. Models served by TabbyAPI will then appear in the model selector.
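
If Open WebUI runs in Docker, the same connection can be supplied through environment variables instead of the admin screen. The sketch below prints an env file. `OPENAI_API_BASE_URL` and `OPENAI_API_KEY` are Open WebUI settings, while `openwebui_env`, the default host, and the port are assumptions for illustration. Inside a container, `localhost` refers to the container itself, hence the `host.docker.internal` default.

```bash
# Hypothetical helper: print env-file lines for an Open WebUI container.
# Usage: openwebui_env [TABBY_HOST] [TABBY_PORT] [API_KEY]
openwebui_env() {
  tabby_host="${1:-host.docker.internal}"
  tabby_port="${2:-8000}"
  tabby_key="${3:-changeme}"
  printf 'OPENAI_API_BASE_URL=http://%s:%s/v1\n' "$tabby_host" "$tabby_port"
  printf 'OPENAI_API_KEY=%s\n' "$tabby_key"
}

# Example use (not run here):
#   openwebui_env host.docker.internal 8000 "$TABBY_API_KEY" > openwebui.env
#   docker run -d -p 3000:8080 \
#     --add-host=host.docker.internal:host-gateway \
#     --env-file openwebui.env ghcr.io/open-webui/open-webui:main
```

For the container to reach TabbyAPI, the server has to listen on an address the container can see, not only on the loopback interface.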

Read Open WebUI Setup for Local Documents for the full interface configuration.

Performance tuning

Batch size and context length

Adjust the batch size and maximum context length in the configuration file to match your GPU memory. Larger batches improve throughput for concurrent requests; longer context windows increase memory usage per request.
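
To see why context length drives memory use, it helps to estimate the key/value cache, which grows linearly with the number of tokens held in context. The sketch below applies the usual estimate (2 × layers × KV heads × head dimension × bytes per value × tokens). The model shape in the example is an assumption resembling an 8B Llama-style model with grouped-query attention, and the result ignores model weights, activations, and quantized cache modes.

```bash
# Estimate KV cache size in MiB for one sequence (integer division truncates).
# Usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM BYTES_PER_VALUE TOKENS
kv_cache_mib() {
  layers="$1"; kv_heads="$2"; head_dim="$3"; bytes="$4"; tokens="$5"
  # Factor 2 accounts for storing both keys and values.
  per_token=$((2 * layers * kv_heads * head_dim * bytes))
  echo $((per_token * tokens / 1024 / 1024))
}

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
kv_cache_mib 32 8 128 2 8192    # prints 1024
kv_cache_mib 32 8 128 2 32768   # prints 4096
```

Under these assumptions, quadrupling the context from 8K to 32K tokens quadruples the cache from about 1 GiB to about 4 GiB, and serving several requests at once multiplies the total again.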

Keep multiple models ready

TabbyAPI can load and unload models through API calls, so you can switch between model configurations without restarting the server. For example, use a fast model for simple queries and switch to a larger model for complex reasoning tasks.
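
A sketch of switching models from a script. The `/v1/model/load` path, the `name` field, and the `x-admin-key` header follow the upstream API at the time of writing and should be verified against the project's documentation. `TABBY_URL`, `TABBY_ADMIN_KEY`, `model_load_body`, and `switch_model` are hypothetical names.

```bash
TABBY_URL="${TABBY_URL:-http://localhost:8000}"

# Build the JSON body for a load request.
# Assumes the model name contains no characters that need JSON escaping.
model_load_body() {
  printf '{"name":"%s"}' "$1"
}

# Ask the server to load the named model (a folder inside the model directory).
# Requires TABBY_ADMIN_KEY to hold the server's admin key.
switch_model() {
  curl --silent --fail -X POST "$TABBY_URL/v1/model/load" \
    -H "Content-Type: application/json" \
    -H "x-admin-key: $TABBY_ADMIN_KEY" \
    -d "$(model_load_body "$1")"
}
```

For example, `switch_model small-fast-model` before a batch of simple queries and `switch_model large-reasoning-model` before heavier work. Both model names are placeholders. How the server treats an already-loaded model may vary by version, so check whether an explicit unload is needed first.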

Conclusion

TabbyAPI fills a specific niche: a fast, minimal OpenAI-compatible server for local LLMs. If you value API compliance and low overhead over a web interface and extensions, it is a strong choice for the inference layer of your local AI stack.

FAQ

Is TabbyAPI production-ready?

It is stable enough for personal and team use, but review the project's readiness documentation for your specific deployment context.

Does TabbyAPI support function calling?

OpenAI-compatible function calling is supported when using the ExLlamaV2 backend with a model that supports tool-use format.

Can TabbyAPI run on CPU only?

Not in practice. The ExLlamaV2 backend targets GPU inference (CUDA or ROCm), so CPU-only machines are better served by a llama.cpp-based server.
