TabbyAPI¶

Exllamav2-based inference server with an OpenAI-compatible surface. Targeted at users who want EXL2 quantization performance with a familiar API.

Surface (current)¶

Endpoint	Status	Notes
`/v1/models`	✅	Lists loaded model and adjacent presets
`/v1/models/{id}`	✅
`/v1/chat/completions`	✅	Tools, JSON mode
`/v1/chat/completions` (stream)	✅
`/v1/completions`	✅
`/v1/embeddings`	❌	Not the focus of the project
`/v1/audio/*`	❌
`/v1/images/*`	❌

top_a — alternative sampler. Spec-silent.
DRY sampler — discourages repeated phrases. Spec-silent.
Speculative decoding — exposed as n_speculative_tokens and related params; runtime-only, no spec analog.
Multi-model routing — load multiple models in one process; request specifies which one.

Sampler defaults differ from OpenAI. TabbyAPI's defaults lean toward narrative writing (lower top_p, higher temperature) than OpenAI's. The compliance probe doesn't care, but a client that doesn't set its own params will see different style than they expect.