llama.cpp (llama-server)¶
The reference low-dependency local inference server. Its OpenAI
endpoint is implemented in examples/server/server.cpp and has
matured rapidly through 2025.
Surface (current)¶
| Endpoint | Status | Notes |
|---|---|---|
/v1/models |
✅ | Returns the loaded model + any preset entries |
/v1/models/{id} |
⚠️ | 404 in vanilla llama-server; LlamaSwap adds it |
/v1/chat/completions |
✅ | Tools, JSON mode, schema-constrained sampling |
/v1/chat/completions (stream) |
✅ | [DONE] sentinel, usage opt-in |
/v1/completions |
✅ | Legacy text completion still maintained |
/v1/responses |
❌ | Not implemented |
/v1/embeddings |
✅ | Requires --embeddings flag at server start (returns 501 otherwise, which the prober reports as WARN) |
/v1/audio/speech |
❌ | No TTS in vanilla; out of scope |
/v1/audio/transcriptions |
❌ | Likewise |
/v1/images/generations |
❌ | Out of scope |
/v1/videos |
❌ | Out of scope |
Notable extensions¶
cache_promptin chat / completion bodies — keeps the prompt KV cache across requests when prefixes match. No spec analog; enormous latency win for chat UIs that pin a system prompt.grammar/grammar_lazy— GBNF-constrained sampling. The catalog'sresponse_format: json_schemafalls through to this on llama.cpp.mirostat,mirostat_tau,mirostat_eta— alternative sampler. Spec-silent; harmless extras.- Multi-modal: when started with
--mmproj,llama-serveracceptsimage_urlcontent parts in chat messages. The mmproj is loaded once at server start; you can't switch projectors per request.
Common deviations the catalog flags¶
/v1/models/{id}404. WARN. Acceptable; clients rarely use it.usagemissing on streamed responses unlessstream_options.include_usage: trueis set. The catalog probes without that flag and tolerates absence on streams (PASS).finish_reason: "length"formax_tokenscap. Spec correct.finish_reason: "stop"even when the model emits its EOS. Correct per spec; some older shims used"eos".
Quirks worth knowing¶
--jinjamatters. Without it, llama.cpp uses its own approximate chat templating, which can disagree with the model's trained format on tool-call syntax. With it, the server respects the GGUF's embedded jinja chat template. Always pass--jinjafor modern instruction-tuned models.n_paralleland KV-cache eviction.--parallel Nlets N requests share the KV cache. Settingn_parallel=1with--cont-batchinggives lowest latency; settingn_parallel >= 4helps throughput at the cost of cache thrash on long contexts./propsfor build introspection. Not OpenAI-compat, but the endpoint that tells you which build is running. Catalog ignores it; ops engineers love it.
How LlamaSwap layers on top¶
LlamaSwap wraps N llama-server
instances behind one HTTP port. The OpenAI surface stays the same; what
changes is that /v1/models enumerates all preset model ids, not
just the loaded one. The catalog accepts this as a valid extension —
status.value of "loaded" / "unloaded" is a documented extra
field.
LlamaSwap's /v1/models/{id} works (it returns the preset) — so a
LlamaSwap-fronted llama.cpp gets a PASS where vanilla gets a WARN.
Probing recipe¶
# Vanilla llama-server on 8080
aioc probe http://localhost:8080 --name llama.cpp
# LlamaSwap fronting llama-server
aioc probe http://localhost:8080 --name llamaswap
# In a Kubernetes setup with NodePort
aioc probe http://192.168.8.158:30184 --name titan-llm
See also¶
- upstream: https://github.com/ggerganov/llama.cpp
- heiervang-technologies fork: https://github.com/heiervang-technologies/ht-llama.cpp