llama.cpp (`llama-server`)¶

The reference low-dependency local inference server. Its OpenAI endpoint is implemented in examples/server/server.cpp and has matured rapidly through 2025.

Surface (current)¶

Endpoint	Status	Notes
`/v1/models`	✅	Returns the loaded model + any preset entries
`/v1/models/{id}`	⚠️	404 in vanilla `llama-server`; LlamaSwap adds it
`/v1/chat/completions`	✅	Tools, JSON mode, schema-constrained sampling
`/v1/chat/completions` (stream)	✅	`[DONE]` sentinel, `usage` opt-in
`/v1/completions`	✅	Legacy text completion still maintained
`/v1/responses`	❌	Not implemented
`/v1/embeddings`	✅	Requires `--embeddings` flag at server start (returns 501 otherwise, which the prober reports as WARN)
`/v1/audio/speech`	❌	No TTS in vanilla; out of scope
`/v1/audio/transcriptions`	❌	Likewise
`/v1/images/generations`	❌	Out of scope
`/v1/videos`	❌	Out of scope

Notable extensions¶

cache_prompt in chat / completion bodies — keeps the prompt KV cache across requests when prefixes match. No spec analog; enormous latency win for chat UIs that pin a system prompt.
grammar / grammar_lazy — GBNF-constrained sampling. The catalog's response_format: json_schema falls through to this on llama.cpp.
mirostat, mirostat_tau, mirostat_eta — alternative sampler. Spec-silent; harmless extras.
repeat_penalty / repeat_last_n — repetition penalty and its lookback window. Note the spelling: a client that sends the vLLM-style repetition_penalty is silently ignored here (unknown field, no error). See Sampling parameters.
Multi-modal: when started with --mmproj, llama-server accepts image_url content parts in chat messages. The mmproj is loaded once at server start; you can't switch projectors per request.

Common deviations the catalog flags¶

/v1/models/{id} 404. WARN. Acceptable; clients rarely use it.
usage missing on streamed responses unless stream_options.include_usage: true is set. The catalog probes without that flag and tolerates absence on streams (PASS).
finish_reason: "length" for max_tokens cap. Spec correct.
finish_reason: "stop" even when the model emits its EOS. Correct per spec; some older shims used "eos".

Quirks worth knowing¶

--jinja matters. Without it, llama.cpp uses its own approximate chat templating, which can disagree with the model's trained format on tool-call syntax. With it, the server respects the GGUF's embedded jinja chat template. Always pass --jinja for modern instruction-tuned models.
n_parallel and KV-cache eviction. --parallel N lets N requests share the KV cache. Setting n_parallel=1 with --cont-batching gives lowest latency; setting n_parallel >= 4 helps throughput at the cost of cache thrash on long contexts.
/props for build introspection. Not OpenAI-compat, but the endpoint that tells you which build is running. Catalog ignores it; ops engineers love it.

How LlamaSwap layers on top¶

LlamaSwap wraps N llama-server instances behind one HTTP port. The OpenAI surface stays the same; what changes is that /v1/models enumerates all preset model ids, not just the loaded one. The catalog accepts this as a valid extension — status.value of "loaded" / "unloaded" is a documented extra field.

LlamaSwap's /v1/models/{id} works (it returns the preset) — so a LlamaSwap-fronted llama.cpp gets a PASS where vanilla gets a WARN.

Probing recipe¶

# Vanilla llama-server on 8080
aioc probe http://localhost:8080 --name llama.cpp

# LlamaSwap fronting llama-server
aioc probe http://localhost:8080 --name llamaswap

# In a Kubernetes setup with NodePort
aioc probe http://<node-ip>:<port> --name llama-cpp-k8s

llama.cpp (llama-server)¶