Chat & completions
The most heavily used surface and the one with the most quiet drift.
/v1/chat/completions (non-streaming)
Required request fields:
Required response fields:
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1730000000,
  "model": "<id>",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "..."},
      "finish_reason": "stop|length|tool_calls|content_filter"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 4,
    "total_tokens": 14
  }
}
The prober's ChatCompletionResponse treats usage as optional because llama.cpp omits it in some configurations, and that's not worth a hard FAIL.
Common deviations
- Missing usage on stream + usage_chunks=false: llama.cpp. Easy fix: pass stream_options: {include_usage: true} if the server supports it.
- finish_reason: "eos" instead of "stop": older llama.cpp shims; modern builds emit "stop".
- No choices[].index: exotic. Hard FAIL.
- Tool calls returned as a string rather than the structured tool_calls array: some Ollama versions when the model wasn't trained for tool use; reportable but not a server defect per se.
Validation rules the prober applies
- choices must be a non-empty array.
- choices[0].message.content must be a string or an array of content parts.
- choices[0].finish_reason must be one of the canonical values or null (some servers emit null mid-stream, but for non-streaming the prober expects a final value).
- usage, when present, must contain prompt_tokens and total_tokens. completion_tokens is allowed to be null.
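The rules above can be sketched as a plain-Python check over an already-parsed JSON body. This is an illustration only, not the prober's actual code: the function name validate_chat_completion and the error strings are hypothetical, and treating a null usage the same as a missing one is an assumption.

```python
from typing import Any

# Canonical finish_reason values, taken from the response shape shown earlier.
CANONICAL_FINISH_REASONS = {"stop", "length", "tool_calls", "content_filter"}


def validate_chat_completion(body: dict[str, Any]) -> list[str]:
    """Return a list of rule violations; an empty list means the body passes."""
    errors: list[str] = []

    choices = body.get("choices")
    if not isinstance(choices, list) or not choices:
        # Nothing else can be checked without at least one choice.
        return ["choices must be a non-empty array"]

    first = choices[0] if isinstance(choices[0], dict) else {}
    message = first.get("message")
    content = message.get("content") if isinstance(message, dict) else None
    if not isinstance(content, (str, list)):
        errors.append("choices[0].message.content must be a string or an array")

    # Non-streaming responses are expected to carry a final value, so null
    # (None) is reported here even though it is legal mid-stream.
    reason = first.get("finish_reason")
    if not isinstance(reason, str) or reason not in CANONICAL_FINISH_REASONS:
        errors.append("choices[0].finish_reason must be a canonical final value")

    usage = body.get("usage")
    if usage is not None:  # usage is optional (assumption: null counts as absent)
        if not isinstance(usage, dict):
            errors.append("usage must be an object when present")
        else:
            for key in ("prompt_tokens", "total_tokens"):
                if usage.get(key) is None:
                    errors.append(f"usage.{key} is required when usage is present")
            # completion_tokens may be null, so it is deliberately not checked.

    return errors
```

Returning a list of violations rather than raising on the first one lets a report show every deviation a server has in a single run.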
/v1/chat/completions (streaming)
A separate catalog row so a missing-stream regression shows up distinctly. Probe sends:
Server must respond with text/event-stream framing:
data: {"id": "chatcmpl-...", "object": "chat.completion.chunk",
"created": 1730000000, "model": "<id>",
"choices": [{"index": 0, "delta": {"role": "assistant"},
"finish_reason": null}]}
data: {"...": "...", "choices": [{"index": 0, "delta": {"content": "h"}}]}
...
data: [DONE]
The prober checks that at least one chunk arrives and that the stream ends with [DONE]. Servers that omit the [DONE] sentinel (a real Ollama bug for a while) get a WARN.
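A minimal sketch of that check, assuming the SSE body has already been decoded into text lines and that each chunk's JSON sits on a single data: line. The function name check_stream is hypothetical, and returning FAIL when no chunk arrives is an assumption; the text above only states the WARN for a missing sentinel.

```python
import json
from typing import Iterable


def check_stream(lines: Iterable[str]) -> tuple[str, int]:
    """Classify a decoded SSE body; returns (status, chunk_count)."""
    chunks = 0
    saw_done = False
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            saw_done = True
            break
        try:
            json.loads(payload)
        except json.JSONDecodeError:
            continue  # assumption: unparseable frames are simply not counted
        chunks += 1

    if chunks == 0:
        return "FAIL", chunks  # assumption: zero chunks is treated as a failure
    if not saw_done:
        return "WARN", chunks  # missing [DONE] sentinel
    return "PASS", chunks
```

For example, check_stream(response_text.splitlines()) could be applied to a fully buffered response, where response_text is a placeholder for the decoded body.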
Stream-specific deviations
- No [DONE] line. Clients using openai-python 1.x usually cope; bare-bones SSE clients may hang waiting for a final frame.
- role only on the first delta. This is canonical; servers that repeat it on every delta are still within spec but inflate bytes.
- Last delta carries the full message instead of a single token. Spec doesn't forbid it; some optimizers do this when the model output is shorter than the streaming flush window.
- include_usage ignored. Servers that don't support stream_options.include_usage should ignore it without erroring. vLLM and llama.cpp do the right thing; some shims 400 the request.
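A client that tolerates the role and delta-size deviations above might fold chunks together along these lines. This is a sketch under assumptions: assemble_message is a hypothetical helper, it only follows choice index 0, and it takes chunks that are already parsed into dicts.

```python
from typing import Any, Iterable


def assemble_message(chunks: Iterable[dict[str, Any]]) -> dict[str, str]:
    """Fold the deltas for choice index 0 into one message."""
    role = ""
    parts: list[str] = []
    for chunk in chunks:
        # A usage-only chunk may carry an empty or missing choices list.
        for choice in chunk.get("choices") or []:
            if choice.get("index", 0) != 0:
                continue
            delta = choice.get("delta") or {}
            # Role is normally sent once; a repeated role just overwrites itself.
            role = delta.get("role") or role
            content = delta.get("content")
            if isinstance(content, str):
                # Works for a single token or a whole message in one delta.
                parts.append(content)
    return {"role": role, "content": "".join(parts)}
```

Because the content pieces are simply concatenated, the result is the same whether a server flushes one token per delta or the entire output in one.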
/v1/completions (legacy text completion)
ext in this catalog. Many newer servers (vLLM ≥ 0.5, Ollama after
the OpenAI-compat refactor) keep it for backward compatibility but
mark it deprecated.
The expected shape:
{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1730000000,
  "model": "<id>",
  "choices": [{"text": "...", "index": 0, "finish_reason": "..."}]
}
Servers that "implement" /v1/completions by silently rerouting to
chat completions and returning the chat shape get a FAIL — the
shapes are different and clients break. This is rare but does happen.
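One way to tell the two shapes apart is sketched below. The helper classify_completion_shape is hypothetical, and the exact signals it uses (the object field plus the presence of text or message in the first choice) are an assumption about what a checker might look at, not the prober's documented logic.

```python
from typing import Any


def classify_completion_shape(body: dict[str, Any]) -> str:
    """Return "text", "chat", or "unknown" for a /v1/completions response body."""
    choices = body.get("choices")
    has_first = isinstance(choices, list) and choices and isinstance(choices[0], dict)
    first = choices[0] if has_first else {}

    if body.get("object") == "text_completion" and isinstance(first.get("text"), str):
        return "text"  # the expected legacy shape
    if body.get("object") == "chat.completion" or "message" in first:
        return "chat"  # rerouted to chat completions; described above as a FAIL
    return "unknown"
```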
/v1/responses (the newer Responses API)
ext. Almost no OSS server implements this fully today. The prober
sends:
…and validates that output exists in the response. A 404 here is
expected and yields SKIP, not FAIL.
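The outcome mapping described above could look roughly like this. The function name responses_outcome is hypothetical; the 404 to SKIP rule and the output check come from the text, while treating every other case as FAIL is an assumption.

```python
from typing import Any, Optional


def responses_outcome(status_code: int, body: Optional[dict[str, Any]]) -> str:
    """Map a /v1/responses probe result to SKIP, PASS, or FAIL."""
    if status_code == 404:
        return "SKIP"  # endpoint not implemented, which is expected
    if status_code == 200 and isinstance(body, dict) and "output" in body:
        return "PASS"  # the only validated field is output
    return "FAIL"  # assumption: anything else counts as a failure
```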