Xantly
Guides

Streaming Responses

Stream tokens from any model as they are generated using standard Server-Sent Events (SSE). Works with every OpenAI-compatible SDK — just set stream: true.

Enabling Streaming

Add "stream": true to your request body:

curl -sSN https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a short poem about distributed systems."}
    ]
  }'

The response is a text/event-stream with chat.completion.chunk events:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{"content":"Packets"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{"content":" scatter"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

SSE Format

Every stream event follows the standard SSE framing:

Part            Description
data: <json>    A ChatCompletionChunk JSON object
data: [DONE]    Terminal sentinel — always the last event
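The framing above is easy to parse by hand. A minimal sketch in Python (not tied to any SDK) that splits `data:` lines and stops at the `[DONE]` sentinel:

```python
import json

def parse_sse_lines(lines):
    """Yield decoded ChatCompletionChunk dicts from raw SSE lines.

    Stops when the terminal "data: [DONE]" sentinel is seen.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Using the example chunks shown above:
raw = [
    'data: {"choices":[{"index":0,"delta":{"content":"Packets"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" scatter"},"finish_reason":null}]}',
    'data: [DONE]',
]
text = "".join(c["choices"][0]["delta"].get("content", "") for c in parse_sse_lines(raw))
print(text)  # Packets scatter
```

In practice the SDKs below do this parsing for you; a manual parser is mainly useful for debugging raw streams.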

ChatCompletionChunk fields

Field                    Type      Description
id                       string    Shared across all chunks in the same stream
object                   string    Always "chat.completion.chunk"
created                  integer   Unix timestamp of the stream start
model                    string    Model that served the request
choices                  array     One ChunkChoice per requested completion (n)
choices[].index          integer   Choice index (0-based)
choices[].delta          object    Incremental content — may carry role, content, tool_calls, or be empty {}
choices[].finish_reason  string?   null until the final chunk; then "stop", "length", "tool_calls", etc.
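Since each delta is only a fragment, clients typically fold the chunks back into one message. A small accumulator over the fields above, sketched with plain dicts rather than SDK objects:

```python
def accumulate(chunks):
    """Fold streamed chunk dicts into (full_text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason

# The example stream from above:
chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Packets"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": " scatter"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
text, reason = accumulate(chunks)
print(text, reason)  # Packets scatter stop
```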

Streaming semantics: Non-voice stream mode is SSE-compatible but not guaranteed token-by-token. The gateway may batch tokens before flushing for efficiency on some provider paths.


Handling Stream Chunks

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    api_key="your-xantly-key",
    base_url="https://api.xantly.com/v1"
)

stream = client.chat.completions.create(
    model="auto",
    stream=True,
    messages=[{"role": "user", "content": "Explain async I/O in 3 bullets."}]
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)

print()  # newline after stream ends

Node.js (openai SDK)

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.XANTLY_API_KEY,
  baseURL: 'https://api.xantly.com/v1',
});

const stream = await client.chat.completions.create({
  model: 'auto',
  stream: true,
  messages: [{ role: 'user', content: 'Explain async I/O in 3 bullets.' }],
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content ?? '';
  process.stdout.write(content);
}

console.log(); // newline

curl with manual parsing

curl -sSN https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","stream":true,"messages":[{"role":"user","content":"Hello"}]}' \
  | while IFS= read -r line; do
      if [[ "$line" == data:* ]]; then
        payload="${line#data: }"
        [[ "$payload" == "[DONE]" ]] && break
        echo "$payload" | python3 -c "
import sys, json
d = json.load(sys.stdin)
c = d['choices'][0]['delta'].get('content','')
print(c, end='', flush=True)
"
      fi
    done
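If jq is available, the per-chunk extraction can be done without Python. A sketch (assumes jq is installed; the canned `printf` input stands in for the live stream):

```shell
# Pull delta text out of SSE lines: strip the "data: " prefix,
# drop the [DONE] sentinel, then let jq print each content fragment.
printf '%s\n' \
  'data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}' \
  'data: [DONE]' \
  | grep '^data: ' \
  | sed 's/^data: //' \
  | grep -v '^\[DONE\]$' \
  | jq -j '.choices[0].delta.content // empty'
echo  # final newline
```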

Stream Options

Include usage in the final chunk

Set stream_options.include_usage to true to receive a terminal usage chunk before [DONE]:

{
  "model": "auto",
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "messages": [...]
}

When supported by the provider, the last data: chunk before [DONE] will contain a usage field:

{
  "id": "chatcmpl-abc",
  "object": "chat.completion.chunk",
  "choices": [],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 87,
    "total_tokens": 118
  }
}

stream_options.include_usage is forwarded to providers where supported. A terminal usage chunk is not guaranteed on all provider paths — do not treat its absence as an error.
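Note that the usage chunk carries an empty choices array, so any loop that indexes chunk.choices[0] unconditionally will fail on it. A defensive consumer, sketched with plain dicts:

```python
def consume(chunks):
    """Collect streamed text and the optional terminal usage object."""
    parts, usage = [], None
    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]               # terminal usage chunk: choices is []
        for choice in chunk.get("choices", []):  # safe when choices == []
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts), usage

chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hi"}, "finish_reason": "stop"}]},
    {"choices": [], "usage": {"prompt_tokens": 31, "completion_tokens": 87, "total_tokens": 118}},
]
text, usage = consume(chunks)
print(text, usage["total_tokens"])  # Hi 118
```

The same guard applies to SDK objects: check that chunk.choices is non-empty before indexing it.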


Error Handling in Streams

If an error occurs before the stream starts, you receive a standard 4xx or 5xx JSON response (not SSE). If an error occurs mid-stream, the stream may terminate early without a [DONE] event.

Pre-stream errors

HTTP 400
{
  "error": {
    "message": "temperature (2.5) must be between 0 and 2",
    "type": "invalid_request_error",
    "code": "validation_error"
  }
}

Detecting truncated streams

Always check finish_reason on the last chunk:

finish_reason    Meaning
stop             Normal completion
length           Hit max_tokens limit — output may be truncated
tool_calls       Model wants to call a tool
content_filter   Output filtered by provider
null             Stream may have been cut short by an error

If you reach [DONE] without a chunk carrying a non-null finish_reason, treat the output as potentially incomplete.
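One way to apply this rule: track whether any chunk carried a non-null finish_reason and flag the output otherwise. A sketch over plain chunk dicts:

```python
def finish_reason_of(chunks):
    """Return the stream's final finish_reason, or None if one was never set."""
    reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            if choice.get("finish_reason") is not None:
                reason = choice["finish_reason"]
    return reason

complete = [{"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}]
truncated = [{"choices": [{"index": 0, "delta": {"content": "partial"}, "finish_reason": None}]}]

assert finish_reason_of(complete) == "stop"
if finish_reason_of(truncated) is None:
    print("warning: stream may be incomplete")  # e.g. retry or surface to the user
```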


SDK Examples

LangChain (Python)

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="auto",
    openai_api_key="your-xantly-key",
    openai_api_base="https://api.xantly.com/v1",
    streaming=True,
)

for chunk in llm.stream("Summarize streaming protocols in 2 sentences."):
    print(chunk.content, end="", flush=True)

Vercel AI SDK (TypeScript)

import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

const xantly = createOpenAI({
  apiKey: process.env.XANTLY_API_KEY!,
  baseURL: 'https://api.xantly.com/v1',
});

const result = await streamText({
  model: xantly('auto'),
  prompt: 'Summarize streaming protocols in 2 sentences.',
});

for await (const textPart of result.textStream) {
  process.stdout.write(textPart);
}

LiteLLM

import litellm

response = litellm.completion(
    model="openai/auto",
    api_base="https://api.xantly.com/v1",
    api_key="your-xantly-key",
    stream=True,
    messages=[{"role": "user", "content": "Hello, stream this."}]
)

for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
