Xantly
Guides

Streaming Responses

Stream tokens from any model as they are generated using standard Server-Sent Events (SSE). Works with every OpenAI-compatible SDK — just set stream: true.

Enabling Streaming

Add "stream": true to your request body:

curl -sSN https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a short poem about distributed systems."}
    ]
  }'

The response is a text/event-stream with chat.completion.chunk events:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{"content":"Packets"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{"content":" scatter"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1741400100,"model":"deepseek-chat","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

SSE Format

Every stream event follows the standard SSE framing:

Part            Description
data: <json>    A ChatCompletionChunk JSON object
data: [DONE]    Terminal sentinel — always the last event
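The framing above is easy to parse by hand. A minimal sketch in Python (not tied to any SDK) that splits `data:` lines and stops at the `[DONE]` sentinel:

```python
import json

def parse_sse_lines(lines):
    """Yield decoded ChatCompletionChunk dicts from raw SSE lines.

    Stops when the terminal "data: [DONE]" sentinel is seen.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Using the example chunks shown above:
raw = [
    'data: {"choices":[{"index":0,"delta":{"content":"Packets"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" scatter"},"finish_reason":null}]}',
    'data: [DONE]',
]
text = "".join(c["choices"][0]["delta"].get("content", "") for c in parse_sse_lines(raw))
print(text)  # Packets scatter
```

In practice the SDKs below do this parsing for you; a manual parser is mainly useful for debugging raw streams.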

ChatCompletionChunk fields

Field                    Type      Description
id                       string    Shared across all chunks in the same stream
object                   string    Always "chat.completion.chunk"
created                  integer   Unix timestamp of the stream start
model                    string    Model that served the request
choices                  array     One ChunkChoice per requested completion (n)
choices[].index          integer   Choice index (0-based)
choices[].delta          object    Incremental content — may carry role, content, tool_calls, or be empty {}
choices[].finish_reason  string?   null until the final chunk; then "stop", "length", "tool_calls", etc.
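Since each delta is only a fragment, clients typically fold the chunks back into one message. A small accumulator over the fields above, sketched with plain dicts rather than SDK objects:

```python
def accumulate(chunks):
    """Fold streamed chunk dicts into (full_text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason

# The example stream from above:
chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Packets"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": " scatter"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
text, reason = accumulate(chunks)
print(text, reason)  # Packets scatter stop
```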

Streaming semantics: Non-voice stream mode is SSE-compatible but not guaranteed token-by-token. The gateway may batch tokens before flushing for efficiency on some provider paths.


Handling Stream Chunks

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    api_key="your-xantly-key",
    base_url="https://api.xantly.com/v1"
)

stream = client.chat.completions.create(
    model="auto",
    stream=True,
    messages=[{"role": "user", "content": "Explain async I/O in 3 bullets."}]
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)

print()  # newline after stream ends

Node.js (openai SDK)

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.XANTLY_API_KEY,
  baseURL: 'https://api.xantly.com/v1',
});

const stream = await client.chat.completions.create({
  model: 'auto',
  stream: true,
  messages: [{ role: 'user', content: 'Explain async I/O in 3 bullets.' }],
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content ?? '';
  process.stdout.write(content);
}

console.log(); // newline

curl with manual parsing

curl -sSN https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","stream":true,"messages":[{"role":"user","content":"Hello"}]}' \
  | while IFS= read -r line; do
      if [[ "$line" == data:* ]]; then
        payload="${line#data: }"
        [[ "$payload" == "[DONE]" ]] && break
        echo "$payload" | python3 -c "
import sys, json
d = json.load(sys.stdin)
c = d['choices'][0]['delta'].get('content','')
print(c, end='', flush=True)
"
      fi
    done
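If jq is available, the per-chunk extraction can be done without Python. A sketch (assumes jq is installed; the canned `printf` input stands in for the live stream):

```shell
# Pull delta text out of SSE lines: strip the "data: " prefix,
# drop the [DONE] sentinel, then let jq print each content fragment.
printf '%s\n' \
  'data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}' \
  'data: [DONE]' \
  | grep '^data: ' \
  | sed 's/^data: //' \
  | grep -v '^\[DONE\]$' \
  | jq -j '.choices[0].delta.content // empty'
echo  # final newline
```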

Stream Options

Include usage in the final chunk

Set stream_options.include_usage to true to receive a terminal usage chunk before [DONE]:

{
  "model": "auto",
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "messages": [...]
}

When supported by the provider, the last data: chunk before [DONE] will contain a usage field:

{
  "id": "chatcmpl-abc",
  "object": "chat.completion.chunk",
  "choices": [],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 87,
    "total_tokens": 118
  }
}

stream_options.include_usage is forwarded to providers where supported. A terminal usage chunk is not guaranteed on all provider paths — do not treat its absence as an error.
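Note that the usage chunk carries an empty choices array, so any loop that indexes chunk.choices[0] unconditionally will fail on it. A defensive consumer, sketched with plain dicts:

```python
def consume(chunks):
    """Collect streamed text and the optional terminal usage object."""
    parts, usage = [], None
    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]               # terminal usage chunk: choices is []
        for choice in chunk.get("choices", []):  # safe when choices == []
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts), usage

chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hi"}, "finish_reason": "stop"}]},
    {"choices": [], "usage": {"prompt_tokens": 31, "completion_tokens": 87, "total_tokens": 118}},
]
text, usage = consume(chunks)
print(text, usage["total_tokens"])  # Hi 118
```

The same guard applies to SDK objects: check that chunk.choices is non-empty before indexing it.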


Error Handling in Streams

If an error occurs before the stream starts, you receive a standard 4xx or 5xx JSON response (not SSE). If an error occurs mid-stream, the stream may terminate early without a [DONE] event.

Pre-stream errors

HTTP 400
{
  "error": {
    "message": "temperature (2.5) must be between 0 and 2",
    "type": "invalid_request_error",
    "code": "validation_error"
  }
}

Detecting truncated streams

Always check finish_reason on the last chunk:

finish_reason    Meaning
stop             Normal completion
length           Hit max_tokens limit — output may be truncated
tool_calls       Model wants to call a tool
content_filter   Output filtered by provider
null             Stream may have been cut short by an error

If you reach [DONE] without a chunk carrying a non-null finish_reason, treat the output as potentially incomplete.
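One way to apply this rule: track whether any chunk carried a non-null finish_reason and flag the output otherwise. A sketch over plain chunk dicts:

```python
def finish_reason_of(chunks):
    """Return the stream's final finish_reason, or None if one was never set."""
    reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            if choice.get("finish_reason") is not None:
                reason = choice["finish_reason"]
    return reason

complete = [{"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}]
truncated = [{"choices": [{"index": 0, "delta": {"content": "partial"}, "finish_reason": None}]}]

assert finish_reason_of(complete) == "stop"
if finish_reason_of(truncated) is None:
    print("warning: stream may be incomplete")  # e.g. retry or surface to the user
```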


SDK Examples

LangChain (Python)

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="auto",
    openai_api_key="your-xantly-key",
    openai_api_base="https://api.xantly.com/v1",
    streaming=True,
)

for chunk in llm.stream("Summarize streaming protocols in 2 sentences."):
    print(chunk.content, end="", flush=True)

Vercel AI SDK (TypeScript)

import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

const xantly = createOpenAI({
  apiKey: process.env.XANTLY_API_KEY!,
  baseURL: 'https://api.xantly.com/v1',
});

const result = await streamText({
  model: xantly('auto'),
  prompt: 'Summarize streaming protocols in 2 sentences.',
});

for await (const textPart of result.textStream) {
  process.stdout.write(textPart);
}

LiteLLM

import litellm

response = litellm.completion(
    model="openai/auto",
    api_base="https://api.xantly.com/v1",
    api_key="your-xantly-key",
    stream=True,
    messages=[{"role": "user", "content": "Hello, stream this."}]
)

for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
