
Voice Agents

Build production voice agents that are 80-90% cheaper than direct API calls -- with guaranteed sub-300ms latency, built-in memory, and zero model lock-in.

Xantly is the infrastructure layer between your voice agent and LLM providers. We handle routing, caching, memory, and cost optimization so you can focus on building the product your customers love.


Why Xantly for Voice

Challenge     | Without Xantly                         | With Xantly
Cost          | $0.06+ per turn (GPT-4o)               | $0.003-0.01 per turn (auto-routed)
Latency       | 800ms-2s (full pipeline)               | <300ms TTFB (Fast Lane)
Model lock-in | Hardcoded to one provider              | Auto-routes to cheapest qualifying model
Memory        | You build it yourself                  | Built-in semantic memory across sessions
Scaling       | You manage provider keys & rate limits | We handle failover, circuit breakers, rate limits

Drop-in compatible. If you already use OpenAI's chat completions API, you can switch to Xantly by changing one line -- your base URL.


Quick Start

Option 1: Voice header on chat completions (simplest)

Add the x-xantly-voice header to any chat completion request. Xantly automatically activates the Voice Engine -- sub-300ms routing, semantic caching, and cost optimization.

import openai

client = openai.OpenAI(
    api_key="your-xantly-api-key",
    base_url="https://api.xantly.com/v1"
)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's my account balance?"}],
    extra_headers={"x-xantly-voice": "true"}
)

print(response.choices[0].message.content)

Option 2: Voice-native API (audio in, audio out)

For full audio workflows, use the dedicated voice endpoints:

curl -X POST https://api.xantly.com/v1/voice/chat \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "[email protected]" \
  -F "model=auto" \
  -F "voice=alloy" \
  -F "language=en" \
  -F "output_format=pcm_16000" \
  --output response.pcm

The response is raw audio bytes with metadata in headers:

X-Xantly-Transcript: What's my account balance?
X-Xantly-Response-Text: Your account balance is $1,234.56
X-Xantly-STT-Provider: deepgram
X-Xantly-TTS-Provider: openai
X-Xantly-STT-Latency-Ms: 95.2
X-Xantly-Inference-Latency-Ms: 48.7
X-Xantly-TTS-Latency-Ms: 187.3
X-Xantly-Lane-Used: FastLane
X-Xantly-Model-Used: groq/llama-3.3-70b
X-Xantly-Cost-USD: 0.000320
X-Xantly-Cache-Hit: false
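
The same full-pipeline call can be made from Python. The sketch below uses the third-party `requests` library; the `parse_voice_metadata` helper is our own illustration, not part of any SDK.

```python
def parse_voice_metadata(headers):
    """Collect the pipeline metadata headers shown above into a plain dict."""
    return {
        "transcript": headers.get("X-Xantly-Transcript"),
        "lane": headers.get("X-Xantly-Lane-Used"),
        "model": headers.get("X-Xantly-Model-Used"),
        "cost_usd": float(headers.get("X-Xantly-Cost-USD", "0")),
        "cache_hit": headers.get("X-Xantly-Cache-Hit") == "true",
    }

def voice_chat(audio_path, api_key):
    """Send one audio turn through /v1/voice/chat and return
    (raw audio bytes, metadata dict)."""
    import requests  # third-party: pip install requests
    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://api.xantly.com/v1/voice/chat",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"audio": f},
            data={"model": "auto", "voice": "alloy",
                  "language": "en", "output_format": "pcm_16000"},
        )
    resp.raise_for_status()
    return resp.content, parse_voice_metadata(resp.headers)
```

Because `parse_voice_metadata` accepts any mapping, it can be exercised without a network call.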

How the Voice Engine Works

Two-Lane Hybrid Architecture

Every voice request is classified in under 1ms into one of two lanes:

Fast Lane -- Simple conversational requests (80% of voice traffic)

  • STT (~100ms for English via Deepgram, ~500ms for other languages via Whisper)
  • LLM inference via fastest available model (~50-100ms)
  • TTS (~100-200ms)
  • Total: <300ms end-to-end

Delegation Lane -- Complex multi-step tasks (20% of voice traffic)

  • Immediately generates a filler response ("Let me look that up...")
  • Executes the full workflow in the background via chain execution
  • Returns the real answer when ready
  • User never waits in silence

Lane classification uses Aho-Corasick multi-pattern matching (<1ms). Requests with multi-step keywords ("first...then", "calculate", "search for") or >150 words route to Delegation Lane.
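
The classification rule can be sketched in a few lines. The real engine uses Aho-Corasick multi-pattern matching for sub-millisecond scans; plain substring checks are enough to illustrate the routing decision, and the pattern list here is illustrative, not the production set.

```python
# Illustrative lane classifier — substring checks stand in for the
# engine's Aho-Corasick matcher.
DELEGATION_PATTERNS = ("first", "then", "calculate", "search for")

def classify_lane(transcript: str) -> str:
    text = transcript.lower()
    if len(text.split()) > 150:
        return "DelegationLane"  # long requests always delegate
    if any(p in text for p in DELEGATION_PATTERNS):
        return "DelegationLane"  # multi-step keywords
    return "FastLane"
```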

Intelligent Model Routing

When you send model: "auto" (or set voice_fast_lane_model: "auto" in org settings), the Voice Engine selects the optimal model via BaRP (Bandit-Assisted Routing and Personalisation):

  1. Real-time latency -- EWMA of actual inference times
  2. Cost -- Routes to the cheapest model that meets the latency SLA
  3. Circuit breakers -- Disables models that breach P95 latency thresholds
  4. Per-tenant learning -- LinUCB bandit algorithm learns your traffic patterns
  5. Voice-specific feedback -- Every voice inference submits latency, cost, and outcome back to BaRP, continuously optimizing model selection for sub-300ms response times
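
Steps 1-3 amount to "cheapest model whose tracked latency fits the SLA and whose circuit breaker is closed." A minimal sketch, with illustrative model names and prices (the LinUCB bandit layer is omitted):

```python
def ewma(prev: float, sample: float, alpha: float = 0.2) -> float:
    """Exponentially weighted moving average of observed latency."""
    return alpha * sample + (1 - alpha) * prev

def pick_model(models, sla_ms=300):
    """Cheapest model that meets the latency SLA and is not circuit-broken."""
    qualifying = [m for m in models
                  if m["latency_ms"] <= sla_ms and not m["tripped"]]
    return min(qualifying, key=lambda m: m["cost_per_turn"], default=None)

# Illustrative catalog, not Xantly's live model set or pricing.
models = [
    {"name": "groq/llama-3.3-70b", "latency_ms": 80,
     "cost_per_turn": 0.0004, "tripped": False},
    {"name": "cheap/but-tripped", "latency_ms": 60,
     "cost_per_turn": 0.0002, "tripped": True},   # breaker open: excluded
    {"name": "openai/gpt-4o", "latency_ms": 450,
     "cost_per_turn": 0.06, "tripped": False},    # too slow: excluded
]
```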

Multi-Provider STT

Language       | Provider        | Cost        | Latency
English (en-*) | Deepgram Nova-2 | $0.0043/min | ~100ms
All others     | OpenAI Whisper  | $0.006/min  | ~500ms

Routing is automatic -- English audio goes to Deepgram (30% cheaper, lower latency), everything else goes to Whisper. If Deepgram is unavailable, all traffic falls back to Whisper.
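
The routing rule, including the Whisper fallback, reduces to a one-liner (a sketch with illustrative provider IDs):

```python
def route_stt(language: str, deepgram_up: bool = True) -> str:
    """English goes to Deepgram (cheaper, lower latency); everything else
    to Whisper, which is also the fallback when Deepgram is unavailable."""
    if deepgram_up and language.lower().startswith("en"):
        return "deepgram"
    return "whisper"
```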

Multi-Provider TTS

Provider              | Model                  | Cost            | Best for
OpenAI (default)      | gpt-4o-mini-tts        | $0.015/1K chars | General purpose, streaming
ElevenLabs (override) | eleven_multilingual_v2 | $0.05/1K chars  | Premium quality, voice cloning

Specify voice_provider: "elevenlabs" to use ElevenLabs. Default is OpenAI TTS.

Semantic Caching

The Voice Engine caches at the transcript level with two layers:

Layer                   | Lookup                             | Use case
Exact match (Redis)     | SHA-256 fingerprint                | Identical transcripts
Semantic match (Qdrant) | Cosine similarity + Jaccard safety | Similar transcripts with different wording

Voice transcripts are normalized before caching -- filler words ("um", "uh", "like"), repeated words ("the the"), and other STT artifacts are stripped. This means "um what's uh my balance?" hits the same cache entry as "what's my balance?".
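
The normalization described above can be sketched as a small function (our own illustration; the filler list and tokenization are simplified):

```python
import re

FILLERS = {"um", "uh", "like"}

def normalize_transcript(text: str) -> str:
    """Strip filler words and immediate repeats so similar utterances
    produce the same cache key."""
    words = re.findall(r"[a-z0-9$',.?!-]+", text.lower())
    out = []
    for w in words:
        if w in FILLERS:
            continue          # drop STT filler words
        if out and out[-1] == w:
            continue          # drop "the the"-style repeats
        out.append(w)
    return " ".join(out)
```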


Dedicated Voice API

STT Only: Transcribe audio to text

curl -X POST https://api.xantly.com/v1/voice/transcribe \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "[email protected]" \
  -F "language=en" \
  -F "stt_model=groq/whisper-large-v3-turbo"

Field                | Required | Description
audio                | Yes      | Input audio file (WAV, PCM, MP3, OGG, WebM)
language             | No       | STT language hint, ISO-639-1 (default: en)
stt_model (or model) | No       | STT model slug override (e.g. groq/whisper-large-v3-turbo, openai/whisper-1, deepgram/nova-3). Default: language-based auto-routing

Response:

{
  "text": "What is my account balance?",
  "provider": "deepgram",
  "language": "en",
  "cost_usd": 0.000168
}

TTS Only: Synthesize text to audio

curl -X POST https://api.xantly.com/v1/voice/synthesize \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your account balance is $1,234.56",
    "voice": "alloy",
    "provider": "openai",
    "model": "openai/gpt-4o-mini-tts",
    "output_format": "pcm_16000"
  }' \
  --output response.pcm

Field         | Required | Description
text          | Yes      | Text to synthesize
voice         | No       | Voice profile ID (default: org setting)
provider      | No       | TTS provider (openai, elevenlabs, deepgram, groq, google)
model         | No       | TTS model slug override (e.g. elevenlabs/eleven_flash_v2_5, deepgram/aura-2, openai/tts-1-hd). Default: provider-based auto-routing
output_format | No       | pcm_16000, mulaw, opus, webm

Response: raw audio bytes with headers X-Xantly-TTS-Provider, X-Xantly-TTS-Model, X-Xantly-TTS-Latency-Ms, X-Xantly-Cost-USD.

Full Pipeline: Audio in, audio out

curl -X POST https://api.xantly.com/v1/voice/chat \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "[email protected]" \
  -F "voice=alloy" \
  -F "language=en" \
  -F "session_id=550e8400-e29b-41d4-a716-446655440000" \
  -F "output_format=pcm_16000" \
  -F "stt_model=groq/whisper-large-v3-turbo" \
  -F "tts_model=elevenlabs/eleven_flash_v2_5"

Field                | Required | Description
audio                | Yes      | Input audio file (WAV, PCM, MP3, OGG, WebM)
voice                | No       | TTS voice profile (default: org setting)
language             | No       | STT language hint (default: en)
session_id           | No       | UUID for multi-turn conversation continuity
output_format        | No       | pcm_16000, mulaw, opus, webm (default: pcm_16000)
stt_model (or model) | No       | STT model slug override — dispatches to the specific model instead of auto-routing
tts_model            | No       | TTS model slug override — dispatches to the specific TTS model

Response headers include full pipeline metadata (transcript, response text, per-stage latency, X-Xantly-STT-Model, X-Xantly-TTS-Model, provider used, cost, cache hit).

Single Turn: JSON voice interaction

For server-to-server integrations where you handle audio encoding yourself:

curl -X POST https://api.xantly.com/v1/voice/turn \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {"type": "transcript", "text": "What is my balance?"},
    "session_id": "optional-uuid",
    "voice_profile": "alloy",
    "latency_budget_ms": 300
  }'

Returns JSON with response_text, audio_base64, lane_used, model_used, latency breakdown, and cost.
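
The JSON turn maps cleanly to a small Python helper. The payload builder mirrors the request body documented above; the helper names are ours, and `requests` is a third-party dependency.

```python
def build_turn_payload(text, session_id=None, latency_budget_ms=300):
    """Assemble the /v1/voice/turn request body shown above."""
    payload = {
        "input": {"type": "transcript", "text": text},
        "voice_profile": "alloy",
        "latency_budget_ms": latency_budget_ms,
    }
    if session_id:
        payload["session_id"] = session_id
    return payload

def voice_turn(text, api_key, session_id=None):
    """POST one turn and return the parsed JSON response."""
    import requests  # third-party: pip install requests
    resp = requests.post(
        "https://api.xantly.com/v1/voice/turn",
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_turn_payload(text, session_id),
    )
    resp.raise_for_status()
    return resp.json()
```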

Streaming: WebSocket bidirectional

// Node.js "ws" package — the browser WebSocket API cannot set custom headers
const ws = new WebSocket("wss://api.xantly.com/v1/voice/realtime", {
  headers: { "Authorization": "Bearer " + XANTLY_API_KEY }
});

ws.onopen = () => {
  ws.send(JSON.stringify({
    input: { type: "transcript", text: "What's the weather like?" },
    voice_profile: "nova",
    latency_budget_ms: 300
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  // data.audio_base64 -- play immediately
  // data.response_text -- display as caption
};

S2S Provider Proxying

For direct Speech-to-Speech via native multimodal models, Xantly proxies to upstream providers with automatic session management:

OpenAI Realtime: Set default_provider: "openai_realtime" in org voice settings. Xantly establishes a wss:// connection to OpenAI's Realtime API, sends session configuration, and relays bidirectional audio frames.

Gemini Live: Set default_provider: "gemini_live". Same proxy pattern to Google's BidiGenerateContent WebSocket.

Note: S2S proxy connections bypass Xantly's intelligence pipeline (caching, memory, routing). They are raw provider passthroughs with key management and session setup. Use them when you need native multimodal capabilities that don't exist in the text pipeline.


Voice Sessions

Voice sessions enable multi-turn conversations with context persistence.

Create a session

curl -X POST https://api.xantly.com/v1/voice/sessions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "optional-uuid"}'

Use session in requests

Pass session_id in your /v1/voice/chat or /v1/voice/turn requests. Xantly maintains:

  • Turn history: Last 5 turns cached for context injection
  • Chain linking: Active delegation chains persist across turns
  • Session TTL: 30 minutes of inactivity
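
The "last 5 turns" behavior is easy to model client-side; this is a toy sketch of the per-session state, not Xantly's storage layer.

```python
from collections import deque

class VoiceSession:
    """Toy model of session state: the most recent 5 turns are kept
    for context injection; older turns fall off."""
    MAX_TURNS = 5

    def __init__(self, session_id):
        self.session_id = session_id
        self.turns = deque(maxlen=self.MAX_TURNS)

    def add_turn(self, user_text, assistant_text):
        self.turns.append({"user": user_text, "assistant": assistant_text})

    def context(self):
        return list(self.turns)
```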

Delete a session

curl -X DELETE https://api.xantly.com/v1/voice/sessions/{session_id} \
  -H "Authorization: Bearer $XANTLY_API_KEY"

Voice Cloning

Upload a voice sample to create a custom voice profile backed by ElevenLabs:

curl -X POST https://api.xantly.com/v1/voice/profiles/upload \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "profile_name=My Custom Voice" \
  -F "file=@voice_sample.wav"

The pipeline:

  1. Audio file stored in Cloudflare R2
  2. Sent to ElevenLabs voice cloning API
  3. Returns a voice_id you can use in any TTS request

Response:

{
  "id": "uuid",
  "profile_name": "My Custom Voice",
  "provider": "elevenlabs",
  "voice_id": "abc123def456",
  "created_at": "2026-03-26T10:00:00Z"
}

Use the voice_id in subsequent requests: "voice": "abc123def456" with "provider": "elevenlabs".

List voice profiles

curl https://api.xantly.com/v1/voice/profiles \
  -H "Authorization: Bearer $XANTLY_API_KEY"

Create a voice profile (without cloning)

curl -X POST https://api.xantly.com/v1/voice/profiles \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"profile_name": "Support Agent", "provider": "openai", "voice_id": "nova"}'

Cost Controls

Per-stage cost breakdown

Every voice response includes a full cost breakdown:

Component        | Typical cost        | Cacheable?
STT (Deepgram)   | $0.0043/min         | No
STT (Whisper)    | $0.006/min          | No
LLM inference    | $0.0001-0.003/turn  | Yes (semantic cache)
TTS (OpenAI)     | $0.015/1K chars     | No
TTS (ElevenLabs) | $0.05/1K chars      | No

On a cache hit, LLM inference cost drops to $0.00.
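
As a back-of-envelope check of the table above, consider a 4-second English utterance with a ~60-character spoken reply (illustrative numbers: Deepgram STT, an assumed Fast Lane model price, OpenAI TTS):

```python
# Illustrative per-turn cost using the rates in the table above.
stt_cost = 0.0043 * (4 / 60)    # Deepgram: 4 seconds of audio, billed per minute
llm_cost = 0.0004               # assumed Fast Lane model price, cache miss
tts_cost = 0.015 * (60 / 1000)  # OpenAI TTS: ~60 chars at $0.015/1K chars
total = stt_cost + llm_cost + tts_cost  # roughly $0.0016

# On a semantic cache hit, the inference component drops to $0.00:
total_cached = stt_cost + tts_cost      # roughly $0.0012
```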

Plan-based voice limits

Voice quotas, rate limits, and concurrent session limits are enforced per plan. Per-org overrides are available via org_settings.

Plan          | Voice RPM | Concurrent sessions | Monthly minutes                | Platform fee / min
Free          | 3         | 1                   | 3 min lifetime (one-time demo) | $0.00
Pro           | 60        | 5                   | 500                            | $0.02
Scale         | 500       | 25                  | 5,000                          | $0.015
Pay-As-You-Go | 30        | 5                   | Unlimited (credit-bounded)     | $0.025

Voice costs are pass-through (provider cost) plus the per-minute platform fee above. Per-org overrides:

Setting                     | Column on org_settings
RPM override                | voice_rpm_limit
Concurrent session override | voice_concurrent_session_limit
Monthly minutes override    | voice_monthly_minutes_limit
Platform fee override       | voice_platform_fee_per_min
Monthly budget cap          | voice_monthly_budget_usd

Quota warnings fire once at 80%, 90%, and 100% per billing period. When the monthly quota or budget cap is exceeded, requests return 429 Too Many Requests. When the concurrent session limit is reached, requests also return 429. When PAYG credit balance falls below the minimum floor ($0.05), voice requests return 402 Payment Required.
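
A client can map these status codes to actions. The sketch below is our own error-handling pattern, not an SDK feature; `requests` is a third-party dependency.

```python
def classify_quota_error(status_code: int) -> str:
    """Map the voice-quota status codes above to a client action."""
    if status_code == 429:
        return "retry"    # monthly quota, budget cap, or concurrent-session limit
    if status_code == 402:
        return "top_up"   # PAYG credit below the $0.05 minimum floor
    return "ok"

def post_with_quota_handling(url, api_key, retries=3, **kwargs):
    """POST with exponential backoff on 429; surface 402 immediately."""
    import time
    import requests  # third-party: pip install requests
    for attempt in range(retries):
        resp = requests.post(
            url, headers={"Authorization": f"Bearer {api_key}"}, **kwargs)
        action = classify_quota_error(resp.status_code)
        if action == "retry":
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
            continue
        if action == "top_up":
            raise RuntimeError("402: PAYG credit below the $0.05 floor")
        resp.raise_for_status()
        return resp
    raise RuntimeError("429: still rate-limited after retries")
```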


Migrating from VAPI / Direct API

  1. Change your base URL to https://api.xantly.com/v1
  2. Set your model to "auto" -- Xantly handles routing
  3. Add the voice header -- x-xantly-voice: true
  4. Remove your model-switching logic -- Xantly does this for you
  5. Remove your caching layer -- Xantly's semantic cache is built in

Before (direct OpenAI)

client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript}]
)

After (Xantly)

client = openai.OpenAI(
    api_key="xantly-key",
    base_url="https://api.xantly.com/v1"
)
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": transcript}],
    extra_headers={"x-xantly-voice": "true"}
)

Same SDK. Same code pattern. 80-90% less cost.


Response Headers Reference

Header                        | Description
X-Xantly-Transcript           | STT transcript of input audio (truncated to 500 chars)
X-Xantly-Response-Text        | LLM response text (truncated to 500 chars)
X-Xantly-STT-Provider         | The STT provider that served this request (deepgram, openai, groq, elevenlabs)
X-Xantly-TTS-Provider         | The TTS provider that served this request (openai, elevenlabs, deepgram, groq, google)
X-Xantly-STT-Model            | Specific STT model slug used (e.g. groq/whisper-large-v3-turbo, openai/whisper-1)
X-Xantly-TTS-Model            | Specific TTS model slug used (e.g. elevenlabs/eleven_flash_v2_5, openai/gpt-4o-mini-tts)
X-Xantly-STT-Latency-Ms       | Speech-to-text stage latency
X-Xantly-Inference-Latency-Ms | LLM inference stage latency
X-Xantly-TTS-Latency-Ms       | Text-to-speech stage latency
X-Xantly-Lane-Used            | FastLane or DelegationLane
X-Xantly-Model-Used           | Provider/model that served inference
X-Xantly-Cost-USD             | Total cost for this request (provider pass-through + platform fee)
X-Xantly-Cache-Hit            | true if served from semantic cache

Frequently Asked Questions

How much cheaper is Xantly for voice?

Xantly reduces voice agent costs by 80-90% compared to direct API calls — from $0.06+ per turn with GPT-4o to $0.003-0.01 per turn with auto-routing. Cost savings come from intelligent model selection (routing simple conversational requests to fast, cheap models), semantic caching at the transcript level (eliminating redundant LLM calls), and transcript normalization that strips filler words to maximize cache hit rates.

What voice models does Xantly support?

Xantly supports 30+ voice models across the full pipeline: STT models include Whisper (OpenAI, Groq), Deepgram Nova-2/Nova-3, and ElevenLabs Scribe; TTS models include OpenAI gpt-4o-mini-tts, ElevenLabs (eleven_multilingual_v2, eleven_flash_v2_5), Deepgram Aura-2, and Google Cloud TTS; plus Realtime bidirectional audio (OpenAI Realtime, Gemini Live) and Audio LLMs for native multimodal processing.

What latency can I expect?

Xantly's Fast Lane delivers sub-300ms end-to-end latency for 80% of voice requests: ~100ms STT (Deepgram for English), ~50-100ms LLM inference via the fastest available model, and ~100-200ms TTS. Complex multi-step requests route to the Delegation Lane, which immediately returns a filler response so the user never waits in silence, then executes the full workflow in the background.

Can I use my own voice provider API keys?

Yes. Xantly supports Bring Your Own Key (BYOK) for all voice providers — including OpenAI, Deepgram, ElevenLabs, Groq, and Google. BYOK lets you use your own billing relationship and negotiated rates while still benefiting from Xantly's intelligent routing, semantic caching, and automatic failover across providers.
