Voice Agents
Build production voice agents that are 80-90% cheaper than direct API calls -- with sub-300ms Fast Lane latency, built-in memory, and zero model lock-in.
Xantly is the infrastructure layer between your voice agent and LLM providers. We handle routing, caching, memory, and cost optimization so you can focus on building the product your customers love.
Why Xantly for Voice
| Challenge | Without Xantly | With Xantly |
|---|---|---|
| Cost | $0.06+ per turn (GPT-4o) | $0.003-0.01 per turn (auto-routed) |
| Latency | 800ms-2s (full pipeline) | <300ms TTFB (Fast Lane) |
| Model lock-in | Hardcoded to one provider | Auto-routes to cheapest qualifying model |
| Memory | You build it yourself | Built-in semantic memory across sessions |
| Scaling | You manage provider keys & rate limits | We handle failover, circuit breakers, rate limits |
Drop-in compatible. If you already use OpenAI's chat completions API, you can switch to Xantly by changing one line -- your base URL.
Quick Start
Option 1: Voice header on chat completions (simplest)
Add the x-xantly-voice header to any chat completion request. Xantly automatically activates the Voice Engine -- sub-300ms routing, semantic caching, and cost optimization.
import openai
client = openai.OpenAI(
api_key="your-xantly-api-key",
base_url="https://api.xantly.com/v1"
)
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "What's my account balance?"}],
extra_headers={"x-xantly-voice": "true"}
)
print(response.choices[0].message.content)
Option 2: Voice-native API (audio in, audio out)
For full audio workflows, use the dedicated voice endpoints:
curl -X POST https://api.xantly.com/v1/voice/chat \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-F "[email protected]" \
-F "model=auto" \
-F "voice=alloy" \
-F "language=en" \
-F "output_format=pcm_16000" \
--output response.pcm
The response is raw audio bytes with metadata in headers:
X-Xantly-Transcript: What's my account balance?
X-Xantly-Response-Text: Your account balance is $1,234.56
X-Xantly-STT-Provider: deepgram
X-Xantly-TTS-Provider: openai
X-Xantly-STT-Latency-Ms: 95.2
X-Xantly-Inference-Latency-Ms: 48.7
X-Xantly-TTS-Latency-Ms: 187.3
X-Xantly-Lane-Used: FastLane
X-Xantly-Model-Used: groq/llama-3.3-70b
X-Xantly-Cost-USD: 0.000320
X-Xantly-Cache-Hit: false
How the Voice Engine Works
Two-Lane Hybrid Architecture
Every voice request is classified in under 1ms into one of two lanes:
Fast Lane -- Simple conversational requests (80% of voice traffic)
- STT (~100ms for English via Deepgram, ~500ms for other languages via Whisper)
- LLM inference via fastest available model (~50-100ms)
- TTS (~100-200ms)
- Total: <300ms end-to-end
Delegation Lane -- Complex multi-step tasks (20% of voice traffic)
- Immediately generates a filler response ("Let me look that up...")
- Executes the full workflow in the background via chain execution
- Returns the real answer when ready
- User never waits in silence
Lane classification uses Aho-Corasick multi-pattern matching (<1ms). Requests with multi-step keywords ("first...then", "calculate", "search for") or >150 words route to Delegation Lane.
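The classification rule above can be sketched in a few lines. This is a hypothetical illustration, not Xantly's implementation: production uses Aho-Corasick for single-pass multi-pattern matching, while this sketch uses plain substring checks, and the keyword list here is only the examples named in the text.

```python
# Illustrative two-lane classifier (hypothetical; Xantly's real version
# uses Aho-Corasick and a larger keyword set).
MULTI_STEP_KEYWORDS = ("first", "then", "calculate", "search for")
WORD_LIMIT = 150  # transcripts over 150 words route to the Delegation Lane

def classify_lane(transcript: str) -> str:
    """Return 'FastLane' or 'DelegationLane' for a voice transcript."""
    text = transcript.lower()
    if len(text.split()) > WORD_LIMIT:
        return "DelegationLane"  # long requests imply complex tasks
    if any(kw in text for kw in MULTI_STEP_KEYWORDS):
        return "DelegationLane"  # multi-step keywords detected
    return "FastLane"            # simple conversational request
```

Substring checks are O(patterns x text) per request; Aho-Corasick walks the text once regardless of pattern count, which is how the classifier stays under 1ms.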
Intelligent Model Routing
When you send model: "auto" (or set voice_fast_lane_model: "auto" in org settings), the Voice Engine selects the optimal model via BaRP (Bandit-Assisted Routing and Personalisation):
- Real-time latency -- EWMA of actual inference times
- Cost -- Routes to the cheapest model that meets the latency SLA
- Circuit breakers -- Disables models that breach P95 latency thresholds
- Per-tenant learning -- LinUCB bandit algorithm learns your traffic patterns
- Voice-specific feedback -- Every voice inference submits latency, cost, and outcome back to BaRP, continuously optimizing model selection for sub-300ms response times
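The first two signals above can be sketched together: keep a latency EWMA per model, then pick the cheapest model whose EWMA meets the SLA. This is a hedged illustration only; BaRP's internals are not public, and the `MODELS` table, its numbers, and both helper names are hypothetical.

```python
# Hypothetical per-model state: model -> (EWMA latency in ms, cost per turn in USD).
MODELS = {
    "groq/llama-3.3-70b": (60.0, 0.0004),
    "openai/gpt-4o-mini": (220.0, 0.0012),
    "openai/gpt-4o": (450.0, 0.0060),
}

def ewma_update(current_ms: float, sample_ms: float, alpha: float = 0.2) -> float:
    """Blend a fresh latency sample into the running EWMA."""
    return alpha * sample_ms + (1 - alpha) * current_ms

def pick_model(sla_ms: float = 300.0) -> str:
    """Cheapest model whose EWMA latency currently meets the latency SLA."""
    qualifying = {m: cost for m, (lat, cost) in MODELS.items() if lat <= sla_ms}
    return min(qualifying, key=qualifying.get)
```

A circuit breaker in this picture is just removing a model from `MODELS` while its EWMA (or P95) breaches the threshold; the bandit layer adds exploration on top of this greedy rule.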
Multi-Provider STT
| Language | Provider | Cost | Latency |
|---|---|---|---|
| English (en-*) | Deepgram Nova-2 | $0.0043/min | ~100ms |
| All others | OpenAI Whisper | $0.006/min | ~500ms |
Routing is automatic -- English audio goes to Deepgram (30% cheaper, lower latency), everything else goes to Whisper. If Deepgram is unavailable, all traffic falls back to Whisper.
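The routing rule reduces to a small decision function. A minimal sketch, assuming a `deepgram_healthy` flag as the availability signal (the function name and flag are ours, not part of the API):

```python
def route_stt(language: str, deepgram_healthy: bool = True) -> str:
    """English goes to Deepgram when it is healthy; every other language,
    and any Deepgram outage, falls back to Whisper."""
    if language.lower().startswith("en") and deepgram_healthy:
        return "deepgram/nova-2"
    return "openai/whisper-1"
```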
Multi-Provider TTS
| Provider | Model | Cost | Best for |
|---|---|---|---|
| OpenAI (default) | gpt-4o-mini-tts | $0.015/1K chars | General purpose, streaming |
| ElevenLabs (override) | eleven_multilingual_v2 | $0.05/1K chars | Premium quality, voice cloning |
Specify voice_provider: "elevenlabs" to use ElevenLabs. Default is OpenAI TTS.
Semantic Caching
The Voice Engine caches at the transcript level with two layers:
| Layer | Lookup | Use case |
|---|---|---|
| Exact match (Redis) | SHA-256 fingerprint | Identical transcripts |
| Semantic match (Qdrant) | Cosine similarity + Jaccard safety | Similar transcripts with different wording |
Voice transcripts are normalized before caching -- filler words ("um", "uh", "like"), repeated words ("the the"), and other STT artifacts are stripped. This means "um what's uh my balance?" hits the same cache entry as "what's my balance?".
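The normalization step can be sketched as below. This is an illustration of the idea, not Xantly's normalizer: the real filler list is longer than the three examples in the text, and the tokenization here is deliberately simple.

```python
import re

FILLERS = {"um", "uh", "like"}  # examples from the docs; the real list is longer

def normalize_transcript(text: str) -> str:
    """Strip filler words and immediate repeats before the cache lookup."""
    words = re.findall(r"[a-z0-9$']+", text.lower())
    out = []
    for w in words:
        if w in FILLERS:
            continue               # drop STT filler artifacts
        if out and out[-1] == w:
            continue               # collapse "the the" -> "the"
        out.append(w)
    return " ".join(out)
```

Both the exact-match fingerprint and the semantic embedding are computed over this normalized form, which is why "um what's uh my balance?" and "what's my balance?" land on the same cache entry.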
Dedicated Voice API
STT Only: Transcribe audio to text
curl -X POST https://api.xantly.com/v1/voice/transcribe \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-F "[email protected]" \
-F "language=en" \
-F "stt_model=groq/whisper-large-v3-turbo"
| Field | Required | Description |
|---|---|---|
| audio | Yes | Input audio file (WAV, PCM, MP3, OGG, WebM) |
| language | No | STT language hint, ISO-639-1 (default: en) |
| stt_model (or model) | No | STT model slug override (e.g. groq/whisper-large-v3-turbo, openai/whisper-1, deepgram/nova-3). Default: language-based auto-routing |
Response:
{
"text": "What is my account balance?",
"provider": "deepgram",
"language": "en",
"cost_usd": 0.000168
}
TTS Only: Synthesize text to audio
curl -X POST https://api.xantly.com/v1/voice/synthesize \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"text": "Your account balance is $1,234.56",
"voice": "alloy",
"provider": "openai",
"model": "openai/gpt-4o-mini-tts",
"output_format": "pcm_16000"
}' \
--output response.pcm
| Field | Required | Description |
|---|---|---|
| text | Yes | Text to synthesize |
| voice | No | Voice profile ID (default: org setting) |
| provider | No | TTS provider (openai, elevenlabs, deepgram, groq, google) |
| model | No | TTS model slug override (e.g. elevenlabs/eleven_flash_v2_5, deepgram/aura-2, openai/tts-1-hd). Default: provider-based auto-routing |
| output_format | No | pcm_16000, mulaw, opus, webm |
Response: raw audio bytes with headers X-Xantly-TTS-Provider, X-Xantly-TTS-Model, X-Xantly-TTS-Latency-Ms, X-Xantly-Cost-USD.
Full Pipeline: Audio in, audio out
curl -X POST https://api.xantly.com/v1/voice/chat \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-F "[email protected]" \
-F "voice=alloy" \
-F "language=en" \
-F "session_id=550e8400-e29b-41d4-a716-446655440000" \
-F "output_format=pcm_16000" \
-F "stt_model=groq/whisper-large-v3-turbo" \
-F "tts_model=elevenlabs/eleven_flash_v2_5"
| Field | Required | Description |
|---|---|---|
| audio | Yes | Input audio file (WAV, PCM, MP3, OGG, WebM) |
| voice | No | TTS voice profile (default: org setting) |
| language | No | STT language hint (default: en) |
| session_id | No | UUID for multi-turn conversation continuity |
| output_format | No | pcm_16000, mulaw, opus, webm (default: pcm_16000) |
| stt_model (or model) | No | STT model slug override -- dispatches to the specific model instead of auto-routing |
| tts_model | No | TTS model slug override -- dispatches to the specific TTS model |
Response headers include full pipeline metadata (transcript, response text, per-stage latency, X-Xantly-STT-Model, X-Xantly-TTS-Model, provider used, cost, cache hit).
Single Turn: JSON voice interaction
For server-to-server integrations where you handle audio encoding yourself:
curl -X POST https://api.xantly.com/v1/voice/turn \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input": {"type": "transcript", "text": "What is my balance?"},
"session_id": "optional-uuid",
"voice_profile": "alloy",
"latency_budget_ms": 300
}'
Returns JSON with response_text, audio_base64, lane_used, model_used, latency breakdown, and cost.
Streaming: WebSocket bidirectional
const ws = new WebSocket("wss://api.xantly.com/v1/voice/realtime", {
headers: { "Authorization": "Bearer " + XANTLY_API_KEY }
});
ws.onopen = () => {
ws.send(JSON.stringify({
input: { type: "transcript", text: "What's the weather like?" },
voice_profile: "nova",
latency_budget_ms: 300
}));
};
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
// data.audio_base64 -- play immediately
// data.response_text -- display as caption
};
S2S Provider Proxying
For direct Speech-to-Speech via native multimodal models, Xantly proxies to upstream providers with automatic session management:
OpenAI Realtime: Set default_provider: "openai_realtime" in org voice settings. Xantly establishes a wss:// connection to OpenAI's Realtime API, sends session configuration, and relays bidirectional audio frames.
Gemini Live: Set default_provider: "gemini_live". Same proxy pattern to Google's BidiGenerateContent WebSocket.
Note: S2S proxy connections bypass Xantly's intelligence pipeline (caching, memory, routing). They are raw provider passthroughs with key management and session setup. Use them when you need native multimodal capabilities that don't exist in the text pipeline.
Voice Sessions
Voice sessions enable multi-turn conversations with context persistence.
Create a session
curl -X POST https://api.xantly.com/v1/voice/sessions \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-H "Content-Type: application/json" \
-d '{"session_id": "optional-uuid"}'
Use session in requests
Pass session_id in your /v1/voice/chat or /v1/voice/turn requests. Xantly maintains:
- Turn history: Last 5 turns cached for context injection
- Chain linking: Active delegation chains persist across turns
- Session TTL: 30 minutes of inactivity
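The "last 5 turns" window behaves like a bounded queue: once a sixth turn arrives, the oldest drops off. A client-side sketch of that server-side behavior (the `SessionContext` class is illustrative, not an SDK type):

```python
from collections import deque

class SessionContext:
    """Mirror of the 5-turn context window a voice session keeps."""

    def __init__(self, max_turns: int = 5):
        # deque(maxlen=...) evicts the oldest turn automatically
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user_text: str, assistant_text: str) -> None:
        self.turns.append((user_text, assistant_text))

    def context(self) -> list:
        """Turns currently eligible for context injection, oldest first."""
        return list(self.turns)
```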
Delete a session
curl -X DELETE https://api.xantly.com/v1/voice/sessions/{session_id} \
-H "Authorization: Bearer $XANTLY_API_KEY"
Voice Cloning
Upload a voice sample to create a custom voice profile backed by ElevenLabs:
curl -X POST https://api.xantly.com/v1/voice/profiles/upload \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-F "profile_name=My Custom Voice" \
-F "file=@voice_sample.wav"
The pipeline:
- Audio file stored in Cloudflare R2
- Sent to ElevenLabs voice cloning API
- Returns a voice_id you can use in any TTS request
{
"id": "uuid",
"profile_name": "My Custom Voice",
"provider": "elevenlabs",
"voice_id": "abc123def456",
"created_at": "2026-03-26T10:00:00Z"
}
Use the voice_id in subsequent requests: "voice": "abc123def456" with "provider": "elevenlabs".
List voice profiles
curl https://api.xantly.com/v1/voice/profiles \
-H "Authorization: Bearer $XANTLY_API_KEY"
Create a voice profile (without cloning)
curl -X POST https://api.xantly.com/v1/voice/profiles \
-H "Authorization: Bearer $XANTLY_API_KEY" \
-H "Content-Type: application/json" \
-d '{"profile_name": "Support Agent", "provider": "openai", "voice_id": "nova"}'
Cost Controls
Per-stage cost breakdown
Every voice response includes a full cost breakdown:
| Component | Typical cost | Cacheable? |
|---|---|---|
| STT (Deepgram) | $0.0043/min | No |
| STT (Whisper) | $0.006/min | No |
| LLM inference | $0.0001-0.003/turn | Yes (semantic cache) |
| TTS (OpenAI) | $0.015/1K chars | No |
| TTS (ElevenLabs) | $0.05/1K chars | No |
On a cache hit, LLM inference cost drops to $0.00.
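Putting the table together for a typical turn: a 10-second English utterance with a 60-character reply, using Deepgram STT, OpenAI TTS, and an assumed mid-range inference cost of $0.001 (the inference figure is our assumption from the table's range):

```python
def turn_cost(audio_seconds: float, reply_chars: int,
              inference_usd: float = 0.001, cache_hit: bool = False) -> float:
    """Estimate one turn's cost in USD from the per-stage rates above."""
    stt = 0.0043 * (audio_seconds / 60)        # Deepgram: $0.0043/min
    llm = 0.0 if cache_hit else inference_usd  # semantic cache hit zeroes this
    tts = 0.015 * (reply_chars / 1000)         # OpenAI TTS: $0.015/1K chars
    return round(stt + llm + tts, 6)
```

For this example turn the estimate is roughly a quarter of a cent, and a cache hit removes the inference term entirely while the STT and TTS costs remain.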
Plan-based voice limits
Voice quotas, rate limits, and concurrent session limits are enforced per plan. Per-org overrides are available via org_settings.
| Plan | Voice RPM | Concurrent sessions | Monthly minutes | Platform fee / min |
|---|---|---|---|---|
| Free | 3 | 1 | 3 min lifetime (one-time demo) | $0.00 |
| Pro | 60 | 5 | 500 | $0.02 |
| Scale | 500 | 25 | 5,000 | $0.015 |
| Pay-As-You-Go | 30 | 5 | Unlimited (credit-bounded) | $0.025 |
Voice costs are pass-through (provider cost) plus the per-minute platform fee above. Per-org overrides:
| Setting | Column on org_settings |
|---|---|
| RPM override | voice_rpm_limit |
| Concurrent session override | voice_concurrent_session_limit |
| Monthly minutes override | voice_monthly_minutes_limit |
| Platform fee override | voice_platform_fee_per_min |
| Monthly budget cap | voice_monthly_budget_usd |
Quota warnings fire once at 80%, 90%, and 100% per billing period. When the monthly quota or budget cap is exceeded, requests return 429 Too Many Requests. When the concurrent session limit is reached, requests also return 429. When PAYG credit balance falls below the minimum floor ($0.05), voice requests return 402 Payment Required.
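Clients should branch on these two status codes differently: 429 is transient (retry after backoff, or shed load), while 402 requires a billing action. A minimal sketch of that mapping (the function name and action strings are ours; the status-code semantics are from the docs):

```python
def quota_action(status: int) -> str:
    """Map Xantly voice quota status codes to a suggested client action."""
    if status == 429:
        return "back_off_and_retry"  # quota, budget cap, or concurrent-session limit
    if status == 402:
        return "top_up_credits"      # PAYG balance fell below the $0.05 floor
    return "proceed"
```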
Migrating from VAPI / Direct API
- Change your base URL to https://api.xantly.com/v1
- Set your model to "auto" -- Xantly handles routing
- Add the voice header -- x-xantly-voice: true
- Remove your model-switching logic -- Xantly does this for you
- Remove your caching layer -- Xantly's semantic cache is built in
Before (direct OpenAI)
client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": transcript}]
)
After (Xantly)
client = openai.OpenAI(
api_key="xantly-key",
base_url="https://api.xantly.com/v1"
)
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": transcript}],
extra_headers={"x-xantly-voice": "true"}
)
Same SDK. Same code pattern. 80-90% less cost.
Response Headers Reference
| Header | Description |
|---|---|
| X-Xantly-Transcript | STT transcript of input audio (truncated to 500 chars) |
| X-Xantly-Response-Text | LLM response text (truncated to 500 chars) |
| X-Xantly-STT-Provider | The STT provider that served this request (deepgram, openai, groq, elevenlabs) |
| X-Xantly-TTS-Provider | The TTS provider that served this request (openai, elevenlabs, deepgram, groq, google) |
| X-Xantly-STT-Model | Specific STT model slug used (e.g. groq/whisper-large-v3-turbo, openai/whisper-1) |
| X-Xantly-TTS-Model | Specific TTS model slug used (e.g. elevenlabs/eleven_flash_v2_5, openai/gpt-4o-mini-tts) |
| X-Xantly-STT-Latency-Ms | Speech-to-text stage latency |
| X-Xantly-Inference-Latency-Ms | LLM inference stage latency |
| X-Xantly-TTS-Latency-Ms | Text-to-speech stage latency |
| X-Xantly-Lane-Used | FastLane or DelegationLane |
| X-Xantly-Model-Used | Provider/model that served inference |
| X-Xantly-Cost-USD | Total cost for this request (provider pass-through + platform fee) |
| X-Xantly-Cache-Hit | true if served from semantic cache |
What's Next
- Cost-Optimized Routing -- Fine-tune routing behavior with hints and overrides
- Streaming Responses -- Enable SSE streaming for text responses
- Chat Completions API -- Full API reference for the unified endpoint
Frequently Asked Questions
How much cheaper is Xantly for voice?
Xantly reduces voice agent costs by 80-90% compared to direct API calls — from $0.06+ per turn with GPT-4o to $0.003-0.01 per turn with auto-routing. Cost savings come from intelligent model selection (routing simple conversational requests to fast, cheap models), semantic caching at the transcript level (eliminating redundant LLM calls), and transcript normalization that strips filler words to maximize cache hit rates.
What voice models does Xantly support?
Xantly supports 30+ voice models across the full pipeline: STT models include Whisper (OpenAI, Groq), Deepgram Nova-2/Nova-3, and ElevenLabs Scribe; TTS models include OpenAI gpt-4o-mini-tts, ElevenLabs (eleven_multilingual_v2, eleven_flash_v2_5), Deepgram Aura-2, and Google Cloud TTS; plus Realtime bidirectional audio (OpenAI Realtime, Gemini Live) and Audio LLMs for native multimodal processing.
What latency can I expect?
Xantly's Fast Lane delivers sub-300ms end-to-end latency for 80% of voice requests: ~100ms STT (Deepgram for English), ~50-100ms LLM inference via the fastest available model, and ~100-200ms TTS. Complex multi-step requests route to the Delegation Lane, which immediately returns a filler response so the user never waits in silence, then executes the full workflow in the background.
Can I use my own voice provider API keys?
Yes. Xantly supports Bring Your Own Key (BYOK) for all voice providers — including OpenAI, Deepgram, ElevenLabs, Groq, and Google. BYOK lets you use your own billing relationship and negotiated rates while still benefiting from Xantly's intelligent routing, semantic caching, and automatic failover across providers.