
Voice Agents

Build production voice agents that are 80-90% cheaper than direct API calls -- with guaranteed sub-300ms latency, built-in memory, and zero model lock-in.

Xantly is the infrastructure layer between your voice agent and LLM providers. We handle routing, caching, memory, and cost optimization so you can focus on building the product your customers love.


Why Xantly for Voice

Challenge     | Without Xantly                         | With Xantly
Cost          | $0.06+ per turn (GPT-4o)               | $0.003-0.01 per turn (auto-routed)
Latency       | 800ms-2s (full pipeline)               | <300ms TTFB (Fast Lane)
Model lock-in | Hardcoded to one provider              | Auto-routes to cheapest qualifying model
Memory        | You build it yourself                  | Built-in semantic memory across sessions
Scaling       | You manage provider keys & rate limits | We handle failover, circuit breakers, rate limits

Drop-in compatible. If you already use OpenAI's chat completions API, you can switch to Xantly by changing one line -- your base URL.


Quick Start

Option 1: Voice header on chat completions (simplest)

Add the x-xantly-voice header to any chat completion request. Xantly automatically activates the Voice Engine -- sub-300ms routing, semantic caching, and cost optimization.

import openai

client = openai.OpenAI(
    api_key="your-xantly-api-key",
    base_url="https://api.xantly.com/v1"
)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's my account balance?"}],
    extra_headers={"x-xantly-voice": "true"}
)

print(response.choices[0].message.content)

Option 2: Voice-native API (audio in, audio out)

For full audio workflows, use the dedicated voice endpoints:

curl -X POST https://api.xantly.com/v1/voice/chat \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "[email protected]" \
  -F "model=auto" \
  -F "voice=alloy" \
  -F "language=en" \
  -F "output_format=pcm_16000" \
  --output response.pcm

The response is raw audio bytes with metadata in headers:

X-Xantly-Transcript: What's my account balance?
X-Xantly-Response-Text: Your account balance is $1,234.56
X-Xantly-STT-Provider: deepgram
X-Xantly-TTS-Provider: openai
X-Xantly-STT-Latency-Ms: 95.2
X-Xantly-Inference-Latency-Ms: 48.7
X-Xantly-TTS-Latency-Ms: 187.3
X-Xantly-Lane-Used: FastLane
X-Xantly-Model-Used: groq/llama-3.3-70b
X-Xantly-Cost-USD: 0.000320
X-Xantly-Cache-Hit: false
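
The same full-pipeline call can be made from Python. The sketch below uses the third-party `requests` library; the `parse_voice_metadata` helper is our own illustration, not part of any SDK.

```python
def parse_voice_metadata(headers):
    """Collect the pipeline metadata headers shown above into a plain dict."""
    return {
        "transcript": headers.get("X-Xantly-Transcript"),
        "lane": headers.get("X-Xantly-Lane-Used"),
        "model": headers.get("X-Xantly-Model-Used"),
        "cost_usd": float(headers.get("X-Xantly-Cost-USD", "0")),
        "cache_hit": headers.get("X-Xantly-Cache-Hit") == "true",
    }

def voice_chat(audio_path, api_key):
    """Send one audio turn through /v1/voice/chat and return
    (raw audio bytes, metadata dict)."""
    import requests  # third-party: pip install requests
    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://api.xantly.com/v1/voice/chat",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"audio": f},
            data={"model": "auto", "voice": "alloy",
                  "language": "en", "output_format": "pcm_16000"},
        )
    resp.raise_for_status()
    return resp.content, parse_voice_metadata(resp.headers)
```

Because `parse_voice_metadata` accepts any mapping, it can be exercised without a network call.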

How the Voice Engine Works

Two-Lane Hybrid Architecture

Every voice request is classified in under 1ms into one of two lanes:

Fast Lane -- Simple conversational requests (80% of voice traffic)

  • STT (~100ms for English via Deepgram, ~500ms for other languages via Whisper)
  • LLM inference via fastest available model (~50-100ms)
  • TTS (~100-200ms)
  • Total: <300ms end-to-end

Delegation Lane -- Complex multi-step tasks (20% of voice traffic)

  • Immediately generates a filler response ("Let me look that up...")
  • Executes the full workflow in the background via chain execution
  • Returns the real answer when ready
  • User never waits in silence

Lane classification uses Aho-Corasick multi-pattern matching (<1ms). Requests with multi-step keywords ("first...then", "calculate", "search for") or >150 words route to Delegation Lane.
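
The classification rule can be sketched in a few lines. The real engine uses Aho-Corasick multi-pattern matching for sub-millisecond scans; plain substring checks are enough to illustrate the routing decision, and the pattern list here is illustrative, not the production set.

```python
# Illustrative lane classifier — substring checks stand in for the
# engine's Aho-Corasick matcher.
DELEGATION_PATTERNS = ("first", "then", "calculate", "search for")

def classify_lane(transcript: str) -> str:
    text = transcript.lower()
    if len(text.split()) > 150:
        return "DelegationLane"  # long requests always delegate
    if any(p in text for p in DELEGATION_PATTERNS):
        return "DelegationLane"  # multi-step keywords
    return "FastLane"
```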

Intelligent Model Routing

When you send model: "auto" (or set voice_fast_lane_model: "auto" in org settings), the Voice Engine selects the optimal model via BaRP (Bandit-Assisted Routing and Personalisation):

  1. Real-time latency -- EWMA of actual inference times
  2. Cost -- Routes to the cheapest model that meets the latency SLA
  3. Circuit breakers -- Disables models that breach P95 latency thresholds
  4. Per-tenant learning -- LinUCB bandit algorithm learns your traffic patterns
  5. Voice-specific feedback -- Every voice inference submits latency, cost, and outcome back to BaRP, continuously optimizing model selection for sub-300ms response times
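
Steps 1-3 amount to "cheapest model whose tracked latency fits the SLA and whose circuit breaker is closed." A minimal sketch, with illustrative model names and prices (the LinUCB bandit layer is omitted):

```python
def ewma(prev: float, sample: float, alpha: float = 0.2) -> float:
    """Exponentially weighted moving average of observed latency."""
    return alpha * sample + (1 - alpha) * prev

def pick_model(models, sla_ms=300):
    """Cheapest model that meets the latency SLA and is not circuit-broken."""
    qualifying = [m for m in models
                  if m["latency_ms"] <= sla_ms and not m["tripped"]]
    return min(qualifying, key=lambda m: m["cost_per_turn"], default=None)

# Illustrative catalog, not Xantly's live model set or pricing.
models = [
    {"name": "groq/llama-3.3-70b", "latency_ms": 80,
     "cost_per_turn": 0.0004, "tripped": False},
    {"name": "cheap/but-tripped", "latency_ms": 60,
     "cost_per_turn": 0.0002, "tripped": True},   # breaker open: excluded
    {"name": "openai/gpt-4o", "latency_ms": 450,
     "cost_per_turn": 0.06, "tripped": False},    # too slow: excluded
]
```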

Multi-Provider STT

Language       | Provider        | Cost        | Latency
English (en-*) | Deepgram Nova-2 | $0.0043/min | ~100ms
All others     | OpenAI Whisper  | $0.006/min  | ~500ms

Routing is automatic -- English audio goes to Deepgram (30% cheaper, lower latency), everything else goes to Whisper. If Deepgram is unavailable, all traffic falls back to Whisper.
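
The routing rule, including the Whisper fallback, reduces to a one-liner (a sketch with illustrative provider IDs):

```python
def route_stt(language: str, deepgram_up: bool = True) -> str:
    """English goes to Deepgram (cheaper, lower latency); everything else
    to Whisper, which is also the fallback when Deepgram is unavailable."""
    if deepgram_up and language.lower().startswith("en"):
        return "deepgram"
    return "whisper"
```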

Multi-Provider TTS

Provider              | Model                  | Cost            | Best for
OpenAI (default)      | gpt-4o-mini-tts        | $0.015/1K chars | General purpose, streaming
ElevenLabs (override) | eleven_multilingual_v2 | $0.05/1K chars  | Premium quality, voice cloning

Specify voice_provider: "elevenlabs" to use ElevenLabs. Default is OpenAI TTS.

Semantic Caching

The Voice Engine caches at the transcript level with two layers:

Layer                   | Lookup                             | Use case
Exact match (Redis)     | SHA-256 fingerprint                | Identical transcripts
Semantic match (Qdrant) | Cosine similarity + Jaccard safety | Similar transcripts with different wording

Voice transcripts are normalized before caching -- filler words ("um", "uh", "like"), repeated words ("the the"), and other STT artifacts are stripped. This means "um what's uh my balance?" hits the same cache entry as "what's my balance?".
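
The normalization described above can be sketched as a small function (our own illustration; the filler list and tokenization are simplified):

```python
import re

FILLERS = {"um", "uh", "like"}

def normalize_transcript(text: str) -> str:
    """Strip filler words and immediate repeats so similar utterances
    produce the same cache key."""
    words = re.findall(r"[a-z0-9$',.?!-]+", text.lower())
    out = []
    for w in words:
        if w in FILLERS:
            continue          # drop STT filler words
        if out and out[-1] == w:
            continue          # drop "the the"-style repeats
        out.append(w)
    return " ".join(out)
```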


Dedicated Voice API

STT Only: Transcribe audio to text

curl -X POST https://api.xantly.com/v1/voice/transcribe \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "[email protected]" \
  -F "language=en" \
  -F "stt_model=groq/whisper-large-v3-turbo"

Field                | Required | Description
audio                | Yes      | Input audio file (WAV, PCM, MP3, OGG, WebM)
language             | No       | STT language hint, ISO-639-1 (default: en)
stt_model (or model) | No       | STT model slug override (e.g. groq/whisper-large-v3-turbo, openai/whisper-1, deepgram/nova-3). Default: language-based auto-routing

Response:

{
  "text": "What is my account balance?",
  "provider": "deepgram",
  "language": "en",
  "cost_usd": 0.000168
}

TTS Only: Synthesize text to audio

curl -X POST https://api.xantly.com/v1/voice/synthesize \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your account balance is $1,234.56",
    "voice": "alloy",
    "provider": "openai",
    "model": "openai/gpt-4o-mini-tts",
    "output_format": "pcm_16000"
  }' \
  --output response.pcm

Field         | Required | Description
text          | Yes      | Text to synthesize
voice         | No       | Voice profile ID (default: org setting)
provider      | No       | TTS provider (openai, elevenlabs, deepgram, groq, google)
model         | No       | TTS model slug override (e.g. elevenlabs/eleven_flash_v2_5, deepgram/aura-2, openai/tts-1-hd). Default: provider-based auto-routing
output_format | No       | pcm_16000, mulaw, opus, webm

Response: raw audio bytes with headers X-Xantly-TTS-Provider, X-Xantly-TTS-Model, X-Xantly-TTS-Latency-Ms, X-Xantly-Cost-USD.

Full Pipeline: Audio in, audio out

curl -X POST https://api.xantly.com/v1/voice/chat \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "[email protected]" \
  -F "voice=alloy" \
  -F "language=en" \
  -F "session_id=550e8400-e29b-41d4-a716-446655440000" \
  -F "output_format=pcm_16000" \
  -F "stt_model=groq/whisper-large-v3-turbo" \
  -F "tts_model=elevenlabs/eleven_flash_v2_5"

Field                | Required | Description
audio                | Yes      | Input audio file (WAV, PCM, MP3, OGG, WebM)
voice                | No       | TTS voice profile (default: org setting)
language             | No       | STT language hint (default: en)
session_id           | No       | UUID for multi-turn conversation continuity
output_format        | No       | pcm_16000, mulaw, opus, webm (default: pcm_16000)
stt_model (or model) | No       | STT model slug override — dispatches to the specific model instead of auto-routing
tts_model            | No       | TTS model slug override — dispatches to the specific TTS model

Response headers include full pipeline metadata (transcript, response text, per-stage latency, X-Xantly-STT-Model, X-Xantly-TTS-Model, provider used, cost, cache hit).

Single Turn: JSON voice interaction

For server-to-server integrations where you handle audio encoding yourself:

curl -X POST https://api.xantly.com/v1/voice/turn \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {"type": "transcript", "text": "What is my balance?"},
    "session_id": "optional-uuid",
    "voice_profile": "alloy",
    "latency_budget_ms": 300
  }'

Returns JSON with response_text, audio_base64, lane_used, model_used, latency breakdown, and cost.
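
The JSON turn maps cleanly to a small Python helper. The payload builder mirrors the request body documented above; the helper names are ours, and `requests` is a third-party dependency.

```python
def build_turn_payload(text, session_id=None, latency_budget_ms=300):
    """Assemble the /v1/voice/turn request body shown above."""
    payload = {
        "input": {"type": "transcript", "text": text},
        "voice_profile": "alloy",
        "latency_budget_ms": latency_budget_ms,
    }
    if session_id:
        payload["session_id"] = session_id
    return payload

def voice_turn(text, api_key, session_id=None):
    """POST one turn and return the parsed JSON response."""
    import requests  # third-party: pip install requests
    resp = requests.post(
        "https://api.xantly.com/v1/voice/turn",
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_turn_payload(text, session_id),
    )
    resp.raise_for_status()
    return resp.json()
```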

Streaming: WebSocket bidirectional

// Node.js "ws" package — the browser WebSocket API cannot set custom headers
const ws = new WebSocket("wss://api.xantly.com/v1/voice/realtime", {
  headers: { "Authorization": "Bearer " + XANTLY_API_KEY }
});

ws.onopen = () => {
  ws.send(JSON.stringify({
    input: { type: "transcript", text: "What's the weather like?" },
    voice_profile: "nova",
    latency_budget_ms: 300
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  // data.audio_base64 -- play immediately
  // data.response_text -- display as caption
};

S2S Provider Proxying

For direct Speech-to-Speech via native multimodal models, Xantly proxies to upstream providers with automatic session management:

OpenAI Realtime: Set default_provider: "openai_realtime" in org voice settings. Xantly establishes a wss:// connection to OpenAI's Realtime API, sends session configuration, and relays bidirectional audio frames.

Gemini Live: Set default_provider: "gemini_live". Same proxy pattern to Google's BidiGenerateContent WebSocket.

Note: S2S proxy connections bypass Xantly's intelligence pipeline (caching, memory, routing). They are raw provider passthroughs with key management and session setup. Use them when you need native multimodal capabilities that don't exist in the text pipeline.


Voice Sessions

Voice sessions enable multi-turn conversations with context persistence.

Create a session

curl -X POST https://api.xantly.com/v1/voice/sessions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "optional-uuid"}'

Use session in requests

Pass session_id in your /v1/voice/chat or /v1/voice/turn requests. Xantly maintains:

  • Turn history: Last 5 turns cached for context injection
  • Chain linking: Active delegation chains persist across turns
  • Session TTL: 30 minutes of inactivity
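
The "last 5 turns" behavior is easy to model client-side; this is a toy sketch of the per-session state, not Xantly's storage layer.

```python
from collections import deque

class VoiceSession:
    """Toy model of session state: the most recent 5 turns are kept
    for context injection; older turns fall off."""
    MAX_TURNS = 5

    def __init__(self, session_id):
        self.session_id = session_id
        self.turns = deque(maxlen=self.MAX_TURNS)

    def add_turn(self, user_text, assistant_text):
        self.turns.append({"user": user_text, "assistant": assistant_text})

    def context(self):
        return list(self.turns)
```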

Delete a session

curl -X DELETE https://api.xantly.com/v1/voice/sessions/{session_id} \
  -H "Authorization: Bearer $XANTLY_API_KEY"

Voice Cloning

Upload a voice sample to create a custom voice profile backed by ElevenLabs:

curl -X POST https://api.xantly.com/v1/voice/profiles/upload \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F "profile_name=My Custom Voice" \
  -F "file=@voice_sample.wav"

The pipeline:

  1. Audio file stored in Cloudflare R2
  2. Sent to ElevenLabs voice cloning API
  3. Returns a voice_id you can use in any TTS request

Response:

{
  "id": "uuid",
  "profile_name": "My Custom Voice",
  "provider": "elevenlabs",
  "voice_id": "abc123def456",
  "created_at": "2026-03-26T10:00:00Z"
}

Use the voice_id in subsequent requests: "voice": "abc123def456" with "provider": "elevenlabs".

List voice profiles

curl https://api.xantly.com/v1/voice/profiles \
  -H "Authorization: Bearer $XANTLY_API_KEY"

Create a voice profile (without cloning)

curl -X POST https://api.xantly.com/v1/voice/profiles \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"profile_name": "Support Agent", "provider": "openai", "voice_id": "nova"}'

Cost Controls

Per-stage cost breakdown

Every voice response includes a full cost breakdown:

Component        | Typical cost        | Cacheable?
STT (Deepgram)   | $0.0043/min         | No
STT (Whisper)    | $0.006/min          | No
LLM inference    | $0.0001-0.003/turn  | Yes (semantic cache)
TTS (OpenAI)     | $0.015/1K chars     | No
TTS (ElevenLabs) | $0.05/1K chars      | No

On a cache hit, LLM inference cost drops to $0.00.
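
As a back-of-envelope check of the table above, consider a 4-second English utterance with a ~60-character spoken reply (illustrative numbers: Deepgram STT, an assumed Fast Lane model price, OpenAI TTS):

```python
# Illustrative per-turn cost using the rates in the table above.
stt_cost = 0.0043 * (4 / 60)    # Deepgram: 4 seconds of audio, billed per minute
llm_cost = 0.0004               # assumed Fast Lane model price, cache miss
tts_cost = 0.015 * (60 / 1000)  # OpenAI TTS: ~60 chars at $0.015/1K chars
total = stt_cost + llm_cost + tts_cost  # roughly $0.0016

# On a semantic cache hit, the inference component drops to $0.00:
total_cached = stt_cost + tts_cost      # roughly $0.0012
```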

Plan-based voice limits

Voice quotas, rate limits, and concurrent session limits are enforced per plan. Per-org overrides are available via org_settings.

Plan          | Voice RPM | Concurrent sessions | Monthly minutes                | Platform fee / min
Free          | 3         | 1                   | 3 min lifetime (one-time demo) | $0.00
Pro           | 60        | 5                   | 500                            | $0.02
Scale         | 500       | 25                  | 5,000                          | $0.015
Pay-As-You-Go | 30        | 5                   | Unlimited (credit-bounded)     | $0.025

Voice costs are pass-through (provider cost) plus the per-minute platform fee above. Per-org overrides:

Setting                     | Column on org_settings
RPM override                | voice_rpm_limit
Concurrent session override | voice_concurrent_session_limit
Monthly minutes override    | voice_monthly_minutes_limit
Platform fee override       | voice_platform_fee_per_min
Monthly budget cap          | voice_monthly_budget_usd

Quota warnings fire once at 80%, 90%, and 100% per billing period. When the monthly quota or budget cap is exceeded, requests return 429 Too Many Requests. When the concurrent session limit is reached, requests also return 429. When PAYG credit balance falls below the minimum floor ($0.05), voice requests return 402 Payment Required.
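
A client can map these status codes to actions. The sketch below is our own error-handling pattern, not an SDK feature; `requests` is a third-party dependency.

```python
def classify_quota_error(status_code: int) -> str:
    """Map the voice-quota status codes above to a client action."""
    if status_code == 429:
        return "retry"    # monthly quota, budget cap, or concurrent-session limit
    if status_code == 402:
        return "top_up"   # PAYG credit below the $0.05 minimum floor
    return "ok"

def post_with_quota_handling(url, api_key, retries=3, **kwargs):
    """POST with exponential backoff on 429; surface 402 immediately."""
    import time
    import requests  # third-party: pip install requests
    for attempt in range(retries):
        resp = requests.post(
            url, headers={"Authorization": f"Bearer {api_key}"}, **kwargs)
        action = classify_quota_error(resp.status_code)
        if action == "retry":
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
            continue
        if action == "top_up":
            raise RuntimeError("402: PAYG credit below the $0.05 floor")
        resp.raise_for_status()
        return resp
    raise RuntimeError("429: still rate-limited after retries")
```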


Migrating from VAPI / Direct API

  1. Change your base URL to https://api.xantly.com/v1
  2. Set your model to "auto" -- Xantly handles routing
  3. Add the voice header -- x-xantly-voice: true
  4. Remove your model-switching logic -- Xantly does this for you
  5. Remove your caching layer -- Xantly's semantic cache is built in

Before (direct OpenAI)

client = openai.OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript}]
)

After (Xantly)

client = openai.OpenAI(
    api_key="xantly-key",
    base_url="https://api.xantly.com/v1"
)
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": transcript}],
    extra_headers={"x-xantly-voice": "true"}
)

Same SDK. Same code pattern. 80-90% less cost.


Response Headers Reference

Header                        | Description
X-Xantly-Transcript           | STT transcript of input audio (truncated to 500 chars)
X-Xantly-Response-Text        | LLM response text (truncated to 500 chars)
X-Xantly-STT-Provider         | The STT provider that served this request (deepgram, openai, groq, elevenlabs)
X-Xantly-TTS-Provider         | The TTS provider that served this request (openai, elevenlabs, deepgram, groq, google)
X-Xantly-STT-Model            | Specific STT model slug used (e.g. groq/whisper-large-v3-turbo, openai/whisper-1)
X-Xantly-TTS-Model            | Specific TTS model slug used (e.g. elevenlabs/eleven_flash_v2_5, openai/gpt-4o-mini-tts)
X-Xantly-STT-Latency-Ms       | Speech-to-text stage latency
X-Xantly-Inference-Latency-Ms | LLM inference stage latency
X-Xantly-TTS-Latency-Ms       | Text-to-speech stage latency
X-Xantly-Lane-Used            | FastLane or DelegationLane
X-Xantly-Model-Used           | Provider/model that served inference
X-Xantly-Cost-USD             | Total cost for this request (provider pass-through + platform fee)
X-Xantly-Cache-Hit            | true if served from semantic cache

Frequently Asked Questions

How much cheaper is Xantly for voice?

Xantly reduces voice agent costs by 80-90% compared to direct API calls — from $0.06+ per turn with GPT-4o to $0.003-0.01 per turn with auto-routing. Cost savings come from intelligent model selection (routing simple conversational requests to fast, cheap models), semantic caching at the transcript level (eliminating redundant LLM calls), and transcript normalization that strips filler words to maximize cache hit rates.

What voice models does Xantly support?

Xantly supports 30+ voice models across the full pipeline: STT models include Whisper (OpenAI, Groq), Deepgram Nova-2/Nova-3, and ElevenLabs Scribe; TTS models include OpenAI gpt-4o-mini-tts, ElevenLabs (eleven_multilingual_v2, eleven_flash_v2_5), Deepgram Aura-2, and Google Cloud TTS; plus Realtime bidirectional audio (OpenAI Realtime, Gemini Live) and Audio LLMs for native multimodal processing.

What latency can I expect?

Xantly's Fast Lane delivers sub-300ms end-to-end latency for 80% of voice requests: ~100ms STT (Deepgram for English), ~50-100ms LLM inference via the fastest available model, and ~100-200ms TTS. Complex multi-step requests route to the Delegation Lane, which immediately returns a filler response so the user never waits in silence, then executes the full workflow in the background.

Can I use my own voice provider API keys?

Yes. Xantly supports Bring Your Own Key (BYOK) for all voice providers — including OpenAI, Deepgram, ElevenLabs, Groq, and Google. BYOK lets you use your own billing relationship and negotiated rates while still benefiting from Xantly's intelligent routing, semantic caching, and automatic failover across providers.
