Rate Limits
Xantly enforces per-organization rate limits using a distributed sliding window algorithm. Limits are applied per endpoint category and per minute.
How rate limiting works
- Sliding window — enforced per organization per endpoint via Redis atomic Lua scripts
- Dual enforcement for inference — chat completions and embeddings check both RPM (requests per minute) and TPM (tokens per minute) simultaneously to prevent quota burn attacks
- Other endpoints — RPM only
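The sliding-window mechanism above can be illustrated with a minimal in-memory sketch. Production enforcement runs as atomic Lua scripts against Redis; this simplified Python class only demonstrates the counting logic, and the class and key names are illustrative, not Xantly's implementation:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Illustrative in-memory sliding window: allow at most `limit`
    requests per `window_s` seconds, per (org, endpoint) key."""

    def __init__(self, limit, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self.hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Drop timestamps that have slid out of the window.
        while q and now - q[0] >= self.window_s:
            q.popleft()
        if len(q) >= self.limit:
            return False  # would map to HTTP 429
        q.append(now)
        return True

limiter = SlidingWindowLimiter(limit=5, window_s=60.0)
results = [limiter.allow("org_1:/v1/governance", now=float(t)) for t in range(6)]
print(results)  # [True, True, True, True, True, False]
```

Because old timestamps are evicted continuously, a burst at the end of one minute cannot combine with a burst at the start of the next, which is the advantage of sliding windows over fixed ones.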
Default limits by endpoint
| Endpoint category | Default RPM | Notes |
|---|---|---|
| Inference (/v1/chat/completions, /v1/embeddings) | Plan-dependent | Both RPM and TPM enforced |
| Voice (/v1/voice/*) | Plan-dependent | RPM + monthly minutes + concurrent sessions (see below) |
| Governance writes | 5 | PUT/POST/DELETE on /v1/governance |
| Routing writes | 10 | PUT on /v1/routing |
| Formal specs | 50 | /v1/governance/formal-specs |
| Reliability writes | 100 | POST/PUT on /v1/reliability |
Actual limits for your organization are set in your plan and may differ from defaults. Check your dashboard for current limits.
Voice rate limits
Voice endpoints (/v1/voice/transcribe, /v1/voice/synthesize, /v1/voice/chat, /v1/voice/stream, /v1/voice/realtime, /v1/voice/turn) have three independent enforcement dimensions, all evaluated on every request:
- Voice RPM — sliding-window requests-per-minute, identical mechanism to inference RPM but a separate counter
- Monthly audio minutes quota — total voice_input_audio_ms summed across the calendar month
- Concurrent voice sessions — distinct active sessions in the voice:concurrent:{org_id} Redis ZSET
If any one of these is exceeded, the request returns 429 Too Many Requests (or 402 Payment Required for budget/credit exhaustion).
Per-plan defaults
| Plan | Voice RPM | Monthly minutes | Concurrent sessions | Concurrency TTL |
|---|---|---|---|---|
| Free | 3 | 3 minutes (lifetime, one-time demo) | 1 | 10 min |
| Pro | 60 | 500 / month | 5 | 10 min |
| Scale | 500 | 5,000 / month | 25 | 60 min |
| Pay-As-You-Go | 30 | Unlimited (credit-bounded) | 5 | 30 min |
The concurrency TTL is how long an idle voice session remains in the active set before the reaper releases its slot. Long-running streaming sessions on Scale tier (60-min TTL) get heartbeats from the WebSocket handler every turn so they aren't reaped mid-call.
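The reaper behavior can be modeled in a few lines: each active session carries its last-heartbeat timestamp, and entries idle longer than the TTL are pruned before counting. The real tracking lives in the Redis ZSET; this in-memory sketch (with hypothetical names) only illustrates the heartbeat-and-reap cycle:

```python
class ConcurrencyTracker:
    """Model of the voice:concurrent:{org_id} set: session_id -> last heartbeat."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self.sessions = {}  # session_id -> last heartbeat time (seconds)

    def heartbeat(self, session_id, now):
        # The WebSocket handler calls this each turn so long-running
        # streaming calls aren't reaped mid-call.
        self.sessions[session_id] = now

    def reap(self, now):
        # Release slots whose sessions have been idle longer than the TTL.
        stale = [sid for sid, t in self.sessions.items() if now - t >= self.ttl_s]
        for sid in stale:
            del self.sessions[sid]

    def active_count(self, now):
        self.reap(now)
        return len(self.sessions)
```

With a 600-second (10-minute) TTL, a session that last heartbeated 500 seconds ago still holds its slot, while one idle for 700 seconds has been released.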
Per-org overrides
Every per-plan voice limit can be overridden per-org via org_settings:
| Setting | Column | Notes |
|---|---|---|
| Voice RPM | voice_rpm_limit (int) | NULL = use plan default |
| Monthly minutes | voice_monthly_minutes_limit (int) | NULL = use plan default; ignored for Free (uses lifetime) |
| Concurrent sessions | voice_concurrent_session_limit (int) | NULL = use plan default |
| Monthly budget cap | voice_monthly_budget_usd (numeric) | Hard $ ceiling regardless of minute quota |
| Platform fee per minute | voice_platform_fee_per_min (numeric) | NULL = use plan default |
Free tier specifics
Free tier voice minutes are a lifetime quota, not monthly. The first 3 audio minutes are free; after that, voice requests return 429. Tracked via org_settings.voice_free_minutes_used. Upgrade to Pro for 500 minutes/month.
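The lifetime check reduces to a single comparison against the accumulated counter. A trivial sketch (constant and function names are illustrative):

```python
FREE_LIFETIME_MINUTES = 3

def free_tier_allows(voice_free_minutes_used):
    """Free tier: lifetime quota, not monthly.
    True while the org has used fewer than 3 audio minutes, ever."""
    return voice_free_minutes_used < FREE_LIFETIME_MINUTES
```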
Enforcement order
When a voice request arrives, Xantly evaluates the checks in this order. The first failure returns immediately:
- Monthly USD budget cap — same budget:usage:{org}:general:{YYYY-MM} Redis pool as inference. Returns 402 Payment Required.
- Monthly voice minutes quota — voice:audio_mins:{org}:{YYYY-MM} Redis key vs the plan/org limit. Returns 429.
- Free-tier lifetime quota — only on the Free plan; checks voice_free_minutes_used >= 3. Returns 429.
- PAYG credit floor — PAYG accounts must have credit_balance_cents >= 5 ($0.05) to start a request. Returns 402.
- Concurrent session limit — Lua script checks ZCARD against the plan/org limit. Returns 429.
- Voice RPM (sliding window) — middleware-level RPM enforcement. Returns 429.
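The first-failure-wins sequence can be sketched as an ordered list of (condition, status) pairs. This is a simplified Python model, not the actual middleware; the dict fields and function name are assumptions:

```python
def evaluate_voice_request(org):
    """Run the documented enforcement checks in order; return the HTTP
    status of the first failure, or 200 if every check passes.
    `org` is a dict of illustrative per-org state."""
    checks = [
        (org["monthly_usd_spent"] >= org["monthly_usd_cap"], 402),           # budget cap
        (org["minutes_used"] >= org["minutes_limit"], 429),                  # monthly minutes
        (org["plan"] == "free" and org["free_minutes_used"] >= 3, 429),      # free lifetime
        (org["plan"] == "payg" and org["credit_balance_cents"] < 5, 402),    # credit floor
        (org["active_sessions"] >= org["session_limit"], 429),               # concurrency
        (org["rpm_used"] >= org["rpm_limit"], 429),                          # voice RPM
    ]
    for failed, status in checks:
        if failed:
            return status
    return 200
```

Ordering matters: an org that is both over budget and over its RPM limit sees 402, not 429, because the budget check runs first.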
Cost-aware sub-pipeline
Inside the voice pipeline, the LLM stage uses BaRP (the existing routing fairness layer) with a VoiceOnly constraint that prefers models with avg_latency_ms < 300. This sub-pipeline does not consume the voice RPM bucket — only the outer voice request does.
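The VoiceOnly latency preference can be illustrated with a simple candidate filter. Model records and the fallback behavior here are made up for the example; BaRP's actual selection logic is more involved:

```python
def voice_candidates(models, max_latency_ms=300):
    """Prefer models whose average latency fits the voice budget; fall
    back to the single fastest model if none qualify."""
    fast = [m for m in models if m["avg_latency_ms"] < max_latency_ms]
    if fast:
        return fast
    return [min(models, key=lambda m: m["avg_latency_ms"])]
```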
Rate limit headers for voice
Voice endpoints return the same standard headers as other endpoints:
| Header | Voice meaning |
|---|---|
X-RateLimit-Limit | Voice RPM limit (per-org or plan default) |
X-RateLimit-Remaining | Voice requests remaining in the current minute |
X-RateLimit-Reset | Unix timestamp when the voice RPM window resets |
Retry-After | Set on 429, in seconds, when any voice limit is exceeded |
For per-request voice cost + model headers, see Voice Billing and Voice Agents.
Rate limit response headers
Every response includes headers showing your current rate limit status:
| Header | Description |
|---|---|
X-RateLimit-Limit | Maximum requests allowed per minute |
X-RateLimit-Remaining | Requests remaining in the current window |
X-RateLimit-Reset | Unix timestamp when the window resets |
RateLimit-Limit | Same as X-RateLimit-Limit (IETF draft-standard alias) |
RateLimit-Remaining | Same as X-RateLimit-Remaining (IETF draft-standard alias) |
RateLimit-Reset | Same as X-RateLimit-Reset (IETF draft-standard alias) |
When a limit is exceeded, a Retry-After header is also included:
| Header | Description |
|---|---|
Retry-After | Seconds to wait before retrying |
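When a 429 carries Retry-After, honoring the server's value beats guessing a backoff. A minimal helper, assuming the integer-seconds form of the header (the helper name is illustrative):

```python
import time

def wait_for_retry(headers, default_s=1.0):
    """Sleep for the server-specified Retry-After, falling back to a
    default when the header is absent or not a plain number of seconds.
    Returns the number of seconds waited."""
    try:
        delay = float(headers.get("Retry-After", default_s))
    except (TypeError, ValueError):
        delay = default_s  # HTTP-date form or garbage: use the fallback
    time.sleep(delay)
    return delay
```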
Rate limit exceeded — 429 response
{
"error": {
"message": "Rate limit exceeded: 1000 requests per minute",
"type": "rate_limit_error",
"code": "rate_limit_exceeded"
}
}
Handling rate limits
Exponential backoff with jitter (recommended)
import time
import random
import os
from openai import OpenAI
client = OpenAI(
api_key=os.environ["XANTLY_API_KEY"],
base_url="https://api.xantly.com/v1",
)
def chat_with_backoff(messages, max_retries=6):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="auto",
                messages=messages,
            )
        except Exception as e:
            if "429" in str(e) or "rate_limit" in str(e):
                wait = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited. Waiting {wait:.1f}s...")
                time.sleep(wait)
            else:
                raise
    raise Exception("Max retries exceeded")
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.XANTLY_API_KEY,
  baseURL: "https://api.xantly.com/v1",
});

async function chatWithBackoff(messages, maxRetries = 6) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
return await client.chat.completions.create({ model: "auto", messages });
} catch (err) {
if (err?.status === 429) {
const wait = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
console.log(`Rate limited. Waiting ${(wait / 1000).toFixed(1)}s...`);
await new Promise((r) => setTimeout(r, wait));
} else {
throw err;
}
}
}
throw new Error("Max retries exceeded");
}
Inspect headers before retrying
import httpx
response = httpx.post(
"https://api.xantly.com/v1/chat/completions",
headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
json={"model": "auto", "messages": [{"role": "user", "content": "Hello"}]},
)
remaining = int(response.headers.get("X-RateLimit-Remaining", 0))
if remaining < 10:
    print(f"Warning: only {remaining} requests remaining in this window")
Best practices
- Use model: "auto" — the gateway's intelligent routing can reuse cached responses and distribute load across providers, reducing your effective RPM.
- Enable semantic caching (xantly.enable_cache: true, the default) — cache hits don't consume inference quota.
- Batch when possible — combine multiple inputs into a single embeddings request.
- Monitor headers — log X-RateLimit-Remaining in production to catch approaching limits before they hit.
- Use service_tier: "batch" — signals a cost/latency preference that may take advantage of off-peak capacity.
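Batching can be as simple as chunking inputs before calling the embeddings endpoint, so each chunk costs one request instead of one per text. The chunk size and helper name below are illustrative:

```python
def chunked(texts, size=100):
    """Split a list of inputs into embedding-request batches."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]

# Each batch becomes one request, so 250 texts cost 3 requests, not 250:
# for batch in chunked(texts, size=100):
#     client.embeddings.create(model="auto", input=batch)
```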
Next steps
- Billing & Quotas — Token quotas and monthly budget limits
- Chat Completions — Main inference endpoint