Xantly
API Reference

Rate Limits

Xantly enforces per-organization rate limits using a distributed sliding window algorithm. Limits are applied per endpoint category and per minute.


How rate limiting works

  • Sliding window — enforced per organization per endpoint via Redis atomic Lua scripts
  • Dual enforcement for inference — chat completions and embeddings check both RPM (requests per minute) and TPM (tokens per minute) simultaneously to prevent quota burn attacks
  • Other endpoints — RPM only
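
A minimal sketch of the sliding-window check (an in-memory stand-in for the Redis ZSET + Lua implementation; the class and key names are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """In-memory sketch of a per-key sliding-window RPM limiter.

    Production enforcement lives in Redis (a sorted set updated by an
    atomic Lua script); this single-process stand-in shows the logic.
    """

    def __init__(self, limit, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self.hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Evict timestamps that have slid out of the window.
        while q and q[0] <= now - self.window_s:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

limiter = SlidingWindowLimiter(limit=5)
results = [limiter.allow("org_123", now=1.0) for _ in range(6)]
# First five calls are admitted; the sixth in the same window is rejected.
```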

Default limits by endpoint

Endpoint category | Default RPM | Notes
Inference (/v1/chat/completions, /v1/embeddings) | Plan-dependent | Both RPM and TPM enforced
Voice (/v1/voice/*) | Plan-dependent | RPM + monthly minutes + concurrent sessions (see below)
Governance writes | 5 | PUT/POST/DELETE on /v1/governance
Routing writes | 10 | PUT on /v1/routing
Formal specs | 50 | /v1/governance/formal-specs
Reliability writes | 100 | POST/PUT on /v1/reliability

Actual limits for your organization are set in your plan and may differ from defaults. Check your dashboard for current limits.


Voice rate limits

Voice endpoints (/v1/voice/transcribe, /v1/voice/synthesize, /v1/voice/chat, /v1/voice/stream, /v1/voice/realtime, /v1/voice/turn) have three independent enforcement dimensions, all evaluated on every request:

  1. Voice RPM — sliding-window requests-per-minute, identical mechanism to inference RPM but a separate counter
  2. Monthly audio minutes quota — total voice_input_audio_ms summed across the calendar month
  3. Concurrent voice sessions — distinct active sessions in the voice:concurrent:{org_id} Redis ZSET

If any one of these is exceeded, the request returns 429 Too Many Requests (or 402 Payment Required for budget/credit exhaustion).

Per-plan defaults

Plan | Voice RPM | Monthly minutes | Concurrent sessions | Concurrency TTL
Free | 3 | 3 minutes (lifetime, one-time demo) | 1 | 10 min
Pro | 60 | 500 / month | 5 | 10 min
Scale | 500 | 5,000 / month | 25 | 60 min
Pay-As-You-Go | 30 | Unlimited (credit-bounded) | 5 | 30 min

The concurrency TTL is how long an idle voice session remains in the active set before the reaper releases its slot. Long-running streaming sessions on Scale tier (60-min TTL) get heartbeats from the WebSocket handler every turn so they aren't reaped mid-call.
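
The slot tracking and TTL-based reaping described above can be sketched as follows (an in-memory stand-in for the voice:concurrent:{org_id} ZSET; the class and method names are illustrative):

```python
class ConcurrentSessionTracker:
    """Sketch of the concurrent-voice-session limit with TTL reaping."""

    def __init__(self, limit, ttl_s):
        self.limit = limit
        self.ttl_s = ttl_s
        self.active = {}  # session_id -> last heartbeat timestamp

    def _reap(self, now):
        # Drop sessions idle longer than the TTL, freeing their slots.
        expired = [sid for sid, ts in self.active.items() if now - ts > self.ttl_s]
        for sid in expired:
            del self.active[sid]

    def acquire(self, session_id, now):
        self._reap(now)
        if session_id not in self.active and len(self.active) >= self.limit:
            return False  # would exceed the concurrency limit -> 429
        self.active[session_id] = now
        return True

    def heartbeat(self, session_id, now):
        # Called by the WebSocket handler each turn to keep the slot alive.
        if session_id in self.active:
            self.active[session_id] = now
```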

Per-org overrides

Every per-plan voice limit can be overridden per-org via org_settings:

Setting | Column | Notes
Voice RPM | voice_rpm_limit (int) | NULL = use plan default
Monthly minutes | voice_monthly_minutes_limit (int) | NULL = use plan default; ignored for Free (uses lifetime)
Concurrent sessions | voice_concurrent_session_limit (int) | NULL = use plan default
Monthly budget cap | voice_monthly_budget_usd (numeric) | Hard $ ceiling regardless of minute quota
Platform fee per minute | voice_platform_fee_per_min (numeric) | NULL = use plan default

Free tier specifics

Free tier voice minutes are a lifetime quota, not monthly. The first 3 audio minutes are free; after that, voice requests return 429. Tracked via org_settings.voice_free_minutes_used. Upgrade to Pro for 500 minutes/month.

Enforcement order

When a voice request arrives, Xantly evaluates the checks in this order. The first failure returns immediately:

  1. Monthly USD budget cap — same budget:usage:{org}:general:{YYYY-MM} Redis pool as inference. Returns 402 Payment Required.
  2. Monthly voice minutes quota — voice:audio_mins:{org}:{YYYY-MM} Redis key vs the plan/org limit. Returns 429.
  3. Free-tier lifetime quota — only on Free plan, checks voice_free_minutes_used >= 3. Returns 429.
  4. PAYG credit floor — PAYG accounts must have credit_balance_cents >= 5 ($0.05) to start a request. Returns 402.
  5. Concurrent session limit — Lua script checks ZCARD against the plan/org limit. Returns 429.
  6. Voice RPM (sliding window) — middleware-level RPM enforcement. Returns 429.
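
The six checks amount to a short-circuit pipeline. A minimal sketch (the predicate names and org dict are illustrative; the order and status codes mirror the list above):

```python
def evaluate_voice_request(org, checks):
    """Run checks in order; return the first failure's HTTP status, or None."""
    for check, failure_status in checks:
        if not check(org):
            return failure_status
    return None  # all checks passed; admit the request

# Illustrative org state: minutes quota exhausted, everything else fine.
org = {"budget_ok": True, "minutes_ok": False, "free_ok": True,
       "credit_ok": True, "concurrency_ok": True, "rpm_ok": True}

checks = [
    (lambda o: o["budget_ok"], 402),       # 1. monthly USD budget cap
    (lambda o: o["minutes_ok"], 429),      # 2. monthly voice minutes quota
    (lambda o: o["free_ok"], 429),         # 3. free-tier lifetime quota
    (lambda o: o["credit_ok"], 402),       # 4. PAYG credit floor
    (lambda o: o["concurrency_ok"], 429),  # 5. concurrent session limit
    (lambda o: o["rpm_ok"], 429),          # 6. voice RPM sliding window
]

status = evaluate_voice_request(org, checks)
# Minutes quota fails first, so 429 is returned; later checks never run.
```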

Cost-aware sub-pipeline

Inside the voice pipeline, the LLM stage uses BaRP (the existing routing fairness layer) with a VoiceOnly constraint that prefers models with avg_latency_ms < 300. This sub-pipeline does not consume the voice RPM bucket — only the outer voice request does.
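
The latency preference can be sketched as a filter step. Only the avg_latency_ms < 300 threshold comes from the text above; the record shape and the fall-back-to-all behavior are assumptions:

```python
def voice_eligible(models, max_latency_ms=300):
    """Keep models fast enough for voice; fall back to all if none qualify."""
    fast = [m for m in models if m["avg_latency_ms"] < max_latency_ms]
    return fast or models

models = [
    {"name": "model-a", "avg_latency_ms": 180},
    {"name": "model-b", "avg_latency_ms": 450},
]
# voice_eligible(models) keeps only model-a for the voice LLM stage.
```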

Rate limit headers for voice

Voice endpoints return the same standard headers as other endpoints:

Header | Voice meaning
X-RateLimit-Limit | Voice RPM limit (per-org or plan default)
X-RateLimit-Remaining | Voice requests remaining in the current minute
X-RateLimit-Reset | Unix timestamp when the voice RPM window resets
Retry-After | Set on 429, in seconds, when any voice limit is exceeded

For per-request voice cost + model headers, see Voice Billing and Voice Agents.


Rate limit response headers

Every response includes headers showing your current rate limit status:

Header | Description
X-RateLimit-Limit | Maximum requests allowed per minute
X-RateLimit-Remaining | Requests remaining in the current window
X-RateLimit-Reset | Unix timestamp when the window resets
RateLimit-Limit | Same as X-RateLimit-Limit (IETF RateLimit header fields alias)
RateLimit-Remaining | Same as X-RateLimit-Remaining (IETF RateLimit header fields alias)
RateLimit-Reset | Same as X-RateLimit-Reset (IETF RateLimit header fields alias)

When a limit is exceeded, a Retry-After header is also included:

Header | Description
Retry-After | Seconds to wait before retrying
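
A client can honor Retry-After directly rather than guessing a backoff. A sketch with the transport injected so the retry logic stands alone (the helper name is illustrative; any httpx-style response object with .status_code and .headers works):

```python
import time

def retry_with_retry_after(send, max_retries=5, sleep=time.sleep):
    """Call send() until it returns a non-429 response, honoring Retry-After.

    send() returns an object with .status_code and .headers (e.g. an
    httpx.Response); sleep is injectable so the logic is easy to test.
    """
    for _ in range(max_retries):
        response = send()
        if response.status_code != 429:
            return response
        # Fall back to 1 second if the header is missing or malformed.
        try:
            wait = float(response.headers.get("Retry-After", 1))
        except (TypeError, ValueError):
            wait = 1.0
        sleep(wait)
    raise RuntimeError("Max retries exceeded")
```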

Rate limit exceeded — 429 response

{
  "error": {
    "message": "Rate limit exceeded: 1000 requests per minute",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
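
A client can branch on the error code rather than string-matching the exception text (a sketch; the JSON shape is exactly the body shown above):

```python
import json

# The 429 response body shown above.
body = """{
  "error": {
    "message": "Rate limit exceeded: 1000 requests per minute",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}"""

error = json.loads(body)["error"]
is_rate_limited = error["code"] == "rate_limit_exceeded"
# is_rate_limited -> True
```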

Handling rate limits

import time
import random
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XANTLY_API_KEY"],
    base_url="https://api.xantly.com/v1",
)

def chat_with_backoff(messages, max_retries=6):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="auto",
                messages=messages,
            )
        except Exception as e:
            if "429" in str(e) or "rate_limit" in str(e):
                wait = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited. Waiting {wait:.1f}s...")
                time.sleep(wait)
            else:
                raise
    raise Exception("Max retries exceeded")

The same pattern in JavaScript (OpenAI Node SDK):

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.XANTLY_API_KEY,
  baseURL: "https://api.xantly.com/v1",
});

async function chatWithBackoff(messages, maxRetries = 6) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({ model: "auto", messages });
    } catch (err) {
      if (err?.status === 429) {
        const wait = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
        console.log(`Rate limited. Waiting ${(wait / 1000).toFixed(1)}s...`);
        await new Promise((r) => setTimeout(r, wait));
      } else {
        throw err;
      }
    }
  }
  throw new Error("Max retries exceeded");
}

Inspect headers before retrying

import httpx

response = httpx.post(
    "https://api.xantly.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    json={"model": "auto", "messages": [{"role": "user", "content": "Hello"}]},
)

remaining = int(response.headers.get("X-RateLimit-Remaining", 0))
if remaining < 10:
    print(f"Warning: only {remaining} requests remaining in this window")

Best practices

  1. Use model: "auto" — the gateway's intelligent routing can reuse cached responses and distribute load across providers, reducing your effective RPM.
  2. Enable semantic caching (xantly.enable_cache: true, default) — cache hits don't consume inference quota.
  3. Batch when possible — combine multiple inputs into a single embeddings request.
  4. Monitor headers — log X-RateLimit-Remaining in production to catch approaching limits before they hit.
  5. Use service_tier: "batch" — signals a cost/latency preference that may take advantage of off-peak capacity.
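
For point 3, batching only requires chunking your inputs. A sketch (the 100-inputs-per-request cap is illustrative; check your plan for the actual maximum):

```python
def batch(items, size):
    """Split items into batches of at most `size` for batched embedding calls."""
    return [items[i:i + size] for i in range(0, len(items), size)]

texts = [f"doc {i}" for i in range(250)]
# If the API accepts up to 100 inputs per request, 250 texts need only
# 3 requests instead of 250, and consume 3 RPM slots instead of 250.
batches = batch(texts, 100)
```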
