Xantly
API Reference

Rate Limits

Xantly enforces per-organization rate limits using a distributed sliding window algorithm. Limits are applied per endpoint category and per minute.


How rate limiting works

  • Sliding window — enforced per organization per endpoint via Redis atomic Lua scripts
  • Dual enforcement for inference — chat completions and embeddings check both RPM (requests per minute) and TPM (tokens per minute) simultaneously to prevent quota burn attacks
  • Other endpoints — RPM only
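
A minimal sketch of the sliding-window check (an in-memory stand-in for the Redis ZSET + Lua implementation; the class and key names are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """In-memory sketch of a per-key sliding-window RPM limiter.

    Production enforcement lives in Redis (a sorted set updated by an
    atomic Lua script); this single-process stand-in shows the logic.
    """

    def __init__(self, limit, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self.hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Evict timestamps that have slid out of the window.
        while q and q[0] <= now - self.window_s:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

limiter = SlidingWindowLimiter(limit=5)
results = [limiter.allow("org_123", now=1.0) for _ in range(6)]
# First five calls are admitted; the sixth in the same window is rejected.
```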

Default limits by endpoint

Endpoint category | Default RPM | Notes
Inference (/v1/chat/completions, /v1/embeddings) | Plan-dependent | Both RPM and TPM enforced
Voice (/v1/voice/*) | Plan-dependent | RPM + monthly minutes + concurrent sessions (see below)
Governance writes | 5 | PUT/POST/DELETE on /v1/governance
Routing writes | 10 | PUT on /v1/routing
Formal specs | 50 | /v1/governance/formal-specs
Reliability writes | 100 | POST/PUT on /v1/reliability

Actual limits for your organization are set in your plan and may differ from defaults. Check your dashboard for current limits.


Voice rate limits

Voice endpoints (/v1/voice/transcribe, /v1/voice/synthesize, /v1/voice/chat, /v1/voice/stream, /v1/voice/realtime, /v1/voice/turn) have three independent enforcement dimensions, all evaluated on every request:

  1. Voice RPM — sliding-window requests-per-minute, identical mechanism to inference RPM but a separate counter
  2. Monthly audio minutes quota — total voice_input_audio_ms summed across the calendar month
  3. Concurrent voice sessions — distinct active sessions in the voice:concurrent:{org_id} Redis ZSET

If any one of these is exceeded, the request returns 429 Too Many Requests (or 402 Payment Required for budget/credit exhaustion).

Per-plan defaults

Plan | Voice RPM | Monthly minutes | Concurrent sessions | Concurrency TTL
Free | 3 | 3 minutes (lifetime, one-time demo) | 1 | 10 min
Pro | 60 | 500 / month | 5 | 10 min
Scale | 500 | 5,000 / month | 25 | 60 min
Pay-As-You-Go | 30 | Unlimited (credit-bounded) | 5 | 30 min

The concurrency TTL is how long an idle voice session remains in the active set before the reaper releases its slot. Long-running streaming sessions on Scale tier (60-min TTL) get heartbeats from the WebSocket handler every turn so they aren't reaped mid-call.
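
The slot tracking and TTL-based reaping described above can be sketched as follows (an in-memory stand-in for the voice:concurrent:{org_id} ZSET; the class and method names are illustrative):

```python
class ConcurrentSessionTracker:
    """Sketch of the concurrent-voice-session limit with TTL reaping."""

    def __init__(self, limit, ttl_s):
        self.limit = limit
        self.ttl_s = ttl_s
        self.active = {}  # session_id -> last heartbeat timestamp

    def _reap(self, now):
        # Drop sessions idle longer than the TTL, freeing their slots.
        expired = [sid for sid, ts in self.active.items() if now - ts > self.ttl_s]
        for sid in expired:
            del self.active[sid]

    def acquire(self, session_id, now):
        self._reap(now)
        if session_id not in self.active and len(self.active) >= self.limit:
            return False  # would exceed the concurrency limit -> 429
        self.active[session_id] = now
        return True

    def heartbeat(self, session_id, now):
        # Called by the WebSocket handler each turn to keep the slot alive.
        if session_id in self.active:
            self.active[session_id] = now
```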

Per-org overrides

Every per-plan voice limit can be overridden per-org via org_settings:

Setting | Column | Notes
Voice RPM | voice_rpm_limit (int) | NULL = use plan default
Monthly minutes | voice_monthly_minutes_limit (int) | NULL = use plan default; ignored for Free (uses lifetime)
Concurrent sessions | voice_concurrent_session_limit (int) | NULL = use plan default
Monthly budget cap | voice_monthly_budget_usd (numeric) | Hard $ ceiling regardless of minute quota
Platform fee per minute | voice_platform_fee_per_min (numeric) | NULL = use plan default

Free tier specifics

Free tier voice minutes are a lifetime quota, not monthly. The first 3 audio minutes are free; after that, voice requests return 429. Tracked via org_settings.voice_free_minutes_used. Upgrade to Pro for 500 minutes/month.

Enforcement order

When a voice request arrives, Xantly evaluates the checks in this order. The first failure returns immediately:

  1. Monthly USD budget cap — same budget:usage:{org}:general:{YYYY-MM} Redis pool as inference. Returns 402 Payment Required.
  2. Monthly voice minutes quota — voice:audio_mins:{org}:{YYYY-MM} Redis key vs the plan/org limit. Returns 429.
  3. Free-tier lifetime quota — only on Free plan, checks voice_free_minutes_used >= 3. Returns 429.
  4. PAYG credit floor — PAYG accounts must have credit_balance_cents >= 5 ($0.05) to start a request. Returns 402.
  5. Concurrent session limit — Lua script checks ZCARD against the plan/org limit. Returns 429.
  6. Voice RPM (sliding window) — middleware-level RPM enforcement. Returns 429.
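
The six checks amount to a short-circuit pipeline. A minimal sketch (the predicate names and org dict are illustrative; the order and status codes mirror the list above):

```python
def evaluate_voice_request(org, checks):
    """Run checks in order; return the first failure's HTTP status, or None."""
    for check, failure_status in checks:
        if not check(org):
            return failure_status
    return None  # all checks passed; admit the request

# Illustrative org state: minutes quota exhausted, everything else fine.
org = {"budget_ok": True, "minutes_ok": False, "free_ok": True,
       "credit_ok": True, "concurrency_ok": True, "rpm_ok": True}

checks = [
    (lambda o: o["budget_ok"], 402),       # 1. monthly USD budget cap
    (lambda o: o["minutes_ok"], 429),      # 2. monthly voice minutes quota
    (lambda o: o["free_ok"], 429),         # 3. free-tier lifetime quota
    (lambda o: o["credit_ok"], 402),       # 4. PAYG credit floor
    (lambda o: o["concurrency_ok"], 429),  # 5. concurrent session limit
    (lambda o: o["rpm_ok"], 429),          # 6. voice RPM sliding window
]

status = evaluate_voice_request(org, checks)
# Minutes quota fails first, so 429 is returned; later checks never run.
```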

Cost-aware sub-pipeline

Inside the voice pipeline, the LLM stage uses BaRP (the existing routing fairness layer) with a VoiceOnly constraint that prefers models with avg_latency_ms < 300. This sub-pipeline does not consume the voice RPM bucket — only the outer voice request does.
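
The latency preference can be sketched as a filter step. Only the avg_latency_ms < 300 threshold comes from the text above; the record shape and the fall-back-to-all behavior are assumptions:

```python
def voice_eligible(models, max_latency_ms=300):
    """Keep models fast enough for voice; fall back to all if none qualify."""
    fast = [m for m in models if m["avg_latency_ms"] < max_latency_ms]
    return fast or models

models = [
    {"name": "model-a", "avg_latency_ms": 180},
    {"name": "model-b", "avg_latency_ms": 450},
]
# voice_eligible(models) keeps only model-a for the voice LLM stage.
```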

Rate limit headers for voice

Voice endpoints return the same standard headers as other endpoints:

Header | Voice meaning
X-RateLimit-Limit | Voice RPM limit (per-org or plan default)
X-RateLimit-Remaining | Voice requests remaining in the current minute
X-RateLimit-Reset | Unix timestamp when the voice RPM window resets
Retry-After | Set on 429, in seconds, when any voice limit is exceeded

For per-request voice cost + model headers, see Voice Billing and Voice Agents.


Rate limit response headers

Every response includes headers showing your current rate limit status:

Header | Description
X-RateLimit-Limit | Maximum requests allowed per minute
X-RateLimit-Remaining | Requests remaining in the current window
X-RateLimit-Reset | Unix timestamp when the window resets
RateLimit-Limit | Same as X-RateLimit-Limit (IETF RateLimit header fields alias)
RateLimit-Remaining | Same as X-RateLimit-Remaining (IETF RateLimit header fields alias)
RateLimit-Reset | Same as X-RateLimit-Reset (IETF RateLimit header fields alias)

When a limit is exceeded, a Retry-After header is also included:

Header | Description
Retry-After | Seconds to wait before retrying
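
A client can honor Retry-After directly rather than guessing a backoff. A sketch with the transport injected so the retry logic stands alone (the helper name is illustrative; any httpx-style response object with .status_code and .headers works):

```python
import time

def retry_with_retry_after(send, max_retries=5, sleep=time.sleep):
    """Call send() until it returns a non-429 response, honoring Retry-After.

    send() returns an object with .status_code and .headers (e.g. an
    httpx.Response); sleep is injectable so the logic is easy to test.
    """
    for _ in range(max_retries):
        response = send()
        if response.status_code != 429:
            return response
        # Fall back to 1 second if the header is missing or malformed.
        try:
            wait = float(response.headers.get("Retry-After", 1))
        except (TypeError, ValueError):
            wait = 1.0
        sleep(wait)
    raise RuntimeError("Max retries exceeded")
```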

Rate limit exceeded — 429 response

{
  "error": {
    "message": "Rate limit exceeded: 1000 requests per minute",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
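
A client can branch on the error code rather than string-matching the exception text (a sketch; the JSON shape is exactly the body shown above):

```python
import json

# The 429 response body shown above.
body = """{
  "error": {
    "message": "Rate limit exceeded: 1000 requests per minute",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}"""

error = json.loads(body)["error"]
is_rate_limited = error["code"] == "rate_limit_exceeded"
# is_rate_limited -> True
```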

Handling rate limits

import time
import random
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["XANTLY_API_KEY"],
    base_url="https://api.xantly.com/v1",
)

def chat_with_backoff(messages, max_retries=6):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="auto",
                messages=messages,
            )
        except Exception as e:
            if "429" in str(e) or "rate_limit" in str(e):
                wait = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited. Waiting {wait:.1f}s...")
                time.sleep(wait)
            else:
                raise
    raise Exception("Max retries exceeded")

The same pattern in JavaScript (OpenAI Node SDK):

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.XANTLY_API_KEY,
  baseURL: "https://api.xantly.com/v1",
});

async function chatWithBackoff(messages, maxRetries = 6) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({ model: "auto", messages });
    } catch (err) {
      if (err?.status === 429) {
        const wait = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
        console.log(`Rate limited. Waiting ${(wait / 1000).toFixed(1)}s...`);
        await new Promise((r) => setTimeout(r, wait));
      } else {
        throw err;
      }
    }
  }
  throw new Error("Max retries exceeded");
}

Inspect headers before retrying

import httpx

response = httpx.post(
    "https://api.xantly.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    json={"model": "auto", "messages": [{"role": "user", "content": "Hello"}]},
)

remaining = int(response.headers.get("X-RateLimit-Remaining", 0))
if remaining < 10:
    print(f"Warning: only {remaining} requests remaining in this window")

Best practices

  1. Use model: "auto" — the gateway's intelligent routing can reuse cached responses and distribute load across providers, reducing your effective RPM.
  2. Enable semantic caching (xantly.enable_cache: true, default) — cache hits don't consume inference quota.
  3. Batch when possible — combine multiple inputs into a single embeddings request.
  4. Monitor headers — log X-RateLimit-Remaining in production to catch approaching limits before they hit.
  5. Use service_tier: "batch" — signals a cost/latency preference that may take advantage of off-peak capacity.
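
For point 3, batching only requires chunking your inputs. A sketch (the 100-inputs-per-request cap is illustrative; check your plan for the actual maximum):

```python
def batch(items, size):
    """Split items into batches of at most `size` for batched embedding calls."""
    return [items[i:i + size] for i in range(0, len(items), size)]

texts = [f"doc {i}" for i in range(250)]
# If the API accepts up to 100 inputs per request, 250 texts need only
# 3 requests instead of 250, and consume 3 RPM slots instead of 250.
batches = batch(texts, 100)
```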
