# Xantly — AI Infrastructure Layer

## Product Definition

Xantly is a universal AI infrastructure layer that sits between applications and LLM providers. It routes requests to 10,000+ models across 15 providers (including OpenAI, Anthropic, Google, Groq, DeepSeek, and NVIDIA, plus open-weight models via OpenRouter) with a median added latency of 12ms. The platform provides intelligent routing, multi-layer semantic caching (62% hit rate with sub-5ms responses), persistent per-organization memory, and automatic waterfall failover across providers.

Xantly is 100% OpenAI SDK compatible. Applications change only their `base_url` to `api.xantly.com/v1` — zero other code changes required. All standard OpenAI SDK parameters, streaming, and tool use work identically.

The platform is built in Rust for sub-millisecond internal latency. It serves production workloads with capacity for 3.2 billion inferences per second and 99.999% routing reliability. Customers typically achieve 40-80% cost reduction through intelligent routing and caching combined.

Xantly is designed for teams building AI-native applications, multi-agent systems, voice agents, RAG pipelines, and any production LLM workload that needs reliability, cost control, and observability.

---

## Architecture Overview

### Request Lifecycle (6 Stages)

Every API request flows through six stages:

1. **Authenticate** — Validate the API key, resolve the organization, check rate limits (RPM/TPM), and verify budget allowance. Response headers include `x-xantly-org-id`.
2. **Cache Check** — Two-layer cache lookup. The exact-match cache returns in under 5ms at zero token cost. The semantic cache matches similar requests (different wording, same intent) with a configurable similarity threshold. Cache type is reported via the `x-xantly-cache-hit` and `x-xantly-cache-type` (exact/semantic) headers.
3. **Intelligent Routing** — Task classification analyzes the request across 15 parameters.
The routing engine selects the optimal model based on task complexity (trivial/standard/complex/expert), latency requirements, cost constraints, and provider health. Three routing tiers: Speed (100-300ms), Value (300-800ms), Quality (800-3000ms). Two execution lanes: FastLane (turbo) and Smart Lane. Response headers: `x-xantly-tier-used`, `x-xantly-lane-used`.
4. **Provider Call** — The request is dispatched to the selected provider. Waterfall fallback activates on provider errors and automatically retries with the next best model. Circuit breakers monitor provider health. NVIDIA/OpenRouter API keys rotate on billing exhaustion (HTTP 402). Response headers: `x-xantly-provider`, `x-xantly-model`.
5. **Response Delivery** — Real-time SSE streaming with metadata injected into response headers. Cost per request is visible via `x-xantly-cost-usd`; latency is reported via `x-xantly-latency-ms`.
6. **Async Learning** — Background knowledge extraction, entity recognition, and pattern learning from conversations. Memory is stored per-organization for cross-session context assembly.

### Routing Modes

- `auto` — Engine selects the optimal model automatically (default)
- `fast` — Prioritize lowest latency
- `balanced` — Balance cost and quality
- `quality` — Prioritize highest quality output
- `cost_optimized` — Route to the cheapest qualifying model
- `free_models_only` — Use only free/open-weight models

### Preference Dial

The `preference_dial` parameter (0.0-1.0) provides fine-grained control: 0.0 = pure speed, 1.0 = pure quality. Default: 0.5.

---

## API Reference

### Base URL

```
https://api.xantly.com/v1
```

### Authentication

```
Authorization: Bearer YOUR_XANTLY_API_KEY
```

### Chat Completions — POST /v1/chat/completions

The primary endpoint. 100% OpenAI-compatible with additional Xantly parameters.
```bash
curl https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is quantum computing?"}],
    "stream": true
  }'
```

Response includes standard OpenAI format plus Xantly headers:

```
x-xantly-tier-used: T2
x-xantly-lane-used: smart
x-xantly-provider: anthropic
x-xantly-model: claude-sonnet-4-20250514
x-xantly-cost-usd: 0.00042
x-xantly-latency-ms: 312
x-xantly-cache-hit: false
```

Xantly-specific parameters:

- `routing_hints.preference_dial` (0.0-1.0) — cost/quality tradeoff
- `routing_hints.max_latency_ms` — latency ceiling
- `routing_hints.task_complexity` — trivial/standard/complex/expert
- `routing_override.mode` — fast/balanced/quality/cost_optimized/free_models_only
- `routing_override.provider` — force a specific provider
- `routing_override.model` — force a specific model

### Embeddings — POST /v1/embeddings

```bash
curl https://api.xantly.com/v1/embeddings \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "text-embedding-3-small", "input": "Search query"}'
```

### Audio — POST /v1/audio/transcriptions

Speech-to-text via Whisper:

```bash
curl https://api.xantly.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -F file=@audio.mp3 \
  -F model=whisper-1
```

### Audio — POST /v1/audio/speech

Text-to-speech:

```bash
curl https://api.xantly.com/v1/audio/speech \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello world", "voice": "alloy"}' \
  --output speech.mp3
```

### Models — GET /v1/models

List all available models in the catalog:

```bash
curl https://api.xantly.com/v1/models \
  -H "Authorization: Bearer $XANTLY_API_KEY"
```

### Memory — POST /v1/memory/search

Search organizational memory:

```bash
curl https://api.xantly.com/v1/memory/search \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "previous discussion about deployment"}'
```

### Responses API — POST /v1/responses

Modern OpenAI Responses API (for newer SDK versions):

```bash
curl https://api.xantly.com/v1/responses \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "input": "Explain caching"}'
```

---

## Pricing

| Plan | Price | Included Budget | Models | Voice Minutes | Support |
|------|-------|-----------------|--------|---------------|---------|
| Free | $0 | Limited | Core models | Limited lifetime | Community |
| Pay-as-you-go | Usage-based | Top-up credits ($5-$100) | All models | Pay-per-use | Community |
| Pro | $49/mo | Included monthly budget | All models | Included monthly | Email |
| Scale | Custom | High throughput | All models + BYOK | Included | Dedicated |
| Enterprise | Custom | Custom | All models + BYOK | Custom | Dedicated + SLA |

All plans include intelligent routing, semantic caching, and cost optimization. Cache hits are free — no tokens consumed. Voice minutes are billed separately from text tokens. BYOK (Bring Your Own Key) is available on Scale and Enterprise plans for AWS, GCP, and NVIDIA credentials.
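Because per-request cost is surfaced in the `x-xantly-cost-usd` header and cache hits consume no tokens, spend can be reconciled client-side from response headers alone. A minimal sketch (header names as documented above; the `responses` list is illustrative test data, not live traffic):

```python
def summarize_spend(header_sets):
    """Aggregate Xantly response headers into a spend summary.

    Each element is the headers dict from one response; cache hits
    are flagged via x-xantly-cache-hit and cost nothing.
    """
    total_usd = 0.0
    cache_hits = 0
    for headers in header_sets:
        if headers.get("x-xantly-cache-hit") == "true":
            cache_hits += 1
            continue  # cache hits are free
        total_usd += float(headers.get("x-xantly-cost-usd", "0"))
    return {"requests": len(header_sets), "cache_hits": cache_hits,
            "total_usd": round(total_usd, 6)}

# Illustrative header sets (first entry mirrors the example response above).
responses = [
    {"x-xantly-cost-usd": "0.00042", "x-xantly-cache-hit": "false"},
    {"x-xantly-cost-usd": "0", "x-xantly-cache-hit": "true"},
    {"x-xantly-cost-usd": "0.00030", "x-xantly-cache-hit": "false"},
]
print(summarize_spend(responses))
```

In production the header dicts would come from actual HTTP responses; the aggregation logic is the same.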
---

## Competitive Comparison

| Feature | Xantly | OpenRouter | Helicone | Portkey | LiteLLM |
|---------|--------|------------|----------|---------|---------|
| Intelligent routing | Yes (auto + 5 modes) | Basic | No | Yes | Basic |
| Semantic caching | Yes (62% hit rate) | No | No | Yes | No |
| Persistent memory | Yes (per-org) | No | No | No | No |
| Voice AI support | Yes (30+ models) | No | No | No | No |
| BYOK | Yes (AWS/GCP/NVIDIA) | No | No | Yes | Yes |
| Waterfall failover | Yes (automatic) | Basic | No | Yes | Yes |
| Cost visibility | Per-request headers | Basic | Yes (logging) | Yes | Basic |
| Full observability | Mission Control | No | Yes | Yes | Basic |
| Built-in learning | Yes (BaRP engine) | No | No | No | No |
| OpenAI SDK compatible | 100% | 100% | N/A (proxy) | 100% | 100% |
| Open source | No | No | Yes | Partial | Yes |
| Pricing | Usage-based + plans | Usage-based | Usage-based | Usage-based + plans | Free (self-hosted) |

### When to choose Xantly

Choose Xantly when you need an all-in-one AI infrastructure layer that combines routing, caching, memory, and observability. Xantly is the only platform offering persistent memory across sessions, semantic caching with 62% hit rates, and built-in voice AI support. Best for production workloads where cost optimization, reliability, and multi-provider failover are critical.

---

## Glossary

**AI Gateway** — A unified API layer that sits between applications and multiple LLM providers, providing routing, caching, failover, and observability. Xantly is an AI gateway.

**Intelligent Routing** — The process of automatically selecting the optimal AI model for each request based on task complexity, latency requirements, cost constraints, and provider health. Xantly's routing engine analyzes requests across 15 parameters.

**Semantic Caching** — A caching mechanism that matches requests by meaning rather than exact text. Two requests with different wording but the same intent can share a cached response.
Xantly achieves a 62% semantic cache hit rate with sub-5ms response times.

**Waterfall Fallback** — An automatic failover mechanism that retries failed requests with the next best model/provider. Xantly's waterfall evaluates each error (rate limit, billing, context window) and routes to the optimal fallback.

**BYOK (Bring Your Own Key)** — A feature allowing customers to use their own cloud provider API keys (AWS Bedrock, GCP Vertex AI, NVIDIA NIM) through the Xantly gateway, maintaining all routing and caching benefits.

**FastLane** — Xantly's low-latency execution path for requests that match cached responses or simple routing decisions. Responses in the FastLane typically return in under 5ms.

**Smart Lane** — Xantly's full-analysis execution path for requests requiring model selection, context assembly, and provider negotiation. Adds approximately 12ms median overhead.

**Mission Control** — Xantly's enterprise observability dashboard providing full-span tracing, cost attribution per request, anomaly detection, provider health monitoring, and memory health scoring.

**BaRP (Bayesian adaptive Routing and Preference)** — Xantly's continuous learning engine that improves routing decisions, caching strategies, and memory retrieval based on usage patterns.

**Preference Dial** — A 0.0-1.0 parameter controlling the cost/quality tradeoff for routing decisions. 0.0 maximizes speed/cost savings; 1.0 maximizes output quality.

**Intelligence Modes** — Xantly's system for controlling reasoning effort via the `reasoning_effort` parameter (low/medium/high), affecting model selection and token budget allocation.

**Provider Ecosystem** — The set of LLM providers accessible through Xantly: OpenAI, Anthropic, Google (Gemini), Groq, DeepSeek, NVIDIA NIM, and 200+ open-weight models via OpenRouter.

**Task Classification** — Xantly's automatic analysis of request complexity (trivial/standard/complex/expert) used to select the appropriate routing tier and model class.
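Several of the knobs defined above (`preference_dial`, `task_complexity`, `reasoning_effort`) travel in the request body alongside standard OpenAI fields. A minimal payload-builder sketch; the field names come from the API reference above, while the helper itself and its validation are illustrative, not part of any SDK:

```python
def build_request(prompt, preference_dial=0.5, task_complexity=None,
                  reasoning_effort=None):
    """Assemble a Xantly chat-completions payload with routing hints."""
    if not 0.0 <= preference_dial <= 1.0:
        raise ValueError("preference_dial must be between 0.0 and 1.0")
    hints = {"preference_dial": preference_dial}
    if task_complexity is not None:
        # Task Classification levels, as defined in the glossary
        if task_complexity not in {"trivial", "standard", "complex", "expert"}:
            raise ValueError("unknown task_complexity")
        hints["task_complexity"] = task_complexity
    payload = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
        "routing_hints": hints,
    }
    if reasoning_effort is not None:  # Intelligence Modes: low/medium/high
        payload["reasoning_effort"] = reasoning_effort
    return payload

req = build_request("Summarize this RFC", preference_dial=0.8,
                    task_complexity="complex", reasoning_effort="high")
```

The resulting dict can be sent as the JSON body of `POST /v1/chat/completions`.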
**Context Window Escalation** — Automatic retry with a larger-context model when the original model's context window is exceeded. Part of the waterfall fallback system.

**Routing Tiers** — Three tiers organizing models by latency: Speed (100-300ms, e.g., Groq Llama), Value (300-800ms, e.g., GPT-4o-mini), Quality (800-3000ms, e.g., Claude Opus, GPT-4).

---

## Integrations with Coding Tools

Xantly is a drop-in OpenAI-compatible endpoint. Any coding tool that lets the user configure a custom base URL (or Anthropic base URL) can route through Xantly and inherit routing, caching, memory, and waterfall fallback.

### Claude Code (Anthropic CLI)

Two environment variables:

```
export ANTHROPIC_BASE_URL=https://api.xantly.com
export ANTHROPIC_AUTH_TOKEN=xantly_sk_...
```

Xantly exposes `POST /v1/messages` with full Anthropic Messages API compatibility: system prompt, content blocks, tool definitions (`input_schema`), tool_use / tool_result blocks, and streaming with strict `message_start` / `content_block_start` / `content_block_delta` / `content_block_stop` / `message_delta` / `message_stop` event order. Prompt caching (`anthropic-beta: prompt-caching-2024-07-31`) is forwarded. `X-Claude-Code-Session-Id` is preserved for audit + memory scoping.

### GitHub Copilot CLI (zero-auth BYOK, April 7, 2026 release)

Four environment variables — no GitHub subscription, no enterprise admin:

```
export COPILOT_PROVIDER_BASE_URL="https://api.xantly.com/v1"
export COPILOT_PROVIDER_API_KEY="xantly_sk_..."
export COPILOT_MODEL="xantly/auto-quality"
export COPILOT_PROVIDER_TYPE="openai"
```

Works with `gh copilot suggest`, `gh copilot explain`, and the autonomous Copilot CLI agent modes.

### OpenCode (sst/opencode)

`~/.config/opencode/opencode.jsonc`:

```jsonc
{
  "provider": {
    "xantly": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://api.xantly.com/v1",
        "apiKey": "xantly_sk_..."
      },
      "models": {
        "xantly/auto-quality": { "name": "Xantly Quality" },
        "anthropic/claude-sonnet-4.6": { "name": "Claude Sonnet 4.6" }
      }
    }
  },
  "model": "xantly/auto-quality"
}
```

### Cline (VS Code)

Cline settings → API Provider: **OpenAI Compatible** → Base URL: `https://api.xantly.com/v1`, API Key: `xantly_sk_...`, Model ID: `xantly/auto-quality`. Full tool-calling, agent mode, and the plan/act split all work.

### Continue.dev (per-role model config — best "mix and match")

`~/.continue/config.yaml`:

```yaml
models:
  - name: Xantly Quality
    provider: openai
    model: xantly/auto-quality
    apiBase: https://api.xantly.com/v1
    apiKey: xantly_sk_...
    roles: [chat, edit, apply]
  - name: Local Qwen
    provider: ollama
    model: qwen2.5-coder:7b
    roles: [autocomplete]
```

### Roo Code / Kilo Code (Cline forks)

Same config pattern as Cline. Roo Code adds per-mode model assignments (Code / Architect / Reviewer / Debugger). Kilo Code is the #1 IDE extension on OpenRouter by token volume.

### Aider (CLI)

One flag:

```
aider --openai-api-base https://api.xantly.com/v1 \
      --openai-api-key xantly_sk_... \
      --model xantly/auto-quality
```

Role split for maximum cost savings: `--model` (main) = auto-quality, `--editor-model` = auto-value, `--weak-model` = auto-speed. Typical 60-70% cost reduction vs. running everything on the main model.

### Cursor (chat-only — partial support)

Cursor's BYOK is limited to the Chat panel. Agent, Edit, and Tab completions are blocked by Cursor's proprietary fine-tunes. For full BYOK coverage, use Cline + Xantly in VS Code instead. See https://xantly.com/docs/migrate-from-cursor.

### Windsurf / Antigravity — not supported

Windsurf's BYOK is limited to specific Claude models via Cognition's proxy; no custom base URL. Google Antigravity is Gemini-only and uses a non-HTTPS wire protocol. Both are dead ends for Xantly integration.

---

## Model ID Naming Scheme

Xantly uses OpenRouter-compatible `provider/model` IDs and its own `xantly/auto-*` routing aliases.
**Honored exactly (zero re-routing):**

- `anthropic/claude-sonnet-4.6`, `anthropic/claude-opus-4.6`
- `openai/gpt-5.4`, `openai/gpt-5.4-turbo`, `openai/o1`, `openai/o3`
- `groq/llama-3.3-70b`, `groq/llama-3.1-8b-instant`
- `deepseek/deepseek-chat`, `deepseek/deepseek-reasoner`
- `nvidia/qwen3.5-397b`
- `google/gemini-2.5-pro`, `google/gemini-2.5-flash`
- `sambanova/llama-3.3-70b`, `cerebras/qwen-235b`
- 10,000+ more (see `GET /v1/models` for the live list)

**Routing aliases (user opts in to BaRP):**

- `xantly/auto` — BaRP picks the best model across all tiers
- `xantly/auto-quality` — BaRP picks from the T1 Quality pool (Claude, GPT-4/5, larger Llamas)
- `xantly/auto-value` — BaRP picks from the T2 Value pool (balanced, default choice)
- `xantly/auto-speed` — BaRP picks from the T3 Speed pool (Groq, fast DeepSeek)
- `xantly/auto-safety` — BaRP picks from the SafetyCritical pool

**Waterfall fallback:** runs automatically on provider errors (overload, timeout, billing exhaustion). For explicit `provider/model` IDs, fallback stays within the same model family. For `xantly/auto-*` aliases, fallback can span the full tier pool.
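The fallback rule above can be sketched as a retry chain: explicit `provider/model` IDs fall back only within their family, while `xantly/auto-*` aliases walk the whole tier pool. A toy sketch; the pools, the `ProviderError` class, and the `flaky_call` stub are illustrative stand-ins, not the real routing tables:

```python
# Illustrative fallback pools; the real pools come from the live catalog.
FAMILY_FALLBACKS = {
    "anthropic/claude-opus-4.6": ["anthropic/claude-sonnet-4.6"],
}
TIER_POOLS = {
    "xantly/auto-speed": ["groq/llama-3.3-70b", "groq/llama-3.1-8b-instant",
                          "deepseek/deepseek-chat"],
}

class ProviderError(Exception):
    """Stand-in for overload / timeout / billing-exhaustion errors."""

def waterfall(model_id, call):
    """Try the requested model, then walk its fallback chain."""
    if model_id in TIER_POOLS:              # auto alias: full tier pool
        chain = TIER_POOLS[model_id]
    else:                                   # explicit ID: same family only
        chain = [model_id] + FAMILY_FALLBACKS.get(model_id, [])
    last_err = None
    for candidate in chain:
        try:
            return candidate, call(candidate)
        except ProviderError as err:
            last_err = err                  # a circuit breaker would record this
    raise last_err

# Simulate the first pool member being overloaded.
def flaky_call(model):
    if model == "groq/llama-3.3-70b":
        raise ProviderError("overloaded")
    return "ok"

used, result = waterfall("xantly/auto-speed", flaky_call)
```

Here the overloaded first candidate is skipped and the next pool member serves the request, which is the externally visible behavior of the waterfall.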
---

## Endpoints Catalog

- `POST https://api.xantly.com/v1/chat/completions` — OpenAI Chat Completions (full compat)
- `POST https://api.xantly.com/v1/messages` — Anthropic Messages API (full compat, byte-accurate SSE events)
- `POST https://api.xantly.com/v1/completions` — Legacy OpenAI text completions
- `POST https://api.xantly.com/v1/embeddings` — Embeddings across providers
- `POST https://api.xantly.com/v1/audio/transcriptions` — Whisper, Deepgram Nova, Groq Whisper
- `POST https://api.xantly.com/v1/audio/translations` — Whisper translation
- `POST https://api.xantly.com/v1/audio/speech` — TTS across OpenAI/ElevenLabs/Deepgram/Groq
- `POST https://api.xantly.com/v1/moderations` — OpenAI moderation with BYOK support
- `POST https://api.xantly.com/v1/images/generations` — DALL-E + alternatives
- `GET https://api.xantly.com/v1/models` — Live catalog, includes xantly/auto-* aliases
- `POST https://api.xantly.com/v1/responses` — OpenAI Responses API (newer SDKs)
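Every endpoint in the catalog shares the same shape: base URL, bearer token, JSON body. A stdlib-only sketch that assembles (but does not send) a chat-completions request, so the wire format can be inspected; the API key value is a placeholder:

```python
import json
import urllib.request

BASE_URL = "https://api.xantly.com/v1"

def make_chat_request(api_key, body):
    """Build an authenticated POST to /chat/completions without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = make_chat_request("xantly_sk_example", {
    "model": "auto",
    "messages": [{"role": "user", "content": "ping"}],
})
# urllib.request.urlopen(req) would dispatch it for real.
```

Swapping the path suffix (`/embeddings`, `/moderations`, `/responses`, ...) and body covers the other JSON endpoints in the catalog.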