Xantly
The universal AI gateway that makes every LLM call faster, cheaper, and more reliable — without changing your code.
Change one line — your `base_url` — and get intelligent routing, caching, memory, and cost optimization across every major LLM provider. No new SDK. No vendor lock-in.
What is Xantly?
Xantly is an AI infrastructure layer that sits between your application and LLM providers. Every request flows through four integrated engines — Routing, Caching, Memory, and Learning — that work together to deliver the right model, at the right cost, with the right context, every time.
You keep using the OpenAI SDK you already know. Xantly handles everything else: model selection, provider failover, response caching, cross-session memory, and continuous optimization. One API. Every provider. Zero code changes.
Architecture
Your application sends a standard OpenAI-format request. The gateway authenticates, checks cache, routes intelligently, calls the optimal provider, streams the response back, and learns from every interaction — all transparently.
The Four Pillars
Intelligent Routing
Not every request needs a frontier model. Xantly's routing engine analyzes each request — task type, complexity, context length, required capabilities — and selects the optimal model automatically. Simple queries route to fast, cost-efficient models. Complex reasoning routes to frontier models. The result: same quality output at 40-70% lower cost.
Three routing tiers handle the full spectrum:
| Tier | Latency | Cost | Best for |
|---|---|---|---|
| Speed | 100-300ms | Lowest | Chat, classification, simple Q&A |
| Value | 300-800ms | Moderate | Code generation, summarization, extraction |
| Quality | 800-3000ms | Premium | Complex reasoning, analysis, multi-step tasks |
Use `model: "auto"` and let the engine decide — or specify a model directly, apply `routing_hints`, or select a preset. Full control when you want it, automatic when you don't.
Learn more about Intelligent Routing
Multi-Layer Caching
Xantly eliminates redundant LLM calls with two types of cache hits:
- Exact match — identical request returns in under 5ms at zero token cost. Catches retry loops, page refreshes, and duplicate triggers.
- Semantic match — similar intent, different wording returns in under 20ms at zero token cost. Catches FAQ variations, rephrased questions, and near-duplicate patterns.
For production workloads with repetitive patterns — customer support, code assistants, data pipelines — caching delivers 40-70% cost reduction automatically. Response headers (`x-xantly-cache-hit`, `x-xantly-cache-type`) tell you exactly what happened.
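A small sketch of interpreting those headers on the client side. The header names follow the text above; the exact value formats (`"true"`, `"exact"`, `"semantic"`) are assumptions for illustration.

```python
# Assumed value formats: "true"/"false" for the hit flag,
# "exact"/"semantic" for the cache type.

def cache_status(headers: dict) -> str:
    """Classify a response as an exact hit, semantic hit, or miss."""
    hit = headers.get("x-xantly-cache-hit", "false").lower() == "true"
    if not hit:
        return "miss"  # full provider round-trip, normal token cost
    kind = headers.get("x-xantly-cache-type", "exact")
    return f"{kind} hit"  # served from cache at zero token cost

# The openai SDK exposes raw headers via with_raw_response:
#   raw = client.chat.completions.with_raw_response.create(model="auto", messages=[...])
#   print(cache_status(dict(raw.headers)))
```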
Learn more about Caching & Performance
Persistent Memory
Every conversation enriches a per-organization knowledge base. Xantly automatically detects sessions, extracts knowledge (entities, facts, domain patterns), and assembles relevant context for future requests — without you sending full conversation history every time.
This means:
- Fewer tokens per request — the gateway curates only the context that matters
- Cross-session continuity — your AI remembers what it learned yesterday
- Domain adaptation — the longer you use Xantly, the more precisely it understands your domain
- Smarter routing — better context enables cheaper models to handle tasks that previously required expensive ones
Memory is automatic and zero-configuration. When you want explicit control, HTTP headers and the Memory API let you scope, search, and manage stored knowledge.
Learn more about Memory & Context
Learning Engine
Every response feeds back into the system. The routing engine learns which models perform best for your specific workloads. Cache patterns adapt to your traffic. Memory extractions become more precise over time. Xantly doesn't just serve requests — it gets measurably better at serving your requests with continued use.
Request Lifecycle
Every call to /v1/chat/completions flows through six stages:
- Authenticate — API key verification, org mapping, rate limits, and budget enforcement
- Cache check — exact and semantic matching against recent responses
- Route — task classification, model selection, provider assignment
- Call provider — request execution with automatic retries, failover, and waterfall fallback
- Respond — real-time streaming with full metadata (model, cost, latency, cache status)
- Learn — async knowledge extraction, memory storage, and routing feedback (never blocks your response)
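The six stages above can be sketched as a pipeline — this is a conceptual stub, not the gateway's implementation, with the provider call and async learning step mocked out:

```python
# Conceptual sketch of the request lifecycle; all stage logic is stubbed.

def handle_request(request: dict, cache: dict, log: list) -> dict:
    key = str(sorted(request.items()))
    log.append("authenticate")            # 1. key check, org mapping, budgets
    if key in cache:                      # 2. cache check short-circuits the rest
        log.append("cache-hit")
        return cache[key]
    log.append("route")                   # 3. classify task, pick model/provider
    response = {"content": f"reply to {request['content']}"}  # 4. provider call (stubbed)
    log.append("respond")                 # 5. stream back with metadata
    cache[key] = response
    log.append("learn")                   # 6. async feedback (synchronous here)
    return response
```

The key property mirrored here is that a cache hit returns before routing or a provider call ever happens, while the learn stage runs after the response is already on its way back.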
Provider Ecosystem
Xantly routes across every major LLM provider. Use Xantly-managed keys or bring your own (BYOK) for any provider.
| Provider | Example Models | Strengths |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1 | General purpose, tool calling, vision |
| Anthropic | Claude Sonnet, Haiku | Long context, reasoning, safety |
| Google | Gemini Flash, Pro | Speed, multimodal, large context |
| Groq | Llama 3.3 70B | Ultra-low latency inference |
| DeepSeek | DeepSeek V3, R1 | Cost efficiency, code, math |
| Open-weight | Via OpenRouter | Maximum flexibility, 200+ models |
Use model: "auto" and the routing engine handles selection. Or specify any model directly — the gateway normalizes the provider interface so your code stays the same regardless of which model serves the request.
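Because the interface is normalized, switching providers is just a different model string — the payload shape never changes. The model identifiers below are illustrative shorthand; use the names from the table above or the model catalog.

```python
# Model strings here are illustrative; the payload structure is the point.

def payload(model: str, prompt: str) -> dict:
    """Same OpenAI-format body regardless of which provider serves it."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

auto_routed = payload("auto", "Explain CRDTs")          # engine chooses
pinned = payload("claude-sonnet", "Explain CRDTs")      # pin a specific model
# Both go to the same POST /v1/chat/completions endpoint.
```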
What Makes Xantly Different
- Zero code changes — swap `base_url`, keep your OpenAI SDK code. Streaming, tool calling, structured output, function calling, vision — everything works.
- Synergistic engines — routing, caching, memory, and learning aren't isolated features. They reinforce each other: memory improves routing, caching accelerates repeat patterns, learning optimizes all three over time.
- Provider-agnostic — one API across every provider. No lock-in. Switch models or providers with zero application changes.
- Full observability — `x-xantly-*` response headers expose the tier used, provider, cost, latency, cache status, and routing metadata on every request.
- Production-grade resilience — automatic retries, circuit breakers, waterfall fallback across providers, and multi-key rotation ensure your AI never goes down because a single provider does.
- Cost transparency — per-request cost tracking via `x-xantly-cost-usd` headers and dashboard analytics. Know exactly what you're spending and why.
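A small sketch of client-side spend tracking built on the `x-xantly-cost-usd` header named above — reading the header value as a plain decimal string is an assumption about its format.

```python
# Assumes x-xantly-cost-usd carries a plain decimal string like "0.0021".

class CostTracker:
    """Sum gateway-reported USD cost across requests."""

    def __init__(self) -> None:
        self.total_usd = 0.0
        self.requests = 0

    def record(self, headers: dict) -> None:
        cost = headers.get("x-xantly-cost-usd")
        if cost is not None:
            self.total_usd += float(cost)
            self.requests += 1

# With the openai SDK, headers come from with_raw_response:
#   raw = client.chat.completions.with_raw_response.create(model="auto", messages=[...])
#   tracker.record(dict(raw.headers))
```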
Get Started in 60 Seconds
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xantly.com/v1",  # only change
    api_key="your-xantly-api-key"
)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
```

```typescript
import OpenAI from "openai"

const client = new OpenAI({
  baseURL: "https://api.xantly.com/v1", // only change
  apiKey: "your-xantly-api-key"
})

const response = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Hello!" }]
})

console.log(response.choices[0].message.content)
```

```bash
curl https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer your-xantly-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Hello!"}]}'
```

What's Next
- Quickstart — API key setup and your first request in under 5 minutes
- Authentication — Bearer tokens, API key scopes, and security best practices
- Platform Overview — Deep dive into the request lifecycle and provider ecosystem
- Intelligent Routing — How model selection works and how to control it
- Caching & Performance — Multi-layer caching architecture
- Memory & Context — Persistent memory and context assembly
- Chat Completions — Full API reference with all parameters
- Cost-Optimized Routing — Tune cost, latency, and quality tradeoffs
Frequently Asked Questions
What is Xantly?
Xantly is an AI infrastructure layer that routes requests to 10,000+ model variants across 15 providers with a median added latency of 12ms. It includes intelligent routing, semantic caching (62% hit rate with sub-5ms responses), persistent per-organization memory, and automatic waterfall failover — reducing LLM API costs by up to 80% with zero code changes.
How is Xantly different from using OpenAI directly?
When you call OpenAI directly, you get one model from one provider with no fallback, no caching, and no memory. Xantly adds intelligent routing that automatically selects the cheapest model capable of handling each request, semantic caching that eliminates redundant calls (62% hit rate), persistent memory that reduces token usage over time, and waterfall failover across 15 providers — all for a median overhead of just 12ms.
Do I need to change my code to use Xantly?
No. Xantly is 100% compatible with the OpenAI SDK — streaming, tool calling, structured output, function calling, and vision all work unchanged. You swap your `base_url` to `https://api.xantly.com/v1`, replace your API key, and every existing call works immediately with routing, caching, and memory enabled automatically.
What providers does Xantly support?
Xantly routes across OpenAI, Anthropic, Google (Gemini), Groq, DeepSeek, NVIDIA, and 200+ open-weight models via OpenRouter — totaling 10,000+ model variants across 15 providers. You can use Xantly-managed API keys or bring your own (BYOK) for any provider, and the gateway normalizes every provider's interface so your code stays identical regardless of which model serves the request.