Xantly
The universal AI gateway that makes every LLM call faster, cheaper, and more reliable — without changing your code.
Change one line — your `base_url` — and get intelligent routing, caching, memory, and cost optimization across every major LLM provider. No new SDK. No vendor lock-in.
What is Xantly?
Xantly is an AI infrastructure layer that sits between your application and LLM providers. Every request flows through four integrated engines — Routing, Caching, Memory, and Learning — that work together to deliver the right model, at the right cost, with the right context, every time.
You keep using the OpenAI SDK you already know. Xantly handles everything else: model selection, provider failover, response caching, cross-session memory, and continuous optimization. One API. Every provider. Zero code changes.
Architecture
Your application sends a standard OpenAI-format request. The gateway authenticates, checks cache, routes intelligently, calls the optimal provider, streams the response back, and learns from every interaction — all transparently.
The Four Pillars
Intelligent Routing
Not every request needs a frontier model. Xantly's routing engine analyzes each request — task type, complexity, context length, required capabilities — and selects the optimal model automatically. Simple queries route to fast, cost-efficient models. Complex reasoning routes to frontier models. The result: same quality output at 40-70% lower cost.
Three routing tiers handle the full spectrum:
| Tier | Latency | Cost | Best for |
|---|---|---|---|
| Speed | 100-300ms | Lowest | Chat, classification, simple Q&A |
| Value | 300-800ms | Moderate | Code generation, summarization, extraction |
| Quality | 800-3000ms | Premium | Complex reasoning, analysis, multi-step tasks |
Use `model: "auto"` and let the engine decide — or specify a model directly, apply `routing_hints`, or select a preset. Full control when you want it, automatic when you don't.
Learn more about Intelligent Routing
Multi-Layer Caching
Xantly eliminates redundant LLM calls with two types of cache hits:
- Exact match — identical request returns in under 5ms at zero token cost. Catches retry loops, page refreshes, and duplicate triggers.
- Semantic match — similar intent, different wording returns in under 20ms at zero token cost. Catches FAQ variations, rephrased questions, and near-duplicate patterns.
For production workloads with repetitive patterns — customer support, code assistants, data pipelines — caching delivers 40-70% cost reduction automatically. Response headers (`x-xantly-cache-hit`, `x-xantly-cache-type`) tell you exactly what happened.
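A small sketch of interpreting those headers on the client side. The header names follow the text above; the exact value formats (`"true"`, `"exact"`, `"semantic"`) are assumptions for illustration.

```python
# Assumed value formats: "true"/"false" for the hit flag,
# "exact"/"semantic" for the cache type.

def cache_status(headers: dict) -> str:
    """Classify a response as an exact hit, semantic hit, or miss."""
    hit = headers.get("x-xantly-cache-hit", "false").lower() == "true"
    if not hit:
        return "miss"  # full provider round-trip, normal token cost
    kind = headers.get("x-xantly-cache-type", "exact")
    return f"{kind} hit"  # served from cache at zero token cost

# The openai SDK exposes raw headers via with_raw_response:
#   raw = client.chat.completions.with_raw_response.create(model="auto", messages=[...])
#   print(cache_status(dict(raw.headers)))
```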
Learn more about Caching & Performance
Persistent Memory
Every conversation enriches a per-organization knowledge base. Xantly automatically detects sessions, extracts knowledge (entities, facts, domain patterns), and assembles relevant context for future requests — without you sending full conversation history every time.
This means:
- Fewer tokens per request — the gateway curates only the context that matters
- Cross-session continuity — your AI remembers what it learned yesterday
- Domain adaptation — the longer you use Xantly, the more precisely it understands your domain
- Smarter routing — better context enables cheaper models to handle tasks that previously required expensive ones
Memory is automatic and zero-configuration. When you want explicit control, HTTP headers and the Memory API let you scope, search, and manage stored knowledge.
Learn more about Memory & Context
Learning Engine
Every response feeds back into the system. The routing engine learns which models perform best for your specific workloads. Cache patterns adapt to your traffic. Memory extractions become more precise over time. Xantly doesn't just serve requests — it gets measurably better at serving your requests with continued use.
Request Lifecycle
Every call to /v1/chat/completions flows through six stages:
- Authenticate — API key verification, org mapping, rate limits, and budget enforcement
- Cache check — exact and semantic matching against recent responses
- Route — task classification, model selection, provider assignment
- Call provider — request execution with automatic retries, failover, and waterfall fallback
- Respond — real-time streaming with full metadata (model, cost, latency, cache status)
- Learn — async knowledge extraction, memory storage, and routing feedback (never blocks your response)
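The six stages above can be sketched as a pipeline — this is a conceptual stub, not the gateway's implementation, with the provider call and async learning step mocked out:

```python
# Conceptual sketch of the request lifecycle; all stage logic is stubbed.

def handle_request(request: dict, cache: dict, log: list) -> dict:
    key = str(sorted(request.items()))
    log.append("authenticate")            # 1. key check, org mapping, budgets
    if key in cache:                      # 2. cache check short-circuits the rest
        log.append("cache-hit")
        return cache[key]
    log.append("route")                   # 3. classify task, pick model/provider
    response = {"content": f"reply to {request['content']}"}  # 4. provider call (stubbed)
    log.append("respond")                 # 5. stream back with metadata
    cache[key] = response
    log.append("learn")                   # 6. async feedback (synchronous here)
    return response
```

The key property mirrored here is that a cache hit returns before routing or a provider call ever happens, while the learn stage runs after the response is already on its way back.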
Provider Ecosystem
Xantly routes across every major LLM provider. Use Xantly-managed keys or bring your own (BYOK) for any provider.
| Provider | Example Models | Strengths |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1 | General purpose, tool calling, vision |
| Anthropic | Claude Sonnet, Haiku | Long context, reasoning, safety |
| Google | Gemini Flash, Pro | Speed, multimodal, large context |
| Groq | Llama 3.3 70B | Ultra-low latency inference |
| DeepSeek | DeepSeek V3, R1 | Cost efficiency, code, math |
| Open-weight | Via OpenRouter | Maximum flexibility, 200+ models |
Use model: "auto" and the routing engine handles selection. Or specify any model directly — the gateway normalizes the provider interface so your code stays the same regardless of which model serves the request.
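Because the interface is normalized, switching providers is just a different model string — the payload shape never changes. The model identifiers below are illustrative shorthand; use the names from the table above or the model catalog.

```python
# Model strings here are illustrative; the payload structure is the point.

def payload(model: str, prompt: str) -> dict:
    """Same OpenAI-format body regardless of which provider serves it."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

auto_routed = payload("auto", "Explain CRDTs")          # engine chooses
pinned = payload("claude-sonnet", "Explain CRDTs")      # pin a specific model
# Both go to the same POST /v1/chat/completions endpoint.
```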
What Makes Xantly Different
- Zero code changes — swap `base_url`, keep your OpenAI SDK code. Streaming, tool calling, structured output, function calling, vision — everything works.
- Synergistic engines — routing, caching, memory, and learning aren't isolated features. They reinforce each other: memory improves routing, caching accelerates repeat patterns, learning optimizes all three over time.
- Provider-agnostic — one API across every provider. No lock-in. Switch models or providers with zero application changes.
- Full observability — `x-xantly-*` response headers expose the tier used, provider, cost, latency, cache status, and routing metadata on every request.
- Production-grade resilience — automatic retries, circuit breakers, waterfall fallback across providers, and multi-key rotation ensure your AI never goes down because a single provider does.
- Cost transparency — per-request cost tracking via `x-xantly-cost-usd` headers and dashboard analytics. Know exactly what you're spending and why.
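A small sketch of client-side spend tracking built on the `x-xantly-cost-usd` header named above — reading the header value as a plain decimal string is an assumption about its format.

```python
# Assumes x-xantly-cost-usd carries a plain decimal string like "0.0021".

class CostTracker:
    """Sum gateway-reported USD cost across requests."""

    def __init__(self) -> None:
        self.total_usd = 0.0
        self.requests = 0

    def record(self, headers: dict) -> None:
        cost = headers.get("x-xantly-cost-usd")
        if cost is not None:
            self.total_usd += float(cost)
            self.requests += 1

# With the openai SDK, headers come from with_raw_response:
#   raw = client.chat.completions.with_raw_response.create(model="auto", messages=[...])
#   tracker.record(dict(raw.headers))
```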
Get Started in 60 Seconds
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.xantly.com/v1",  # only change
    api_key="your-xantly-api-key"
)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
```

```typescript
import OpenAI from "openai"

const client = new OpenAI({
  baseURL: "https://api.xantly.com/v1", // only change
  apiKey: "your-xantly-api-key"
})

const response = await client.chat.completions.create({
  model: "auto",
  messages: [{ role: "user", content: "Hello!" }]
})

console.log(response.choices[0].message.content)
```

```bash
curl https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer your-xantly-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Hello!"}]}'
```

What's Next
- Quickstart — API key setup and your first request in under 5 minutes
- Authentication — Bearer tokens, API key scopes, and security best practices
- Platform Overview — Deep dive into the request lifecycle and provider ecosystem
- Intelligent Routing — How model selection works and how to control it
- Caching & Performance — Multi-layer caching architecture
- Memory & Context — Persistent memory and context assembly
- Chat Completions — Full API reference with all parameters
- Cost-Optimized Routing — Tune cost, latency, and quality tradeoffs
Frequently Asked Questions
What is Xantly?
Xantly is an AI infrastructure layer that routes requests to 10,000+ model variants across 15 providers with a median added latency of 12ms. It includes intelligent routing, semantic caching (62% hit rate with sub-5ms responses), persistent per-organization memory, and automatic waterfall failover — reducing LLM API costs by up to 80% with zero code changes.
How is Xantly different from using OpenAI directly?
When you call OpenAI directly, you get one model from one provider with no fallback, no caching, and no memory. Xantly adds intelligent routing that automatically selects the cheapest model capable of handling each request, semantic caching that eliminates redundant calls (62% hit rate), persistent memory that reduces token usage over time, and waterfall failover across 15 providers — all for a median overhead of just 12ms.
Do I need to change my code to use Xantly?
No. Xantly is 100% compatible with the OpenAI SDK — streaming, tool calling, structured output, function calling, and vision all work unchanged. You swap your `base_url` to `https://api.xantly.com/v1`, replace your API key, and every existing call works immediately with routing, caching, and memory enabled automatically.
What providers does Xantly support?
Xantly routes across OpenAI, Anthropic, Google (Gemini), Groq, DeepSeek, NVIDIA, and 200+ open-weight models via OpenRouter — totaling 10,000+ model variants across 15 providers. You can use Xantly-managed API keys or bring your own (BYOK) for any provider, and the gateway normalizes every provider's interface so your code stays identical regardless of which model serves the request.