Intelligent Routing
Xantly's routing engine automatically selects the optimal model for each request based on task complexity, latency requirements, cost constraints, and historical performance. No configuration required — but full control when you want it.
Why routing matters
LLMs vary dramatically in cost, speed, and capability:
| | Speed Tier | Value Tier | Quality Tier |
|---|---|---|---|
| Latency | 100-300ms | 300-800ms | 800-3000ms |
| Cost | $0.10-0.50/M tokens | $0.50-3.00/M tokens | $3.00-15.00/M tokens |
| Best for | Simple queries, chat, classification | Code, summarization, extraction | Complex reasoning, analysis |
Sending every request to GPT-4o wastes money on simple tasks. Sending everything to a cheap model degrades quality on hard ones. Intelligent routing solves this by matching each request to the right tier automatically.
How auto-routing works
When you send model: "auto", the gateway runs a three-step process:
Step 1: Task analysis
The request is classified by type and complexity:
- Task type: Code generation, factual QA, creative writing, structured extraction, tool calling, SQL, reasoning, or general-purpose
- Complexity signals: Message length, conversation depth, schema requirements, tool definitions, required capabilities
This classification happens in under 2ms and never adds perceptible latency.
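The real classifier is internal to the gateway, but the signals above can be sketched as a simple heuristic. Everything here is illustrative — the function and field names (`classify_request`, `Classification`) are hypothetical, and the thresholds are made up for the example:

```python
from dataclasses import dataclass

@dataclass
class Classification:
    task_type: str
    complexity: str

def classify_request(messages, tools=None, response_schema=None):
    """Rough complexity heuristic from the signals listed above:
    message length, conversation depth, tool definitions, schema requirements."""
    text = " ".join(m.get("content", "") for m in messages)
    score = 0
    score += len(text) // 500             # longer prompts -> harder
    score += len(messages) // 4           # deeper conversations -> harder
    score += 2 if tools else 0            # tool definitions add complexity
    score += 2 if response_schema else 0  # structured output adds complexity
    if score >= 4:
        complexity = "complex"
    elif score >= 2:
        complexity = "standard"
    else:
        complexity = "trivial"
    task_type = "tool_calling" if tools else "general"
    return Classification(task_type, complexity)
```

Because it is a handful of length checks and lookups rather than a model call, a classifier of this shape can stay well under the 2ms budget.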
Step 2: Model selection
A machine learning model evaluates the classified request against the available model pool. It considers:
- Task-model fit: Which models perform best for this task type
- Latency budget: How fast the response needs to be
- Cost efficiency: The cheapest model that meets quality requirements
- Memory context: Whether cached context reduces the need for a frontier model
- Provider health: Current availability and latency of each provider
The selection model improves continuously — it learns from every request which model performs best for which task type and tenant workload.
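To make the selection criteria concrete, here is a hand-tuned sketch of tier selection as a weighted score. The real selection model is learned, not a fixed formula like this; candidate fields (`quality`, `cost_per_m`, `p50_latency_ms`, `healthy`) are assumptions for the example, and the cost normalizer (15, the top of the quality-tier cost band) is just a convenient scale:

```python
def select_model(candidates, complexity, max_latency_ms=None, dial=0.5):
    """candidates: dicts with quality (per-complexity fit), cost_per_m,
    p50_latency_ms, healthy. dial: lower favors cost, higher favors quality."""
    def score(m):
        if not m["healthy"]:
            return float("-inf")          # provider health gate
        if max_latency_ms is not None and m["p50_latency_ms"] > max_latency_ms:
            return float("-inf")          # latency budget gate
        fit = m["quality"][complexity]    # task-model fit for this complexity
        cost = m["cost_per_m"] / 15.0     # normalize to the $15/M top of the range
        return dial * fit - (1 - dial) * cost
    return max(candidates, key=score)
```

With a low dial and a trivial task, a cheap model wins; with a high dial and a complex task, the frontier model's quality advantage outweighs its cost.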
Step 3: Execution with fallback
The selected model receives the request. If the primary provider fails or times out, the gateway automatically falls back to the next best option. This waterfall mechanism ensures high availability without any client-side retry logic.
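The waterfall can be sketched as a loop over a ranked model list, falling through on timeout or error. This is an illustration of the pattern, not the gateway's implementation; `call_model` is a hypothetical provider-call function:

```python
def execute_with_fallback(request, ranked_models, call_model):
    """Try each model in ranked order; fall through on transient failures."""
    errors = []
    for model in ranked_models:
        try:
            return call_model(model, request)   # primary first, then fallbacks
        except (TimeoutError, ConnectionError) as exc:
            errors.append((model, exc))         # record and continue down the list
    raise RuntimeError(f"all providers failed: {errors}")
```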
Request → Primary model → ✓ Response
→ ✗ Timeout → Fallback model → ✓ Response
        → ✗ Error → Second fallback → ✓ Response

Routing tiers
Every model in the catalog is assigned to a tier:
Speed tier
Fast, cost-efficient models for straightforward tasks. Typical latency: 100-300ms.
- Best for: chat, classification, simple factual questions, translation
- When auto-routing selects this: low complexity, no tool calls, short messages
Value tier
Balanced models that handle most production workloads. Typical latency: 300-800ms.
- Best for: code generation, summarization, data extraction, moderate reasoning
- When auto-routing selects this: medium complexity, standard tool calls, typical business queries
Quality tier
Frontier models for the hardest tasks. Typical latency: 800-3000ms.
- Best for: multi-step reasoning, complex analysis, creative writing, long-context synthesis
- When auto-routing selects this: high complexity, deep reasoning required, schema-heavy structured output
Controlling routing
Auto-routing works out of the box, but you can influence or override it.
Presets (simplest)
{
"model": "auto",
"routing_hints": { "mode": "cost_optimized" }
}

| Mode | Effect |
|---|---|
| fast | Bias toward speed tier |
| balanced | Let the engine decide (default) |
| quality | Bias toward quality tier |
| cost_optimized | Minimize cost aggressively |
Fine-grained hints
{
"model": "auto",
"routing_hints": {
"preference_dial": 0.2,
"max_latency_ms": 500,
"task_complexity": "standard"
}
}

- preference_dial (0.0-1.0): Lower = cheaper/faster, higher = quality. Default: 0.5.
- max_latency_ms: Soft latency budget. Values under 500ms strongly bias toward speed tier.
- task_complexity: trivial, standard, complex, or expert. Overrides automatic complexity detection.
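If you build request bodies in code, a small helper keeps the hints consistent. The field names (model, routing_hints, preference_dial, max_latency_ms, task_complexity) follow this page; the validation rules are illustrative, not part of the API contract:

```python
VALID_COMPLEXITY = {"trivial", "standard", "complex", "expert"}

def build_request(messages, dial=0.5, max_latency_ms=None, task_complexity=None):
    """Assemble an auto-routed request body with optional routing hints."""
    if not 0.0 <= dial <= 1.0:
        raise ValueError("preference_dial must be in [0.0, 1.0]")
    if task_complexity is not None and task_complexity not in VALID_COMPLEXITY:
        raise ValueError(f"task_complexity must be one of {sorted(VALID_COMPLEXITY)}")
    hints = {"preference_dial": dial}
    if max_latency_ms is not None:
        hints["max_latency_ms"] = max_latency_ms
    if task_complexity is not None:
        hints["task_complexity"] = task_complexity
    return {"model": "auto", "routing_hints": hints, "messages": messages}
```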
Direct model selection
Skip routing entirely by specifying a model:
{
"model": "openai/gpt-4o",
"messages": [...]
}

This bypasses the routing engine and sends directly to the specified model.
Continuous learning
The routing engine improves with every request:
- Each response generates a feedback signal — latency, cost, success/failure, and quality indicators
- The model selection algorithm updates online — no batch retraining, no downtime
- Per-tenant adaptation — the system learns your specific workload patterns
- Drift detection — if a model's performance changes, the router adapts automatically
After processing a few hundred requests, the routing engine is typically 30-50% more cost-efficient than static model selection — while maintaining equivalent or better output quality.
Response headers
Every response includes routing metadata:
| Header | Description |
|---|---|
x-xantly-tier-used | T1 (quality), T2 (value), or T3 (speed) |
x-xantly-lane-used | turbo or smart |
x-xantly-provider | Which provider served the request |
x-xantly-model | Exact model identifier |
x-xantly-cost-usd | Cost of this specific request |
x-xantly-latency-ms | End-to-end latency |
Use these headers to monitor routing decisions and tune your hints.
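For example, a monitoring hook might extract the headers above into a structured record. This is a minimal sketch that treats the response headers as a plain dict; only the header names come from the table:

```python
def routing_metadata(headers):
    """Pull Xantly routing headers into a record for logging/monitoring."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    return {
        "tier": h.get("x-xantly-tier-used"),
        "lane": h.get("x-xantly-lane-used"),
        "provider": h.get("x-xantly-provider"),
        "model": h.get("x-xantly-model"),
        "cost_usd": float(h.get("x-xantly-cost-usd", 0.0)),
        "latency_ms": int(h.get("x-xantly-latency-ms", 0)),
    }
```

Aggregating these records over time shows which tiers your traffic actually lands on, which is the feedback loop for tuning preference_dial and max_latency_ms.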
What's next
- Caching & Performance — How caching eliminates redundant LLM calls
- Cost-Optimized Routing Guide — Advanced routing patterns and controls
- Memory & Context — How memory makes routing smarter over time
Frequently Asked Questions
How does intelligent routing work?
When you send model: "auto", the routing engine analyzes 15 request parameters — including task type, complexity, message length, conversation depth, schema requirements, tool definitions, and required capabilities — classifies the request in under 2ms, and selects the optimal model from the available pool based on task-model fit, latency budget, cost efficiency, memory context, and real-time provider health.
What routing modes are available?
Xantly offers six routing modes: auto (the engine decides based on request analysis), fast (biases toward speed-tier models with 100-300ms latency), balanced (default, lets the engine weigh all factors), quality (biases toward frontier models for complex reasoning), cost_optimized (minimizes cost aggressively), and free_models_only (routes exclusively to zero-cost open-weight models). You can also fine-tune with a preference_dial (0.0-1.0) for continuous control between cost and quality.
Can I force a specific model?
Yes. Set the model field to any provider/model identifier (e.g., openai/gpt-4o or anthropic/claude-3.5-sonnet) and the request bypasses the routing engine entirely, going directly to that specific model. You can also use routing_override.model for hard overrides while keeping other routing features like failover and caching active.
How reliable is the routing?
Xantly's routing engine achieves 99.999% routing reliability through waterfall fallback across 15 providers with 36+ API keys, automatic circuit breakers that proactively route around degraded providers, and multi-key rotation that handles billing exhaustion without request failures. If the primary model fails or times out, the gateway automatically falls back to the next best option — ensuring high availability with zero client-side retry logic.