Intelligent Routing
Xantly's routing engine automatically selects the optimal model for each request based on task complexity, latency requirements, cost constraints, and historical performance. No configuration required — but full control when you want it.
Why routing matters
LLMs vary dramatically in cost, speed, and capability:
| | Speed Tier | Value Tier | Quality Tier |
|---|---|---|---|
| Latency | 100-300ms | 300-800ms | 800-3000ms |
| Cost | $0.10-0.50/M tokens | $0.50-3.00/M tokens | $3.00-15.00/M tokens |
| Best for | Simple queries, chat, classification | Code, summarization, extraction | Complex reasoning, analysis |
Sending every request to GPT-4o wastes money on simple tasks. Sending everything to a cheap model degrades quality on hard ones. Intelligent routing solves this by matching each request to the right tier automatically.
How auto-routing works
When you send model: "auto", the gateway runs a three-step process:
Step 1: Task analysis
The request is classified by type and complexity:
- Task type: Code generation, factual QA, creative writing, structured extraction, tool calling, SQL, reasoning, or general-purpose
- Complexity signals: Message length, conversation depth, schema requirements, tool definitions, required capabilities
This classification happens in under 2ms and never adds perceptible latency.
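The real classifier is internal to the gateway, but the signals above can be sketched as a simple heuristic. Everything here is illustrative — the function and field names (`classify_request`, `Classification`) are hypothetical, and the thresholds are made up for the example:

```python
from dataclasses import dataclass

@dataclass
class Classification:
    task_type: str
    complexity: str

def classify_request(messages, tools=None, response_schema=None):
    """Rough complexity heuristic from the signals listed above:
    message length, conversation depth, tool definitions, schema requirements."""
    text = " ".join(m.get("content", "") for m in messages)
    score = 0
    score += len(text) // 500             # longer prompts -> harder
    score += len(messages) // 4           # deeper conversations -> harder
    score += 2 if tools else 0            # tool definitions add complexity
    score += 2 if response_schema else 0  # structured output adds complexity
    if score >= 4:
        complexity = "complex"
    elif score >= 2:
        complexity = "standard"
    else:
        complexity = "trivial"
    task_type = "tool_calling" if tools else "general"
    return Classification(task_type, complexity)
```

Because it is a handful of length checks and lookups rather than a model call, a classifier of this shape can stay well under the 2ms budget.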
Step 2: Model selection
A machine learning model evaluates the classified request against the available model pool. It considers:
- Task-model fit: Which models perform best for this task type
- Latency budget: How fast the response needs to be
- Cost efficiency: The cheapest model that meets quality requirements
- Memory context: Whether cached context reduces the need for a frontier model
- Provider health: Current availability and latency of each provider
The selection model improves continuously — it learns from every request which model performs best for which task type and tenant workload.
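To make the selection criteria concrete, here is a hand-tuned sketch of tier selection as a weighted score. The real selection model is learned, not a fixed formula like this; candidate fields (`quality`, `cost_per_m`, `p50_latency_ms`, `healthy`) are assumptions for the example, and the cost normalizer (15, the top of the quality-tier cost band) is just a convenient scale:

```python
def select_model(candidates, complexity, max_latency_ms=None, dial=0.5):
    """candidates: dicts with quality (per-complexity fit), cost_per_m,
    p50_latency_ms, healthy. dial: lower favors cost, higher favors quality."""
    def score(m):
        if not m["healthy"]:
            return float("-inf")          # provider health gate
        if max_latency_ms is not None and m["p50_latency_ms"] > max_latency_ms:
            return float("-inf")          # latency budget gate
        fit = m["quality"][complexity]    # task-model fit for this complexity
        cost = m["cost_per_m"] / 15.0     # normalize to the $15/M top of the range
        return dial * fit - (1 - dial) * cost
    return max(candidates, key=score)
```

With a low dial and a trivial task, a cheap model wins; with a high dial and a complex task, the frontier model's quality advantage outweighs its cost.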
Step 3: Execution with fallback
The selected model receives the request. If the primary provider fails or times out, the gateway automatically falls back to the next best option. This waterfall mechanism ensures high availability without any client-side retry logic.
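The waterfall can be sketched as a loop over a ranked model list, falling through on timeout or error. This is an illustration of the pattern, not the gateway's implementation; `call_model` is a hypothetical provider-call function:

```python
def execute_with_fallback(request, ranked_models, call_model):
    """Try each model in ranked order; fall through on transient failures."""
    errors = []
    for model in ranked_models:
        try:
            return call_model(model, request)   # primary first, then fallbacks
        except (TimeoutError, ConnectionError) as exc:
            errors.append((model, exc))         # record and continue down the list
    raise RuntimeError(f"all providers failed: {errors}")
```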
Request → Primary model → ✓ Response
→ ✗ Timeout → Fallback model → ✓ Response
        → ✗ Error → Second fallback → ✓ Response

Routing tiers
Every model in the catalog is assigned to a tier:
Speed tier
Fast, cost-efficient models for straightforward tasks. Typical latency: 100-300ms.
- Best for: chat, classification, simple factual questions, translation
- When auto-routing selects this: low complexity, no tool calls, short messages
Value tier
Balanced models that handle most production workloads. Typical latency: 300-800ms.
- Best for: code generation, summarization, data extraction, moderate reasoning
- When auto-routing selects this: medium complexity, standard tool calls, typical business queries
Quality tier
Frontier models for the hardest tasks. Typical latency: 800-3000ms.
- Best for: multi-step reasoning, complex analysis, creative writing, long-context synthesis
- When auto-routing selects this: high complexity, deep reasoning required, schema-heavy structured output
Controlling routing
Auto-routing works out of the box, but you can influence or override it.
Presets (simplest)
{
"model": "auto",
"routing_hints": { "mode": "cost_optimized" }
}

| Mode | Effect |
|---|---|
| fast | Bias toward speed tier |
| balanced | Let the engine decide (default) |
| quality | Bias toward quality tier |
| cost_optimized | Minimize cost aggressively |
Fine-grained hints
{
"model": "auto",
"routing_hints": {
"preference_dial": 0.2,
"max_latency_ms": 500,
"task_complexity": "standard"
}
}

- preference_dial (0.0-1.0): Lower = cheaper/faster, higher = quality. Default: 0.5.
- max_latency_ms: Soft latency budget. Values under 500ms strongly bias toward speed tier.
- task_complexity: trivial, standard, complex, or expert. Overrides automatic complexity detection.
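If you build request bodies in code, a small helper keeps the hints consistent. The field names (model, routing_hints, preference_dial, max_latency_ms, task_complexity) follow this page; the validation rules are illustrative, not part of the API contract:

```python
VALID_COMPLEXITY = {"trivial", "standard", "complex", "expert"}

def build_request(messages, dial=0.5, max_latency_ms=None, task_complexity=None):
    """Assemble an auto-routed request body with optional routing hints."""
    if not 0.0 <= dial <= 1.0:
        raise ValueError("preference_dial must be in [0.0, 1.0]")
    if task_complexity is not None and task_complexity not in VALID_COMPLEXITY:
        raise ValueError(f"task_complexity must be one of {sorted(VALID_COMPLEXITY)}")
    hints = {"preference_dial": dial}
    if max_latency_ms is not None:
        hints["max_latency_ms"] = max_latency_ms
    if task_complexity is not None:
        hints["task_complexity"] = task_complexity
    return {"model": "auto", "routing_hints": hints, "messages": messages}
```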
Direct model selection
Skip routing entirely by specifying a model:
{
"model": "openai/gpt-4o",
"messages": [...]
}

This bypasses the routing engine and sends directly to the specified model.
Continuous learning
The routing engine improves with every request:
- Each response generates a feedback signal — latency, cost, success/failure, and quality indicators
- The model selection algorithm updates online — no batch retraining, no downtime
- Per-tenant adaptation — the system learns your specific workload patterns
- Drift detection — if a model's performance changes, the router adapts automatically
After processing a few hundred requests, the routing engine is typically 30-50% more cost-efficient than static model selection — while maintaining equivalent or better output quality.
Response headers
Every response includes routing metadata:
| Header | Description |
|---|---|
x-xantly-tier-used | T1 (quality), T2 (value), or T3 (speed) |
x-xantly-lane-used | turbo or smart |
x-xantly-provider | Which provider served the request |
x-xantly-model | Exact model identifier |
x-xantly-cost-usd | Cost of this specific request |
x-xantly-latency-ms | End-to-end latency |
Use these headers to monitor routing decisions and tune your hints.
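For example, a monitoring hook might extract the headers above into a structured record. This is a minimal sketch that treats the response headers as a plain dict; only the header names come from the table:

```python
def routing_metadata(headers):
    """Pull Xantly routing headers into a record for logging/monitoring."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    return {
        "tier": h.get("x-xantly-tier-used"),
        "lane": h.get("x-xantly-lane-used"),
        "provider": h.get("x-xantly-provider"),
        "model": h.get("x-xantly-model"),
        "cost_usd": float(h.get("x-xantly-cost-usd", 0.0)),
        "latency_ms": int(h.get("x-xantly-latency-ms", 0)),
    }
```

Aggregating these records over time shows which tiers your traffic actually lands on, which is the feedback loop for tuning preference_dial and max_latency_ms.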
What's next
- Caching & Performance — How caching eliminates redundant LLM calls
- Cost-Optimized Routing Guide — Advanced routing patterns and controls
- Memory & Context — How memory makes routing smarter over time
Frequently Asked Questions
How does intelligent routing work?
When you send model: "auto", the routing engine analyzes 15 request parameters — including task type, complexity, message length, conversation depth, schema requirements, tool definitions, and required capabilities — classifies the request in under 2ms, and selects the optimal model from the available pool based on task-model fit, latency budget, cost efficiency, memory context, and real-time provider health.
What routing modes are available?
Xantly offers six routing modes: auto (the engine decides based on request analysis), fast (biases toward speed-tier models with 100-300ms latency), balanced (default, lets the engine weigh all factors), quality (biases toward frontier models for complex reasoning), cost_optimized (minimizes cost aggressively), and free_models_only (routes exclusively to zero-cost open-weight models). You can also fine-tune with a preference_dial (0.0-1.0) for continuous control between cost and quality.
Can I force a specific model?
Yes. Set the model field to any provider/model identifier (e.g., openai/gpt-4o or anthropic/claude-3.5-sonnet) and the request bypasses the routing engine entirely, going directly to that specific model. You can also use routing_override.model for hard overrides while keeping other routing features like failover and caching active.
How reliable is the routing?
Xantly's routing engine achieves 99.999% routing reliability through waterfall fallback across 15 providers with 36+ API keys, automatic circuit breakers that proactively route around degraded providers, and multi-key rotation that handles billing exhaustion without request failures. If the primary model fails or times out, the gateway automatically falls back to the next best option — ensuring high availability with zero client-side retry logic.