Guide: Cost-Optimized Routing
Use this guide to reduce spend while keeping output quality predictable.
Routing is automatic by default (model: "auto"). You only need routing controls when you want stricter cost/latency behavior.
Start with auto-routing
Use this as your baseline:
```json
{
  "model": "auto",
  "messages": [{"role": "user", "content": "Summarize this customer feedback."}]
}
```
Then inspect response headers:
- x-xantly-tier-used
- x-xantly-lane-used
- x-xantly-provider
- x-xantly-cost-usd (when usage is available)
- x-xantly-cache-hit
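To make the headers above easier to log, they can be collected into one record per response. The following is a minimal sketch, not the gateway's client library: `RoutingInfo` and `parse_routing_headers` are hypothetical names, and the value formats (a decimal string for cost, `"true"` for a cache hit) are assumptions that should be checked against real responses.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RoutingInfo:
    tier: Optional[str]
    lane: Optional[str]
    provider: Optional[str]
    cost_usd: Optional[float]
    cache_hit: bool


def parse_routing_headers(headers: Mapping[str, str]) -> RoutingInfo:
    """Collect the x-xantly-* routing headers from a response into one record."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}

    # Assumption: the cost header carries a decimal string. It may be absent
    # when usage is not available, in which case cost_usd stays None.
    raw_cost = h.get("x-xantly-cost-usd")
    try:
        cost = float(raw_cost) if raw_cost is not None else None
    except ValueError:
        cost = None

    # Assumption: a cache hit is signalled with "true" (or "1").
    cache_hit = h.get("x-xantly-cache-hit", "").strip().lower() in {"true", "1"}

    return RoutingInfo(
        tier=h.get("x-xantly-tier-used"),
        lane=h.get("x-xantly-lane-used"),
        provider=h.get("x-xantly-provider"),
        cost_usd=cost,
        cache_hit=cache_hit,
    )


# Example with made-up header values and no cost header.
example = parse_routing_headers({
    "X-Xantly-Tier-Used": "T3",
    "x-xantly-lane-used": "turbo",
    "x-xantly-cache-hit": "false",
})
print(example)
```

Pass in the header mapping from whichever HTTP client you use. The function only reads keys and does not depend on a specific client.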
Routing controls, ordered by strength
1) Presets (routing_hints.mode)
Simple one-knob routing:
| Mode | Typical intent |
|---|---|
| fast | Favor speed / lower latency |
| balanced | Let gateway balance speed, quality, and cost |
| quality | Favor stronger-quality paths |
| cost_optimized | Favor lower-cost paths |
| free_models_only | Prefer free-tier style behavior |
```json
"routing_hints": { "mode": "cost_optimized" }
```
2) Fine-grained hints (routing_hints.*)
Use these when preset behavior is not enough.
| Field | Status | Notes |
|---|---|---|
| preference_dial | Active | Clamped to 0.0..1.0. Lower = cheaper/faster bias; higher = quality bias. |
| prefer_latency | Active | Strong low-latency bias. |
| max_latency_ms | Active | Very low values (for example <500) strongly bias fast routing. |
| task_complexity | Active | trivial, standard, complex, expert. |
| max_tier | Active (best-effort) | Tier guardrail hint. |
| chain_routing | Active | sticky or mixed; mixed disables sticky continuation behavior. |
| allow_free_fallback | Active | Adds explicit fallback preference signal. |
| prefer_quality | Partial | Accepted; currently mainly affects how presets are applied. |
| max_cost_per_token | Advisory | Accepted for compatibility; not a strict enforcement path. |
| required_capabilities | Advisory | Accepted for compatibility; not a strict hard filter in this handler path. |
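The value ranges in the table can also be checked on the client before a request is sent, so logs show the values that will actually apply. This is a sketch under assumptions: `build_routing_hints` is a hypothetical helper, not part of any SDK, and it covers only four of the fields. The local clamp mirrors the documented 0.0..1.0 clamp; the gateway's behavior for invalid enum values is not documented here, so the sketch rejects them locally.

```python
from typing import Any, Dict, Optional

# Allowed values are taken from the tables in this guide.
VALID_MODES = {"fast", "balanced", "quality", "cost_optimized", "free_models_only"}
VALID_COMPLEXITY = {"trivial", "standard", "complex", "expert"}


def build_routing_hints(
    mode: Optional[str] = None,
    preference_dial: Optional[float] = None,
    max_latency_ms: Optional[int] = None,
    task_complexity: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a routing_hints object, omitting fields that were not set."""
    hints: Dict[str, Any] = {}
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        hints["mode"] = mode
    if preference_dial is not None:
        # Clamp locally to the documented 0.0..1.0 range.
        hints["preference_dial"] = min(1.0, max(0.0, float(preference_dial)))
    if max_latency_ms is not None:
        if max_latency_ms <= 0:
            raise ValueError("max_latency_ms must be positive")
        hints["max_latency_ms"] = int(max_latency_ms)
    if task_complexity is not None:
        if task_complexity not in VALID_COMPLEXITY:
            raise ValueError(f"unknown task_complexity: {task_complexity!r}")
        hints["task_complexity"] = task_complexity
    return hints


hints = build_routing_hints(mode="balanced", preference_dial=1.7, max_latency_ms=700)
print(hints)
```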
Example:
```json
"routing_hints": {
  "mode": "balanced",
  "preference_dial": 0.15,
  "max_latency_ms": 700,
  "task_complexity": "standard"
}
```
3) Hard overrides (routing_override.*)
Use only for controlled tests and debugging.
```json
"routing_override": {
  "force_tier": "T3",
  "force_lane": "turbo",
  "force_model": "your-model-slug"
}
```
force_provider is accepted for forward compatibility, but is not directly enforced in the current chat handler path.
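One way to keep overrides confined to tests and debugging is to strip them on the client unless the code is running in a test environment. The sketch below assumes a hypothetical `strip_overrides` helper and hypothetical environment names (`dev`, `staging`, `test`); adapt both to your own deployment setup.

```python
import copy
from typing import Any, Dict

# Hypothetical environment names where overrides are allowed.
OVERRIDE_ENVIRONMENTS = {"dev", "staging", "test"}


def strip_overrides(payload: Dict[str, Any], environment: str) -> Dict[str, Any]:
    """Return a copy of the request payload, dropping routing_override
    unless the environment is one where overrides are allowed."""
    cleaned = copy.deepcopy(payload)  # never mutate the caller's payload
    if environment not in OVERRIDE_ENVIRONMENTS:
        cleaned.pop("routing_override", None)
    return cleaned


request = {
    "model": "auto",
    "routing_override": {"force_tier": "T3", "force_lane": "turbo"},
    "messages": [{"role": "user", "content": "ping"}],
}
print(strip_overrides(request, "production"))
```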
Cost levers that matter most
- Keep model: "auto" unless you truly need pinning.
- Enable caching for repeatable prompts (xantly.enable_cache: true, default).
- Use service_tier: "batch" for cost-sensitive workloads.
- Use low preference_dial + latency budget for high-volume pipelines.
- Avoid unnecessary high reliability modes for non-critical tasks.
Caching patterns
Exact and semantic caching can reduce cost significantly for repeated or similar traffic.
```json
"xantly": {
  "enable_cache": true
}
```
Observe cache outcomes with headers:
- x-xantly-cache-hit
- x-xantly-cache-type (exact or semantic on cache-hit paths)
- x-xantly-semantic-similarity (semantic hits)
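The cache headers above can be aggregated into a hit rate across many responses. The following is a rough sketch: `cache_stats` is a hypothetical helper, and it assumes a hit is reported as the string `"true"` in `x-xantly-cache-hit`, which is not confirmed by this guide.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def cache_stats(responses: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Compute overall, exact, and semantic cache hit rates from response headers."""
    counts: Counter = Counter()
    total = 0
    for headers in responses:
        total += 1
        h = {k.lower(): v for k, v in headers.items()}
        # Assumption: a hit is reported as the string "true".
        if h.get("x-xantly-cache-hit", "").strip().lower() == "true":
            counts[h.get("x-xantly-cache-type", "unknown")] += 1
    if total == 0:
        return {"hit_rate": 0.0, "exact": 0.0, "semantic": 0.0}
    hits = sum(counts.values())
    return {
        "hit_rate": hits / total,
        "exact": counts["exact"] / total,
        "semantic": counts["semantic"] / total,
    }


# Made-up sample: one exact hit, one semantic hit, two misses.
sample = [
    {"x-xantly-cache-hit": "true", "x-xantly-cache-type": "exact"},
    {"x-xantly-cache-hit": "true", "x-xantly-cache-type": "semantic"},
    {"x-xantly-cache-hit": "false"},
    {},
]
print(cache_stats(sample))
```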
Copy-paste templates
High-volume lightweight classification
```json
{
  "model": "auto",
  "max_tokens": 32,
  "routing_hints": {
    "mode": "fast",
    "preference_dial": 0.1,
    "task_complexity": "trivial"
  },
  "messages": [
    {"role": "system", "content": "Return one label: positive, neutral, or negative."},
    {"role": "user", "content": "I love this feature."}
  ]
}
```
Cost-sensitive extraction with structured output
```json
{
  "model": "auto",
  "service_tier": "batch",
  "response_format": {"type": "json_object"},
  "routing_hints": {
    "mode": "cost_optimized",
    "task_complexity": "standard"
  },
  "messages": [
    {"role": "user", "content": "Extract invoice_id, amount, and due_date from this text: ..."}
  ]
}
```
Validation checklist
- Compare cost and latency before/after routing hints.
- Verify routing intent via x-xantly-tier-used and x-xantly-lane-used.
- Track cache hit rate over time.
- Add overrides only in non-production experiments.
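The before/after comparison in the checklist can be done with a few lines of arithmetic. This sketch assumes you have already collected `(cost_usd, latency_ms)` pairs for each run, for example cost from `x-xantly-cost-usd` and latency measured on the client. The function names and the sample numbers are illustrative only.

```python
from statistics import mean
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (cost_usd, latency_ms)


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Mean cost and mean latency for one run."""
    if not samples:
        raise ValueError("need at least one sample")
    return {
        "mean_cost_usd": mean(c for c, _ in samples),
        "mean_latency_ms": mean(l for _, l in samples),
    }


def percent_change(before: List[Sample], after: List[Sample]) -> Dict[str, float]:
    """Percent change of each mean, relative to the 'before' run.
    Negative values mean the metric went down after the change."""
    b, a = summarize(before), summarize(after)
    if any(v == 0 for v in b.values()):
        raise ValueError("baseline mean is zero; percent change is undefined")
    return {k: (a[k] - b[k]) / b[k] * 100.0 for k in b}


# Made-up numbers for illustration.
before = [(0.004, 800.0), (0.006, 1200.0)]
after = [(0.002, 900.0), (0.003, 1100.0)]
print(percent_change(before, after))
```

Means hide tail behavior, so for latency in particular a percentile comparison may be more informative once you have enough samples.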