Guide: Cost-Optimized Routing

Use this guide to reduce spend while keeping output quality predictable.

Routing is automatic by default (`model: "auto"`). You need routing controls only when you want stricter cost or latency behavior.

Start with auto-routing

Use this as your baseline:

{
  "model": "auto",
  "messages": [{"role": "user", "content": "Summarize this customer feedback."}]
}

Then inspect response headers:

x-xantly-tier-used
x-xantly-lane-used
x-xantly-provider
x-xantly-cost-usd (when usage is available)
x-xantly-cache-hit
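To make the headers above easier to log, here is a minimal Python sketch that collects them into one dict. The function name `summarize_routing` is hypothetical, and the sketch assumes only that `x-xantly-cost-usd` carries a decimal number when present. All other values are passed through as raw strings because their exact format is not specified here.

```python
from typing import Mapping


def summarize_routing(headers: Mapping[str, str]) -> dict:
    """Collect the routing-related response headers into one dict."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Assumption: the cost header, when present, holds a decimal number.
    cost_raw = lowered.get("x-xantly-cost-usd")
    try:
        cost = float(cost_raw) if cost_raw is not None else None
    except ValueError:
        cost = None  # Unparseable cost is treated the same as missing.

    return {
        "tier": lowered.get("x-xantly-tier-used"),
        "lane": lowered.get("x-xantly-lane-used"),
        "provider": lowered.get("x-xantly-provider"),
        "cost_usd": cost,
        "cache_hit": lowered.get("x-xantly-cache-hit"),  # raw string
    }
```

Pass it the header mapping from whatever HTTP client you use. Missing headers come back as `None`, which matches the "when usage is available" caveat on the cost header.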

Routing controls, ordered by strength

1) Presets (`routing_hints.mode`)

Simple one-knob routing:

Mode	Typical intent
`fast`	Favor speed / lower latency
`balanced`	Let gateway balance speed, quality, and cost
`quality`	Favor stronger-quality paths
`cost_optimized`	Favor lower-cost paths
`free_models_only`	Prefer free-tier style behavior

"routing_hints": { "mode": "cost_optimized" }

2) Fine-grained hints (`routing_hints.*`)

Use these when preset behavior is not enough.

Field	Status	Notes
`preference_dial`	Active	Clamped to `0.0..1.0`. Lower = cheaper/faster bias; higher = quality bias.
`prefer_latency`	Active	Strong low-latency bias.
`max_latency_ms`	Active	Very low values (for example `<500`) strongly bias fast routing.
`task_complexity`	Active	`trivial`, `standard`, `complex`, `expert`.
`max_tier`	Active (best-effort)	Tier guardrail hint.
`chain_routing`	Active	`sticky` or `mixed`; `mixed` disables sticky continuation behavior.
`allow_free_fallback`	Active	Adds explicit fallback preference signal.
`prefer_quality`	Partial	Accepted; currently mainly affects how presets are applied.
`max_cost_per_token`	Advisory	Accepted for compatibility; not a strict enforcement path.
`required_capabilities`	Advisory	Accepted for compatibility; not a strict hard filter in this handler path.

Example:

"routing_hints": {
  "mode": "balanced",
  "preference_dial": 0.15,
  "max_latency_ms": 700,
  "task_complexity": "standard"
}
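If you assemble hints in application code, a small builder can catch typos before the request is sent. The following Python sketch is illustrative only: `build_routing_hints` is a hypothetical helper, the allowed values are copied from the tables above, and the rejection of non-positive latency budgets is a client-side assumption rather than documented gateway behavior.

```python
from typing import Optional

# Values listed in the preset and hint tables of this guide.
MODES = {"fast", "balanced", "quality", "cost_optimized", "free_models_only"}
COMPLEXITIES = {"trivial", "standard", "complex", "expert"}


def build_routing_hints(
    mode: Optional[str] = None,
    preference_dial: Optional[float] = None,
    max_latency_ms: Optional[int] = None,
    task_complexity: Optional[str] = None,
) -> dict:
    """Return a routing_hints dict containing only the fields that were set."""
    hints: dict = {}
    if mode is not None:
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        hints["mode"] = mode
    if preference_dial is not None:
        # Mirror the documented 0.0..1.0 clamp so logs show the effective value.
        hints["preference_dial"] = min(1.0, max(0.0, float(preference_dial)))
    if max_latency_ms is not None:
        # Assumption: a non-positive budget is a caller mistake.
        if max_latency_ms <= 0:
            raise ValueError("max_latency_ms must be positive")
        hints["max_latency_ms"] = int(max_latency_ms)
    if task_complexity is not None:
        if task_complexity not in COMPLEXITIES:
            raise ValueError(f"unknown task_complexity: {task_complexity!r}")
        hints["task_complexity"] = task_complexity
    return hints
```

Calling it with the values from the example above yields the same `routing_hints` object, and an out-of-range dial such as `1.7` is stored as `1.0`.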

3) Hard overrides (`routing_override.*`)

Use only for controlled tests and debugging.

"routing_override": {
  "force_tier": "T3",
  "force_lane": "turbo",
  "force_model": "your-model-slug"
}

`force_provider` is accepted for forward compatibility but is not directly enforced in the current chat handler path.
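One way to keep overrides out of live traffic is to strip them on the client before sending. This Python sketch assumes a hypothetical `environment` label managed by your own deployment config; the gateway does not define it, and `apply_override_policy` is an illustrative name.

```python
import copy


def apply_override_policy(payload: dict, environment: str) -> dict:
    """Return a copy of the request payload, without overrides in production."""
    # Deep-copy so the caller's payload is never mutated.
    cleaned = copy.deepcopy(payload)
    # Assumption: "production" is the label your deployment uses for live traffic.
    if environment == "production":
        cleaned.pop("routing_override", None)
    return cleaned
```

In a staging or test environment the payload passes through unchanged, so controlled experiments still work.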

Cost levers that matter most

Keep `model: "auto"` unless you truly need pinning.
Enable caching for repeatable prompts (`xantly.enable_cache: true`, the default).
Use `service_tier: "batch"` for cost-sensitive workloads.
Use a low `preference_dial` plus a latency budget (`max_latency_ms`) for high-volume pipelines.
Avoid unnecessarily high reliability modes for non-critical tasks.

Caching patterns

Exact and semantic caching can reduce cost significantly for repeated or similar traffic.

"xantly": {
  "enable_cache": true
}

Observe cache outcomes with headers:

x-xantly-cache-hit
x-xantly-cache-type (exact or semantic on cache-hit paths)
x-xantly-semantic-similarity (semantic hits)
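To turn these headers into a hit rate, you can aggregate them across responses. The Python sketch below makes one loud assumption: that `x-xantly-cache-hit` carries the string `true` on a hit. This guide does not specify the value format, so check a real response and adjust the comparison. The name `cache_stats` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def cache_stats(header_sets: Iterable[Mapping[str, str]]) -> dict:
    """Aggregate cache outcomes over a batch of response header mappings."""
    counts: Counter = Counter()
    for headers in header_sets:
        lowered = {name.lower(): value for name, value in headers.items()}
        # Assumption: a hit is signalled by the literal string "true".
        hit = lowered.get("x-xantly-cache-hit", "").strip().lower() == "true"
        if hit:
            # Cache type is "exact" or "semantic" on cache-hit paths.
            counts[lowered.get("x-xantly-cache-type", "unknown")] += 1
        else:
            counts["miss"] += 1
    total = sum(counts.values())
    hits = total - counts["miss"]
    return {
        "total": total,
        "hit_rate": hits / total if total else 0.0,
        "by_type": dict(counts),
    }
```

Splitting hits by type shows whether savings come from exact repeats or from semantic matches, which is useful when deciding how much prompt normalization is worth doing.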

Copy-paste templates

High-volume lightweight classification

{
  "model": "auto",
  "max_tokens": 32,
  "routing_hints": {
    "mode": "fast",
    "preference_dial": 0.1,
    "task_complexity": "trivial"
  },
  "messages": [
    {"role": "system", "content": "Return one label: positive, neutral, or negative."},
    {"role": "user", "content": "I love this feature."}
  ]
}

Cost-sensitive extraction with structured output

{
  "model": "auto",
  "service_tier": "batch",
  "response_format": {"type": "json_object"},
  "routing_hints": {
    "mode": "cost_optimized",
    "task_complexity": "standard"
  },
  "messages": [
    {"role": "user", "content": "Extract invoice_id, amount, and due_date from this text: ..."}
  ]
}

Validation checklist

Compare cost and latency before/after routing hints.
Verify routing intent via x-xantly-tier-used and x-xantly-lane-used.
Track cache hit rate over time.
Add overrides only in non-production experiments.
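The before/after comparison in the checklist can be sketched in Python as follows. It assumes you have already recorded `(cost_usd, latency_ms)` pairs per request, for example from `x-xantly-cost-usd` and your own client-side timing. The function name `compare_runs` and the choice of mean cost and median latency are illustrative, not prescribed by the gateway.

```python
from statistics import mean, median
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (cost_usd, latency_ms)


def _summarize(samples: List[Sample]) -> dict:
    costs = [cost for cost, _ in samples]
    latencies = [latency for _, latency in samples]
    return {"mean_cost_usd": mean(costs), "median_latency_ms": median(latencies)}


def compare_runs(before: List[Sample], after: List[Sample]) -> dict:
    """Compare two non-empty sample sets taken before and after a routing change."""
    b, a = _summarize(before), _summarize(after)
    cost_change_pct: Optional[float] = None
    if b["mean_cost_usd"] > 0:
        # Negative values mean the change reduced average cost.
        cost_change_pct = (
            100.0 * (a["mean_cost_usd"] - b["mean_cost_usd"]) / b["mean_cost_usd"]
        )
    return {
        "before": b,
        "after": a,
        "cost_change_pct": cost_change_pct,
        "latency_change_ms": a["median_latency_ms"] - b["median_latency_ms"],
    }
```

Median latency is used here because a few slow requests can distort a mean. Swap in a percentile if tail latency matters more for your workload.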