# Intelligence Modes
Control which pipeline stages are active per request. Choose between raw speed, cost savings, or full personalization.
## Three modes
| Mode | What's active | Added latency | Use case |
|---|---|---|---|
| `proxy` | Auth, routing, provider call, response | ~0ms overhead | Fastest possible. You have your own memory system; you just need reliable multi-provider routing with failover. |
| `cache` | Proxy + L0/L1 exact cache + semantic cache + output verification | ~5-35ms | Cost savings on repetitive workloads. 40-60% reduction on chatbot/classification pipelines. |
| `full` | Everything: cache + memory cascade + entity extraction + context assembly | ~30-80ms | Personalization, knowledge graph, conversation continuity. The "Jarvis" mode. |
## Setting the mode
### Per-request header (highest priority)
```bash
curl https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "X-Intelligence-Mode: cache" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
```

### Per-request body parameter
```json
{
  "model": "auto",
  "messages": [{"role": "user", "content": "Hello"}],
  "xantly": {
    "intelligence_mode": "cache"
  }
}
```

The header takes precedence over the body (standard HTTP convention).
### Per-tenant default
```bash
curl -X PUT https://api.xantly.com/v1/settings \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"default_intelligence_mode": "cache"}'
```

### Priority hierarchy
1. Request header `X-Intelligence-Mode`
2. Request body `xantly.intelligence_mode`
3. Tenant default (via the Settings API)
4. System default: `full`
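The hierarchy above can be sketched as a simple fall-through check. This is an illustrative sketch only — the function and parameter names (`resolve_mode`, `tenant_default`) are assumptions, not Xantly's actual implementation:

```python
# Illustrative sketch of the priority hierarchy; names are assumptions.
SYSTEM_DEFAULT = "full"

def resolve_mode(headers, body, tenant_default=None):
    """Return the effective intelligence mode, checking highest priority first."""
    if "X-Intelligence-Mode" in headers:                       # 1. request header
        return headers["X-Intelligence-Mode"]
    body_mode = body.get("xantly", {}).get("intelligence_mode")
    if body_mode:                                              # 2. request body
        return body_mode
    if tenant_default:                                         # 3. tenant default
        return tenant_default
    return SYSTEM_DEFAULT                                      # 4. system default

# Header beats body, body beats tenant default:
print(resolve_mode({"X-Intelligence-Mode": "proxy"},
                   {"xantly": {"intelligence_mode": "cache"}},
                   "full"))        # proxy
print(resolve_mode({}, {}))        # full
```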
## What each mode activates
| Feature | Proxy | Cache | Full |
|---|---|---|---|
| L0 Moka / L1 Redis exact cache | off | on | on |
| Semantic cache (Qdrant similarity) | off | on | on |
| Memory injection (conversation context) | off | off | on |
| Entity extraction (knowledge graph) | off | off | on |
| Memory router cascade | off | off | on |
| Output verification & healing | off | on | on |
| BaRP intelligent model selection | 8-dim | 392-dim | 396-dim |
| Post-response memory storage | off | off | on |
## Per-request overrides still work
Intelligence mode AND-gates with per-request toggles: a customer in `full` mode who sends `"enable_cache": false` on a specific request still has that override respected.
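The AND-gate semantics can be sketched as follows. The feature sets mirror the table above, but the function name and toggle-key convention (`enable_<feature>`) are assumptions for illustration, not the real server code:

```python
# Hedged sketch of the AND-gate between mode and per-request toggles.
MODE_FEATURES = {
    "proxy": set(),
    "cache": {"cache", "verification"},
    "full":  {"cache", "verification", "memory", "entity_extraction"},
}

def feature_active(mode, feature, request_toggles):
    """A feature runs only if the mode enables it AND the request allows it."""
    enabled_by_mode = feature in MODE_FEATURES[mode]
    # Per-request toggles can disable a feature, but can never enable
    # one that the resolved mode does not already include.
    allowed_by_request = request_toggles.get("enable_" + feature, True)
    return enabled_by_mode and allowed_by_request

print(feature_active("full", "cache", {"enable_cache": False}))   # False: override wins
print(feature_active("proxy", "cache", {"enable_cache": True}))   # False: mode AND-gates
print(feature_active("full", "cache", {}))                        # True
```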
## Response headers
Every response includes the resolved mode:
```
X-Xantly-Intelligence-Mode: cache
```

The mode is also included in the `xantly_metadata` response body:
```json
{
  "xantly_metadata": {
    "intelligence_mode": "cache",
    ...
  }
}
```

## Voice endpoint support
All `/v1/voice/*` endpoints support intelligence mode via the `X-Intelligence-Mode` header:
```bash
curl -X POST https://api.xantly.com/v1/voice/chat \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "X-Intelligence-Mode: cache" \
  -F "[email protected]" \
  -F "language=en"
```

Voice responses include `X-Xantly-Intelligence-Mode` in the response headers.
**Voice escalation:** When the intelligence mode resolves to `proxy`, voice endpoints automatically escalate to `cache`. Voice benefits from cache hits on repeated utterances and should not operate without caching.

**Memory gating:** In `cache` mode, voice session context persistence (turn history, chain linking) is skipped; only caching operates. In `full` mode, session memory is fully active.
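The escalation and gating rules above amount to two small checks. A minimal sketch, with hypothetical function names (`resolve_voice_mode`, `voice_session_memory_active`) chosen for illustration:

```python
# Sketch of voice-mode escalation and memory gating; names are assumptions.
def resolve_voice_mode(requested_mode):
    """Voice never runs bare proxy: escalate so repeated utterances can cache-hit."""
    if requested_mode == "proxy":
        return "cache"
    return requested_mode

def voice_session_memory_active(requested_mode):
    """Turn history and chain linking persist only in full mode."""
    return resolve_voice_mode(requested_mode) == "full"

print(resolve_voice_mode("proxy"))           # cache
print(voice_session_memory_active("cache"))  # False: only caching operates
print(voice_session_memory_active("full"))   # True
```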
## Cost savings with cache mode
For customers with repetitive workloads (chatbots, classification pipelines, RAG preprocessing):
- Semantic cache catches intent-equivalent requests (not just exact matches)
- A request that would cost $0.02-$0.10 in LLM tokens adds ~30ms of overhead
- At 5,000 repetitive calls/day with a 45% cache hit rate: ~$67/day saved
- Monthly savings: ~$2,000 for typical support chatbot volumes
The overhead is the embedding generation (~15ms) and Qdrant similarity search (~5ms). Cache hits return in <5ms with zero LLM cost.
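The savings figures can be checked with back-of-envelope arithmetic. The inputs below are illustrative assumptions: ~5,000 calls/day and an average of ~$0.03 in avoided LLM cost per cache hit (within the $0.02-$0.10 range above):

```python
# Back-of-envelope savings check; inputs are illustrative assumptions.
calls_per_day = 5_000
hit_rate = 0.45           # 45% of calls served from cache
avg_cost_per_hit = 0.03   # USD of LLM spend avoided per cache hit (assumed)

daily_savings = calls_per_day * hit_rate * avg_cost_per_hit
print(f"~${daily_savings:.2f}/day")          # ~$67.50/day
print(f"~${daily_savings * 30:,.0f}/month")  # ~$2,025/month
```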