# Intelligence Modes
Control which pipeline stages are active per request. Choose between raw speed, cost savings, or full personalization.
## Three modes
| Mode | What's active | Added latency | Use case |
|---|---|---|---|
| `proxy` | Auth, routing, provider call, response | ~0ms overhead | Fastest possible. You have your own memory system; you just need reliable multi-provider routing with failover. |
| `cache` | Proxy + L0/L1 exact cache + semantic cache + output verification | ~5-35ms | Cost savings on repetitive workloads. 40-60% reduction on chatbot/classification pipelines. |
| `full` | Everything: cache + memory cascade + entity extraction + context assembly | ~30-80ms | Personalization, knowledge graph, conversation continuity. The "Jarvis" mode. |
## Setting the mode
### Per-request header (highest priority)
```bash
curl https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "X-Intelligence-Mode: cache" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
```

### Per-request body parameter
```json
{
  "model": "auto",
  "messages": [{"role": "user", "content": "Hello"}],
  "xantly": {
    "intelligence_mode": "cache"
  }
}
```

The header takes precedence over the body (standard HTTP convention).
### Per-tenant default
```bash
curl -X PUT https://api.xantly.com/v1/settings \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"default_intelligence_mode": "cache"}'
```

### Priority hierarchy
1. Request header `X-Intelligence-Mode`
2. Request body `xantly.intelligence_mode`
3. Tenant default (via the Settings API)
4. System default: `full`
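The hierarchy above can be sketched as a simple fall-through check. This is an illustrative sketch only — the function and parameter names (`resolve_mode`, `tenant_default`) are assumptions, not Xantly's actual implementation:

```python
# Illustrative sketch of the priority hierarchy; names are assumptions.
SYSTEM_DEFAULT = "full"

def resolve_mode(headers, body, tenant_default=None):
    """Return the effective intelligence mode, checking highest priority first."""
    if "X-Intelligence-Mode" in headers:                       # 1. request header
        return headers["X-Intelligence-Mode"]
    body_mode = body.get("xantly", {}).get("intelligence_mode")
    if body_mode:                                              # 2. request body
        return body_mode
    if tenant_default:                                         # 3. tenant default
        return tenant_default
    return SYSTEM_DEFAULT                                      # 4. system default

# Header beats body, body beats tenant default:
print(resolve_mode({"X-Intelligence-Mode": "proxy"},
                   {"xantly": {"intelligence_mode": "cache"}},
                   "full"))        # proxy
print(resolve_mode({}, {}))        # full
```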
## What each mode activates
| Feature | Proxy | Cache | Full |
|---|---|---|---|
| L0 Moka / L1 Redis exact cache | off | on | on |
| Semantic cache (Qdrant similarity) | off | on | on |
| Memory injection (conversation context) | off | off | on |
| Entity extraction (knowledge graph) | off | off | on |
| Memory router cascade | off | off | on |
| Output verification & healing | off | on | on |
| BaRP intelligent model selection | 8-dim | 392-dim | 396-dim |
| Post-response memory storage | off | off | on |
## Per-request overrides still work
Intelligence mode AND-gates with per-request toggles: a customer in `full` mode who sends `"enable_cache": false` on a specific request still has that override respected.
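The AND-gate semantics can be sketched as follows. The feature sets mirror the table above, but the function name and toggle-key convention (`enable_<feature>`) are assumptions for illustration, not the real server code:

```python
# Hedged sketch of the AND-gate between mode and per-request toggles.
MODE_FEATURES = {
    "proxy": set(),
    "cache": {"cache", "verification"},
    "full":  {"cache", "verification", "memory", "entity_extraction"},
}

def feature_active(mode, feature, request_toggles):
    """A feature runs only if the mode enables it AND the request allows it."""
    enabled_by_mode = feature in MODE_FEATURES[mode]
    # Per-request toggles can disable a feature, but can never enable
    # one that the resolved mode does not already include.
    allowed_by_request = request_toggles.get("enable_" + feature, True)
    return enabled_by_mode and allowed_by_request

print(feature_active("full", "cache", {"enable_cache": False}))   # False: override wins
print(feature_active("proxy", "cache", {"enable_cache": True}))   # False: mode AND-gates
print(feature_active("full", "cache", {}))                        # True
```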
## Response headers
Every response includes the resolved mode:
```
X-Xantly-Intelligence-Mode: cache
```

The mode is also included in the `xantly_metadata` response body:
```json
{
  "xantly_metadata": {
    "intelligence_mode": "cache",
    ...
  }
}
```

## Voice endpoint support
All `/v1/voice/*` endpoints support intelligence mode via the `X-Intelligence-Mode` header:
```bash
curl -X POST https://api.xantly.com/v1/voice/chat \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "X-Intelligence-Mode: cache" \
  -F "[email protected]" \
  -F "language=en"
```

Voice responses include `X-Xantly-Intelligence-Mode` in the response headers.
**Voice escalation:** When the intelligence mode resolves to `proxy`, voice endpoints automatically escalate to `cache`. Voice benefits from cache hits on repeated utterances and should not operate without caching.

**Memory gating:** In `cache` mode, voice session context persistence (turn history, chain linking) is skipped; only caching operates. In `full` mode, session memory is fully active.
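The escalation and gating rules above amount to two small checks. A minimal sketch, with hypothetical function names (`resolve_voice_mode`, `voice_session_memory_active`) chosen for illustration:

```python
# Sketch of voice-mode escalation and memory gating; names are assumptions.
def resolve_voice_mode(requested_mode):
    """Voice never runs bare proxy: escalate so repeated utterances can cache-hit."""
    if requested_mode == "proxy":
        return "cache"
    return requested_mode

def voice_session_memory_active(requested_mode):
    """Turn history and chain linking persist only in full mode."""
    return resolve_voice_mode(requested_mode) == "full"

print(resolve_voice_mode("proxy"))           # cache
print(voice_session_memory_active("cache"))  # False: only caching operates
print(voice_session_memory_active("full"))   # True
```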
## Cost savings with cache mode
For customers with repetitive workloads (chatbots, classification pipelines, RAG preprocessing):
- Semantic cache catches intent-equivalent requests (not just exact matches)
- A request that would cost $0.02-$0.10 in LLM tokens adds ~30ms of overhead
- At 5,000 repetitive calls/day with a 45% cache hit rate: ~$67/day saved
- Monthly savings: ~$2,000 for typical support chatbot volumes
The overhead is the embedding generation (~15ms) and Qdrant similarity search (~5ms). Cache hits return in <5ms with zero LLM cost.
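The savings figures can be checked with back-of-envelope arithmetic. The inputs below are illustrative assumptions: ~5,000 calls/day and an average of ~$0.03 in avoided LLM cost per cache hit (within the $0.02-$0.10 range above):

```python
# Back-of-envelope savings check; inputs are illustrative assumptions.
calls_per_day = 5_000
hit_rate = 0.45           # 45% of calls served from cache
avg_cost_per_hit = 0.03   # USD of LLM spend avoided per cache hit (assumed)

daily_savings = calls_per_day * hit_rate * avg_cost_per_hit
print(f"~${daily_savings:.2f}/day")          # ~$67.50/day
print(f"~${daily_savings * 30:,.0f}/month")  # ~$2,025/month
```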