Xantly
API Reference

Intelligence Modes

Control which pipeline stages are active per request. Choose between raw speed, cost savings, or full personalization.

Three modes

Mode  | What's active                                                             | Added latency | Use case
proxy | Auth, routing, provider call, response                                    | ~0ms overhead | Fastest possible. You have your own memory system; you just need reliable multi-provider routing with failover.
cache | Proxy + L0/L1 exact cache + semantic cache + output verification          | ~5-35ms       | Cost savings on repetitive workloads. 40-60% reduction on chatbot/classification pipelines.
full  | Everything: cache + memory cascade + entity extraction + context assembly | ~30-80ms      | Personalization, knowledge graph, conversation continuity. The "Jarvis" mode.

Setting the mode

Per-request header (highest priority)

curl https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "X-Intelligence-Mode: cache" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'

Per-request body parameter

{
  "model": "auto",
  "messages": [{"role": "user", "content": "Hello"}],
  "xantly": {
    "intelligence_mode": "cache"
  }
}

Header takes precedence over body (standard HTTP convention).

Per-tenant default

curl -X PUT https://api.xantly.com/v1/settings \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"default_intelligence_mode": "cache"}'

Priority hierarchy

  1. Request header X-Intelligence-Mode
  2. Request body xantly.intelligence_mode
  3. Tenant default (via Settings API)
  4. System default: full

What each mode activates

Feature                                 | Proxy | Cache   | Full
L0 Moka / L1 Redis exact cache          | off   | on      | on
Semantic cache (Qdrant similarity)      | off   | on      | on
Memory injection (conversation context) | off   | off     | on
Entity extraction (knowledge graph)     | off   | off     | on
Memory router cascade                   | off   | off     | on
Output verification & healing           | off   | on      | on
BaRP intelligent model selection        | 8-dim | 392-dim | 396-dim
Post-response memory storage            | off   | off     | on
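The activation matrix can also be expressed as data, which is handy for tests or client-side feature detection. The dictionary below is our own encoding of the table, not an API payload.

```python
# The on/off portion of the activation matrix, encoded as data.
# Keys are illustrative names, not official Xantly identifiers.
FEATURES = {
    "exact_cache":         {"proxy": False, "cache": True,  "full": True},
    "semantic_cache":      {"proxy": False, "cache": True,  "full": True},
    "memory_injection":    {"proxy": False, "cache": False, "full": True},
    "entity_extraction":   {"proxy": False, "cache": False, "full": True},
    "memory_cascade":      {"proxy": False, "cache": False, "full": True},
    "output_verification": {"proxy": False, "cache": True,  "full": True},
    "memory_storage":      {"proxy": False, "cache": False, "full": True},
}

def is_active(feature: str, mode: str) -> bool:
    """Look up whether a pipeline stage is active in a given mode."""
    return FEATURES[feature][mode]
```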

Per-request overrides still work

Intelligence mode AND-gates with per-request toggles: a pipeline stage runs only when the mode enables it AND the request does not disable it. A customer in full mode who sends "enable_cache": false on a specific request still gets that override respected.


Response headers

Every response includes the resolved mode:

X-Xantly-Intelligence-Mode: cache

The mode is also included in the xantly_metadata response body:

{
  "xantly_metadata": {
    "intelligence_mode": "cache",
    ...
  }
}

Voice endpoint support

All /v1/voice/* endpoints support intelligence mode via the X-Intelligence-Mode header:

curl -X POST https://api.xantly.com/v1/voice/chat \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "X-Intelligence-Mode: cache" \
  -F "[email protected]" \
  -F "language=en"

Voice responses include X-Xantly-Intelligence-Mode in the response headers.

Voice escalation: When intelligence mode resolves to proxy, voice endpoints automatically escalate to cache. Voice benefits from cache hits on repeated utterances and should not operate without caching.

Memory gating: In cache mode, voice session context persistence (turn history, chain linking) is skipped — only caching operates. In full mode, session memory is fully active.
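The voice escalation rule above amounts to a one-line mapping. Sketched here for clarity; the function name is ours:

```python
def resolve_voice_mode(mode: str) -> str:
    """Voice endpoints never run in bare proxy mode: proxy escalates
    to cache so repeated utterances can hit the cache. Other modes
    pass through unchanged."""
    return "cache" if mode == "proxy" else mode
```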


Cost savings with cache mode

For customers with repetitive workloads (chatbots, classification pipelines, RAG preprocessing):

  • Semantic cache catches intent-equivalent requests, not just exact matches
  • A cache miss adds ~30ms of overhead; a hit avoids an LLM call that would otherwise cost $0.02-$0.10 in tokens
  • At 500 repetitive calls/day with a 45% cache hit rate, that is 225 avoided calls — roughly $4.50-$22.50/day at those per-call costs
  • Typical support-chatbot volumes (thousands of calls/day) put savings in the ~$67/day, ~$2,000/month range

The overhead is the embedding generation (~15ms) and Qdrant similarity search (~5ms). Cache hits return in <5ms with zero LLM cost.
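The savings arithmetic is just avoided calls times per-call cost. A sketch, using the illustrative figures from the bullets above:

```python
def daily_savings(calls_per_day: int, hit_rate: float, cost_per_call: float) -> float:
    """Avoided LLM spend per day: every cache hit skips one paid call."""
    return calls_per_day * hit_rate * cost_per_call

# 500 calls/day at a 45% hit rate, with $0.02-$0.10 per avoided call:
low = daily_savings(500, 0.45, 0.02)    # $4.50/day
high = daily_savings(500, 0.45, 0.10)   # $22.50/day
```

Scaling the same formula to thousands of calls per day is what produces the ~$2,000/month figure quoted for typical support-chatbot volumes.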
