Chat Completions

Create completions with OpenAI-compatible request/response shapes plus optional Xantly orchestration controls.

  • POST /v1/chat/completions
  • HEAD /v1/chat/completions (returns 204 No Content; useful as a lightweight probe)
  • Auth: Authorization: Bearer <token>

Recommended rollout: start with standard fields only (model, messages, temperature, max_tokens), then add orchestration controls incrementally.


Quick start example

```shell
curl -sS https://api.xantly.com/v1/chat/completions \
  -H "Authorization: Bearer $XANTLY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Explain vector databases in 2 bullets."}
    ]
  }'
```
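
The same request can be made from Python with only the standard library. This is a minimal sketch: `build_quick_start_request` is a hypothetical helper, not part of any Xantly SDK, and the key is read from the same XANTLY_API_KEY environment variable the curl example uses. Building the request does not contact the server; the commented-out `urlopen` call is what would send it.

```python
import json
import os
import urllib.request

API_URL = "https://api.xantly.com/v1/chat/completions"


# Hypothetical helper (not a Xantly SDK function).
def build_quick_start_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the quick-start POST request."""
    payload = {
        "model": "auto",
        "messages": [
            {"role": "user", "content": "Explain vector databases in 2 bullets."}
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_quick_start_request(os.environ.get("XANTLY_API_KEY", ""))
    # Sending performs a real network call; uncomment to run against the API.
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
    print(req.get_method(), req.full_url)
```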

Request body

Standard parameters (OpenAI-compatible)

| Field | Type | Required | Default | Validation / behavior |
|---|---|---|---|---|
| model | string | Yes | | Use "auto" (recommended) or a valid catalog model slug/upstream name. Unknown models are rejected when the catalog is loaded. |
| messages | array<ChatMessage> | Yes | | Must contain at least 1 message. |
| stream | boolean | No | false | Enables SSE output. |
| n | integer | No | 1 | Allowed range: 1..=8. |
| max_tokens | integer | No | provider/model dependent | Alias supported: max_completion_tokens. |
| temperature | number | No | provider/model dependent | Validated range: 0.0..=2.0. |
| top_p | number | No | provider/model dependent | Passed through to provider. |
| frequency_penalty | number | No | provider/model dependent | Validated range: -2.0..=2.0. |
| presence_penalty | number | No | provider/model dependent | Validated range: -2.0..=2.0. |
| stop | string or array<string> | No | null | Passed through to provider. |
| tools | array<object> | No | null | Tool definitions for function-calling workflows. |
| tool_choice | string or object | No | provider dependent | Passed through to provider. |
| parallel_tool_calls | boolean | No | provider dependent | Passed through to provider. |
| response_format | object | No | null | Supports {"type":"json_object"} and {"type":"json_schema", "json_schema": {...}}. |
| seed | integer (u64) | No | null | Determinism hint. |
| user | string | No | null | End-user identifier, forwarded to provider. |
| logprobs | boolean | No | null | Forwarded to provider. |
| top_logprobs | integer (u8) | No | null | Validated max: 20. |
| stream_options.include_usage | boolean | No | false | Forwarded to provider where supported. |
| reasoning_effort | string | No | null | Accepted values: low, medium, high (provider-dependent behavior). |
| service_tier | string | No | null | "batch" is accepted and treated as a cost-oriented routing hint. |
| metadata | object<string,string> | No | null | Free-form metadata map used for tracing/routing context. |
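
Several of these ranges are enforced server-side and produce a 400 validation_error. A client can mirror them to fail fast before a round trip. A minimal sketch, assuming only the ranges listed in the table above; `check_standard_params` is a hypothetical helper, and the server stays authoritative (catalog-aware model validation, in particular, cannot be reproduced locally).

```python
from typing import Any

# Documented inclusive ranges for server-validated numeric fields.
_RANGES = {
    "n": (1, 8),
    "temperature": (0.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "top_logprobs": (0, 20),
}


# Hypothetical helper: local pre-check only; the gateway remains the source of truth.
def check_standard_params(body: dict[str, Any]) -> list[str]:
    """Return a list of problems with the standard fields (empty if none found)."""
    problems: list[str] = []
    if not isinstance(body.get("model"), str) or not body["model"]:
        problems.append("model is required")
    messages = body.get("messages")
    if not isinstance(messages, list) or len(messages) < 1:
        problems.append("messages must contain at least 1 message")

    for field, (low, high) in _RANGES.items():
        value = body.get(field)
        if value is not None and not (low <= value <= high):
            problems.append(f"{field} ({value}) must be between {low} and {high}")

    effort = body.get("reasoning_effort")
    if effort is not None and effort not in ("low", "medium", "high"):
        problems.append(f"reasoning_effort ({effort}) must be low, medium, or high")
    return problems
```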

ChatMessage object

| Field | Type | Required | Notes |
|---|---|---|---|
| role | string | Yes | Common values: system, user, assistant, tool. |
| content | string or array | Yes | String for text; array for multimodal content parts. |
| name | string | No | Optional participant name. |
| tool_calls | array<object> | No | Assistant tool call payloads. |
| tool_call_id | string | No | Correlates tool output with a prior tool call. |
| refusal | string | No | Optional refusal text. |
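
In a function-calling exchange, tool_call_id links each role "tool" message to an entry in a preceding assistant message's tool_calls. The sketch below assumes the OpenAI-style tool_calls shape (id, type, function.name, function.arguments as a JSON string); the get_weather tool and its arguments are made up for illustration, and `unmatched_tool_results` is a hypothetical helper that flags tool results whose call was never issued.

```python
import json

messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                # Hypothetical tool; OpenAI-style arguments are a JSON-encoded string.
                "function": {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})},
            }
        ],
    },
    # The tool result points back at the call via tool_call_id.
    {"role": "tool", "tool_call_id": "call_1", "content": json.dumps({"temp_c": 18})},
]


# Hypothetical helper, not part of the gateway API.
def unmatched_tool_results(msgs: list[dict]) -> list[str]:
    """Return tool_call_ids of tool messages with no earlier matching assistant tool call."""
    issued: set[str] = set()
    unmatched: list[str] = []
    for msg in msgs:
        if msg["role"] == "assistant":
            issued.update(call["id"] for call in msg.get("tool_calls") or [])
        elif msg["role"] == "tool":
            call_id = msg.get("tool_call_id", "")
            if call_id not in issued:
                unmatched.append(call_id)
    return unmatched
```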

Proprietary parameters (Xantly)

These are optional. If omitted, the gateway uses defaults and automatic policy.

routing_hints (soft preferences)

| Field | Type | Value range / values | Backend status | Usage guidance |
|---|---|---|---|---|
| mode | string | fast, balanced, quality, cost_optimized, free_models_only | Active | Coarse routing preset, used when explicit preference knobs are not set. |
| preference_dial | number | Best used in 0.0..1.0 (values are clamped) | Active | Lower values bias toward cost/speed; higher values bias toward quality. |
| prefer_latency | boolean | true/false | Active | true strongly biases toward low-latency execution. |
| prefer_quality | boolean | true/false | Partial | Accepted; its main current effect is to suppress mode preset behavior when explicit knobs are present. |
| max_cost_per_token | number | positive float | Reserved / advisory | Accepted for forward compatibility. Do not rely on strict enforcement yet. |
| max_latency_ms | integer | positive integer | Active | Sets a latency budget; very low budgets bias toward faster lanes. |
| max_tier | integer | 1, 2, 3 | Active | Tier guardrail hint. |
| required_capabilities | array<string> | free-form strings | Reserved / advisory | Accepted for forward compatibility; not a strict hard filter in this handler path. |
| task_complexity | string | trivial, standard, complex, expert | Active | Influences tier floor/ceiling behavior. |
| chain_routing | string | sticky, mixed | Active | mixed disables sticky-route continuation behavior. |
| allow_free_fallback | boolean | true/false | Active | Passed into metadata as an explicit fallback preference signal. |
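
The hints travel in a routing_hints object next to the standard fields (the object name is taken from this section's heading). A minimal sketch that attaches a few Active hints; `with_routing_hints` is hypothetical, and the local clamp of preference_dial to 0.0..1.0 assumes that is the range the server clamps to, so the payload you log matches what takes effect. Per the table, mode is only a fallback preset, so pairing it with explicit knobs such as preference_dial may leave it without effect.

```python
from typing import Any, Optional

ROUTING_MODES = {"fast", "balanced", "quality", "cost_optimized", "free_models_only"}


# Hypothetical helper; returns a new dict and never mutates the caller's body.
def with_routing_hints(
    body: dict[str, Any],
    mode: Optional[str] = None,
    preference_dial: Optional[float] = None,
    max_latency_ms: Optional[int] = None,
) -> dict[str, Any]:
    """Return a copy of body with a routing_hints block holding only the hints that were set."""
    hints: dict[str, Any] = {}
    if mode is not None:
        if mode not in ROUTING_MODES:
            raise ValueError(f"unknown routing mode: {mode}")
        hints["mode"] = mode
    if preference_dial is not None:
        # Assumption: the server clamps to 0.0..1.0; clamp locally so logs match.
        hints["preference_dial"] = min(1.0, max(0.0, preference_dial))
    if max_latency_ms is not None:
        if max_latency_ms <= 0:
            raise ValueError("max_latency_ms must be a positive integer")
        hints["max_latency_ms"] = max_latency_ms
    return {**body, "routing_hints": hints} if hints else dict(body)
```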

routing_override (harder overrides)

| Field | Type | Values | Backend status | Usage guidance |
|---|---|---|---|---|
| force_tier | string | T1, T2, T3 (also tier-1, tier1, etc.) | Active | Forces the selected tier mapping. |
| force_lane | string | smart, turbo | Active | Forces the lane. |
| force_model | string | model slug/upstream id | Active | Pins the model after routing. |
| force_provider | string | provider identifier | Reserved | Accepted in the schema; not directly applied in this chat handler flow today. |
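
Because force_tier accepts several spellings, a client that reads tier names from configuration may want to normalize them before logging or comparing against xantly_metadata.tier. A hypothetical sketch: `normalize_tier` accepts the spellings listed above (T1, tier-1, tier1) in any case; whatever else the "etc." covers is not reproduced, so unknown values raise rather than guess. `debug_override` is likewise hypothetical and, per the rollout checklist, is meant for test/debug traffic only.

```python
import re
from typing import Any

# Matches "T1", "tier1", "tier-1" (case-insensitive) and captures the digit.
_TIER_PATTERN = re.compile(r"^(?:t|tier-?)([123])$", re.IGNORECASE)


# Hypothetical helper; covers only the spellings documented above.
def normalize_tier(value: str) -> str:
    """Map T1 / tier-1 / tier1 (any case) to the canonical "T1".."T3" form."""
    match = _TIER_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized tier: {value!r}")
    return f"T{match.group(1)}"


# Hypothetical helper for test/debug traffic only.
def debug_override(body: dict[str, Any], tier: str, lane: str = "smart") -> dict[str, Any]:
    """Return a copy of body that forces tier and lane via routing_override."""
    if lane not in ("smart", "turbo"):
        raise ValueError(f"unknown lane: {lane!r}")
    override = {"force_tier": normalize_tier(tier), "force_lane": lane}
    return {**body, "routing_override": override}
```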

xantly orchestration block

| Field | Type | Values / range | Effective default | Backend status | Usage guidance |
|---|---|---|---|---|---|
| workflow_type | string | single_turn, execution_task, multi_step_conversational, long_horizon_autonomous, voice_simple, voice_complex, creative | auto-classified | Active | Explicitly sets the workflow class when recognized. |
| chain_id | string | UUID recommended | null | Active | Signals chain continuation; an invalid UUID is ignored by the classifier. |
| conversation_id | string | free-form id | null | Active | Used for memory/sticky context continuity. |
| planning_mode | string | preact or planact | tenant/default heuristic | Active | Controls planner style where the planning layer is active. |
| max_chain_steps | integer (u16) | 1..65535 | workflow-dependent | Active (conditional) | Applies when workflow is long_horizon_autonomous. |
| chain_timeout_secs | integer (u32) | 0..4294967295 | workflow-dependent | Active (conditional) | Applies when workflow is long_horizon_autonomous. |
| chain_routing | string | e.g. sticky / mixed | null | Reserved | Accepted in the payload; rely on routing_hints.chain_routing for current behavior. |
| reliability_level | string | standard, high, critical | standard | Active | Influences reliability/verification activation. |
| enable_memory | boolean | true/false | true | Active | Controls persistence into L1/L2 memory in this handler path. |
| enable_speculation | boolean | true/false | router/tenant default | Active | Per-request override for the speculation toggle. |
| enable_hedging | boolean | true/false | router/tenant default | Active | Per-request override for the hedging toggle. |
| enable_cache | boolean | true/false | true | Active | Enables/disables the cache lookup path for this request. |
| cache_ttl_secs | integer | positive integer | null | Reserved | Accepted but not currently used to override cache TTL in this handler. |
| output_verification | string | none, native, schema, cross_model | strategy auto-selection | Active | Per-request override for the output verification strategy. |
| compress_context | boolean | true/false | null | Reserved | Accepted for forward compatibility. |
| redact_pii | boolean | true/false | false | Active | Sets the request redaction signal (x-redact-pii=true metadata). |
| voice_mode | string | typically "true" for voice path | null | Active | Enables voice-oriented handling when set. |
| enable_tool_reranking | boolean | true/false | null | Reserved | Accepted for forward compatibility. |
| intelligence_mode | string | proxy, cache, full | full (system default) | Active | Pipeline intelligence preset. See Intelligence Modes. |
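
Some of these fields only matter in combination: max_chain_steps and chain_timeout_secs apply only when the workflow resolves to long_horizon_autonomous, and a chain_id that is not a valid UUID is ignored by the classifier. The sketch below sets the workflow explicitly so the limits take effect, and validates chain_id and the integer ranges locally. It assumes the block is sent under a top-level "xantly" key, as the section heading suggests; `long_horizon_body` is a hypothetical helper.

```python
import uuid
from typing import Any, Optional


# Hypothetical helper; assumes the orchestration block lives under a top-level "xantly" key.
def long_horizon_body(
    messages: list[dict[str, Any]],
    max_chain_steps: int,
    chain_timeout_secs: int,
    chain_id: Optional[str] = None,
) -> dict[str, Any]:
    """Build a request whose chain limits will apply (workflow set explicitly)."""
    if not 1 <= max_chain_steps <= 65535:  # u16, documented range 1..65535
        raise ValueError("max_chain_steps must be in 1..65535")
    if not 0 <= chain_timeout_secs <= 4294967295:  # u32
        raise ValueError("chain_timeout_secs must be in 0..4294967295")
    # Reuse the caller's chain_id to continue a chain; uuid.UUID raises on invalid input,
    # so a malformed id fails here instead of being silently ignored by the classifier.
    chain = str(uuid.UUID(chain_id)) if chain_id else str(uuid.uuid4())
    return {
        "model": "auto",
        "messages": messages,
        "xantly": {
            "workflow_type": "long_horizon_autonomous",
            "chain_id": chain,
            "max_chain_steps": max_chain_steps,
            "chain_timeout_secs": chain_timeout_secs,
            "reliability_level": "high",
        },
    }
```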

Legacy request headers (still supported)

The gateway maps selected request headers into metadata for backward compatibility.

| Header | Mapped metadata key |
|---|---|
| x-xantly-workflow | x-xantly-workflow |
| x-xantly-voice | x-xantly-voice |
| x-xantly-planning-mode | x-xantly-planning-mode |
| x-xantly-preference | x-xantly-preference |
| x-xantly-chain-routing | x-xantly-chain-routing |
| x-xantly-lane | x-xantly-lane |
| x-xantly-tier | x-xantly-tier |
| x-xantly-run-id | run_id |
| x-xantly-conversation-id | conversation_id |
| x-intelligence-mode | x-intelligence-mode |
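
Most of the mapping is a straight copy; only x-xantly-run-id and x-xantly-conversation-id change key names. A sketch of the documented mapping as a lookup table, useful for reproducing the gateway's view of legacy headers in your own logs. `legacy_headers_to_metadata` is a hypothetical name; header names are matched case-insensitively, as HTTP requires.

```python
from typing import Mapping

# Documented legacy header -> metadata key mapping.
LEGACY_HEADER_TO_METADATA = {
    "x-xantly-workflow": "x-xantly-workflow",
    "x-xantly-voice": "x-xantly-voice",
    "x-xantly-planning-mode": "x-xantly-planning-mode",
    "x-xantly-preference": "x-xantly-preference",
    "x-xantly-chain-routing": "x-xantly-chain-routing",
    "x-xantly-lane": "x-xantly-lane",
    "x-xantly-tier": "x-xantly-tier",
    "x-xantly-run-id": "run_id",
    "x-xantly-conversation-id": "conversation_id",
    "x-intelligence-mode": "x-intelligence-mode",
}


# Hypothetical helper mirroring the table; unrecognized headers are dropped.
def legacy_headers_to_metadata(headers: Mapping[str, str]) -> dict[str, str]:
    """Translate recognized legacy headers into metadata keys; ignore all others."""
    metadata: dict[str, str] = {}
    for name, value in headers.items():
        key = LEGACY_HEADER_TO_METADATA.get(name.lower())
        if key is not None:
            metadata[key] = value
    return metadata
```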

Response body

Non-stream (200 OK)

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1741400000,
  "model": "deepseek-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 18,
    "total_tokens": 49
  },
  "xantly_metadata": {
    "request_id": "req_01abc...",
    "routing_decision": "Lane: Turbo (Source: BarpRouter)",
    "provider": "deepseek",
    "tier": "T2",
    "provider_tier": "T2",
    "latency_ms": 142,
    "requested_model": "auto",
    "decision_source": "BarpRouter",
    "task_family": "analysis",
    "cost_usd": 0.00032,
    "baseline_cost_usd": 0.00120,
    "savings_usd": 0.00088,
    "savings_pct": 73.3,
    "cost_attribution": "xantly",
    "estimated_cost_usd": 0.00032,
    "healer_report": null
  }
}
```
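
Because the body is a superset of the OpenAI shape, OpenAI-style parsing works unchanged and xantly_metadata can be read alongside it. A sketch using a trimmed copy of the sample response; optional fields are read with .get() because, per the xantly_metadata field table, they may be absent. `summarize` is a hypothetical helper.

```python
import json
from typing import Any

# Trimmed copy of the sample response shown above.
SAMPLE = json.loads("""
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "deepseek-chat",
  "choices": [{"index": 0, "message": {"role": "assistant", "content": "..."}, "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 31, "completion_tokens": 18, "total_tokens": 49},
  "xantly_metadata": {"request_id": "req_01abc...", "provider": "deepseek", "tier": "T2",
                      "latency_ms": 142, "requested_model": "auto", "cost_attribution": "xantly",
                      "cost_usd": 0.00032}
}
""")


# Hypothetical helper: pulls out the answer plus routing/cost facts worth logging.
def summarize(response: dict[str, Any]) -> dict[str, Any]:
    meta = response.get("xantly_metadata", {})
    first = response["choices"][0]
    return {
        "content": first["message"]["content"],
        "finish_reason": first["finish_reason"],
        "served_model": response["model"],
        "provider": meta.get("provider"),
        "tier": meta.get("tier"),
        "request_id": meta.get("request_id"),
        # Optional fields: None when the gateway did not include them.
        "cost_usd": meta.get("cost_usd"),
        "decision_source": meta.get("decision_source"),
    }
```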

xantly_metadata fields

| Field | Type | Always present | Description |
|---|---|---|---|
| request_id | string | Yes | Correlation ID for support and audit. |
| routing_decision | string | Yes | Human-readable routing summary (lane + source). |
| provider | string | Yes | Provider that served the request (e.g. "deepseek", "anthropic"). |
| tier | string | Yes | Effective execution tier ("T1", "T2", "T3"). |
| provider_tier | string? | No | Provider-level tier when it differs from the effective tier. |
| latency_ms | integer | Yes | End-to-end gateway latency in milliseconds. |
| requested_model | string | Yes | The model string you sent (e.g. "auto", "gpt-4o"). Useful for debugging routing decisions when using "auto". |
| decision_source | string? | No | Machine-readable routing engine source (e.g. "BarpRouter", "Pinned", "CacheHit"). |
| task_family | string? | No | Task category detected by the smart router (e.g. "code", "writing", "analysis"). |
| cost_usd | number? | No | Actual cost of this request in USD. Present when usage data is available. |
| baseline_cost_usd | number? | No | GPT-4o reference cost for the same token counts (input $2.50/M tokens, output $10/M tokens). Lets you compute savings without a separate analytics call. |
| savings_usd | number? | No | baseline_cost_usd - cost_usd. A positive value means Xantly saved money vs. GPT-4o. |
| savings_pct | number? | No | Savings as a percentage of baseline (0–100+). |
| cost_attribution | string | Yes | "xantly" when routed via Xantly-managed provider keys; "byok" when routed via your own API key. |
| estimated_cost_usd | number? | No | Legacy alias for cost_usd, kept for backward compatibility. |
| healer_report | object? | No | Present when JSON repair was applied. Contains original, healed, stage, confidence, healing_time_ms. |
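
The cost fields are related by simple arithmetic, which makes it easy to cross-check them or aggregate savings across many requests. A sketch assuming the GPT-4o reference prices stated for baseline_cost_usd ($2.50 per million input tokens, $10 per million output tokens); the function names are hypothetical, and the gateway's own figures remain authoritative.

```python
# Reference prices stated for baseline_cost_usd (USD per million tokens).
GPT4O_INPUT_USD_PER_M = 2.50
GPT4O_OUTPUT_USD_PER_M = 10.00


# Hypothetical helper for cross-checking baseline_cost_usd.
def baseline_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """GPT-4o reference cost for the given token counts."""
    return (
        prompt_tokens * GPT4O_INPUT_USD_PER_M
        + completion_tokens * GPT4O_OUTPUT_USD_PER_M
    ) / 1_000_000


# Hypothetical helper mirroring savings_usd and savings_pct.
def savings(baseline_usd: float, cost_usd: float) -> tuple[float, float]:
    """Return (savings_usd, savings_pct) as defined in the table above."""
    saved = baseline_usd - cost_usd
    pct = 100.0 * saved / baseline_usd if baseline_usd > 0 else 0.0
    return saved, pct
```

For the sample response, `savings(0.00120, 0.00032)` gives about (0.00088, 73.3), matching its savings_usd and savings_pct.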

Notes

  • choices[].logprobs is part of schema, but may be omitted in normalized API responses.
  • For n > 1, if the upstream provider returns fewer choices than requested, the gateway may replicate the first choice to reach the requested count.
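
Because of that replication, n > 1 does not guarantee distinct samples. If your application depends on diversity (for example, best-of-n selection), a client-side check such as this hypothetical helper can flag choices that repeat the first one. Identical text can also occur legitimately, for example at temperature 0, so treat a match as a signal rather than proof of replication.

```python
from typing import Any


# Hypothetical helper: a heuristic, since identical outputs can also be genuine.
def replicated_choice_indexes(response: dict[str, Any]) -> list[int]:
    """Return indexes of choices whose message content is identical to choice 0."""
    choices = response.get("choices", [])
    if len(choices) < 2:
        return []
    first = choices[0]["message"].get("content")
    return [
        choice["index"]
        for choice in choices[1:]
        if choice["message"].get("content") == first
    ]
```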

Stream (text/event-stream)

  • All stream responses emit chat.completion.chunk events and terminate with data: [DONE].
  • Non-voice streaming currently emits a compact sequence (role chunk + content chunk + [DONE]) instead of token-by-token chunking.
  • stream_options.include_usage is forwarded to providers where supported; do not assume a terminal usage chunk is always present.
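
A consumer written for token-by-token streams also handles the compact sequence, as long as it concatenates delta.content across chunks and stops at [DONE] rather than expecting a particular number of chunks. A minimal sketch over already-received SSE lines (transport, reconnection, and multi-line data: events are omitted); it assumes the OpenAI chunk shape choices[].delta, treats a usage chunk as optional, and `collect_stream` is a hypothetical name.

```python
import json
from typing import Any, Iterable, Optional


# Hypothetical helper; assumes each event carries its JSON on a single "data:" line.
def collect_stream(lines: Iterable[str]) -> tuple[str, Optional[dict[str, Any]]]:
    """Accumulate streamed content; return (text, usage or None)."""
    parts: list[str] = []
    usage: Optional[dict[str, Any]] = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]  # present only when the provider honors include_usage
    return "".join(parts), usage
```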

Response headers

Common observability headers

| Header | Present when | Description |
|---|---|---|
| x-xantly-request-id | standard non-stream path and semantic-cache responses | Request correlation id. |
| x-xantly-cache-hit | standard non-stream path and cache responses | true or false. |
| x-xantly-tier-used | non-stream and exact-cache responses | Effective execution tier. |
| x-xantly-lane-used | non-stream and exact-cache responses | Effective lane (smart/turbo). |
| x-xantly-provider | non-stream and exact-cache responses | Provider/model source label. |
| x-xantly-speculation-accepted | non-stream | Currently emitted as a placeholder value of 0. |
| x-xantly-latency-breakdown | non-stream | JSON string with stage-level latency metrics. |
| x-xantly-audit-id | non-stream | Correlation id for the audit trail. |

Cache-path headers

| Header | Value |
|---|---|
| x-xantly-cache-type | exact or semantic |
| x-xantly-latency-ms | end-to-end latency (cache path) |
| x-xantly-semantic-similarity | similarity score (semantic cache only) |

Usage / cost headers

These are included when usage is available:

  • x-xantly-input-tokens
  • x-xantly-output-tokens
  • x-xantly-cost-usd
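
Since each of these headers appears only on some paths (non-stream, exact cache, semantic cache, usage available), observability code should treat every one as optional. A sketch that gathers them into one record; the returned field names are hypothetical, the numeric headers are assumed to be plain decimal strings, and because the keys inside x-xantly-latency-breakdown are not documented here, that JSON is decoded and passed through unchanged.

```python
import json
from typing import Any, Mapping, Optional


def _to_float(value: Optional[str]) -> Optional[float]:
    return float(value) if value not in (None, "") else None


# Hypothetical helper; every documented header is optional, so missing ones become None.
def parse_xantly_headers(headers: Mapping[str, str]) -> dict[str, Any]:
    """Collect the documented x-xantly-* response headers into one record."""
    h = {name.lower(): value for name, value in headers.items()}
    breakdown_raw = h.get("x-xantly-latency-breakdown")
    cache_hit = h.get("x-xantly-cache-hit")
    return {
        "request_id": h.get("x-xantly-request-id"),
        "cache_hit": None if cache_hit is None else cache_hit.lower() == "true",
        "cache_type": h.get("x-xantly-cache-type"),  # "exact" or "semantic"
        "tier": h.get("x-xantly-tier-used"),
        "lane": h.get("x-xantly-lane-used"),
        "provider": h.get("x-xantly-provider"),
        "latency_ms": _to_float(h.get("x-xantly-latency-ms")),
        "latency_breakdown": json.loads(breakdown_raw) if breakdown_raw else None,
        "semantic_similarity": _to_float(h.get("x-xantly-semantic-similarity")),
        # Assumption: token counts are integer strings, cost is a decimal string.
        "input_tokens": int(h["x-xantly-input-tokens"]) if "x-xantly-input-tokens" in h else None,
        "output_tokens": int(h["x-xantly-output-tokens"]) if "x-xantly-output-tokens" in h else None,
        "cost_usd": _to_float(h.get("x-xantly-cost-usd")),
    }
```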

Errors

Error payloads use the OpenAI-style shape:

```json
{
  "error": {
    "message": "temperature (2.5) must be between 0 and 2",
    "type": "invalid_request_error",
    "code": "validation_error",
    "param": null
  }
}
```

| HTTP | error.type | error.code (examples) | Typical trigger |
|---|---|---|---|
| 400 | invalid_request_error | validation_error | Invalid parameter range, unknown model, empty messages. |
| 401 | authentication_error | invalid_api_key, expired_api_key, revoked_api_key | Missing/invalid API key. |
| 403 | authorization_error | forbidden, insufficient_scopes, tenant_violation | Scope/policy denial. |
| 404 | not_found_error or governance_error | resource_not_found, tool_not_registered | Missing referenced resource or unregistered tool. |
| 422 | governance_error | tool_call_blocked | Governance blocked a tool call. |
| 429 | rate_limit_error | rate_limit_exceeded, upsteam_rate_limit | Tenant or upstream rate limiting. |
| 502 | upstream_error | provider_error | Upstream provider error. |
| 503 | governance_error | circuit_breaker_open | Circuit breaker open. |
| 504 | upstream_error | provider_timeout | Upstream timeout. |
| 500 | internal_error | internal_error | Internal platform failure. |

Error headers

  • x-error-id is returned on error responses for support correlation.
  • retry-after is returned on eligible rate-limit responses.
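
Of the statuses above, 429, 502, 503, and 504 describe conditions that may clear on their own, while validation, auth, and governance errors will fail again unchanged. A sketch of a retry-delay policy under that assumption: it honors retry-after when present (assuming the delta-seconds form), otherwise backs off exponentially with full jitter. `retry_delay_secs`, the retryable set, and the constants are illustrative choices, not gateway requirements; whether to retry 500 is left to your own policy.

```python
import random
from typing import Mapping, Optional

# Illustrative classification; not prescribed by the gateway.
RETRYABLE_STATUSES = {429, 502, 503, 504}


# Hypothetical helper. attempt = number of retries already made (0-based).
def retry_delay_secs(
    status: int,
    headers: Mapping[str, str],
    attempt: int,
    max_attempts: int = 4,
    base: float = 0.5,
    cap: float = 20.0,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if the request should not be retried."""
    if status not in RETRYABLE_STATUSES or attempt >= max_attempts:
        return None
    h = {name.lower(): value for name, value in headers.items()}
    retry_after = h.get("retry-after")
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())  # server-provided delay (delta-seconds form)
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

When giving up, keep the x-error-id header value from the last response so support can correlate the failure.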

Edge cases and implementation notes

  1. Model validation is catalog-aware

    • Model validation is strict when the model catalog is loaded.
    • During catalog cold-start/empty states, strict rejection may be relaxed.
  2. top_logprobs validation

    • The server enforces top_logprobs <= 20.
    • Provider-specific logprobs behavior may vary.
  3. Chain limit overrides are conditional

    • xantly.max_chain_steps and xantly.chain_timeout_secs apply only when workflow resolves to long_horizon_autonomous.
  4. Streaming semantics

    • Non-voice stream mode is SSE-compatible but not guaranteed token-by-token.
  5. Reserved fields

    • Some accepted proprietary fields are currently advisory/reserved for compatibility (force_provider, cache_ttl_secs, compress_context, enable_tool_reranking, and certain routing_hints fields).

Practical rollout checklist

  1. Start with model: "auto" + standard fields.
  2. Enable response_format for structured outputs.
  3. Add one routing/orchestration control at a time.
  4. Track x-xantly-* response headers in observability.
  5. Add override fields only for test/debug traffic.

See also

  • Benchmark Results — Every parameter on this page is individually validated. See the full 252-test scorecard including boundary and invalid-input cases.
  • Streaming Responses — Complete guide to SSE format, chunk handling, and SDK examples.
  • Rate Limits — RPM and TPM limits that apply to this endpoint.
  • Billing & Quotas — Token quotas, budget caps, and cost visibility via xantly_metadata.
