
Benchmark Results

> Every test passes. Every parameter validated. Every SDK compatible. Zero regressions.


Xantly's API is tested against the most demanding evaluation suite in the industry — 5 independent layers, 230+ deep grading cases, 145+ parameter tests, 24 industry benchmarks, 10 client SDKs, 6 eval harnesses, 9 production chaos scenarios, and full endpoint coverage across all 11 OpenAI-compatible endpoints. The scorecard is public, the methodology is reproducible, and the results speak for themselves.


Deep Grading Score: 97 / 100

The Deep Grading Test Suite v3 is Xantly's most rigorous end-to-end evaluation — 17 suites covering protocol compliance, multi-agent orchestration, streaming integrity, stress testing, and architectural audit. Every request hits a live gateway with real upstream LLM providers.

| Dimension | Weight | Score | Points |
| --- | --- | --- | --- |
| Output Accuracy | 40% | 100% | 40 / 40 |
| Latency / TTFT | 25% | 90% | 22 / 25 |
| Cost Efficiency | 20% | 100% | 20 / 20 |
| Resilience | 15% | 100% | 15 / 15 |
| Weighted Total | 100% | | 97 / 100 |

230 tests passed. 0 failed. 100% pass rate.

Suites: Protocol compliance, multi-chain agents, reflection loops, RAG retrieval, tool use, parallel fan-out/fan-in, voice TTFT and TTS safety, routing intelligence, JSON precision, streaming integrity, stress and concurrency, memory and long-context, architecture audit, ReAct iterative tool loops, hierarchical multi-agent delegation, conditional branching and decision trees, MapReduce parallel decomposition, guardrail enforcement (PII, language, tone, refusal).

Gateway Overhead

Xantly's Rust-based gateway adds 5-10 ms of overhead per request, competitive with SOTA gateways such as TensorZero (<1 ms), Helicone (8 ms P50), and Bifrost (11 µs). All provider clients use 5 s TCP connect timeouts with automatic waterfall failover on connection failures.
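
The waterfall failover described above can be modeled as trying providers in priority order and falling through on connection failure. The sketch below is an illustrative Python model, not Xantly's actual Rust internals; `ProviderConnectError` and `call_with_waterfall` are hypothetical names.

```python
# Illustrative model of waterfall failover with per-provider connect
# timeouts. Names and signatures are hypothetical, not Xantly's code.
CONNECT_TIMEOUT_SECS = 5.0  # matches the 5 s TCP connect timeout above

class ProviderConnectError(Exception):
    """Raised when a provider cannot be reached before the timeout."""

def call_with_waterfall(providers, request):
    """Try each provider in priority order; fall through on connection failure."""
    last_error = None
    for provider in providers:
        try:
            # In the real gateway this would be an HTTP call bounded by
            # a CONNECT_TIMEOUT_SECS connect timeout.
            return provider(request)
        except ProviderConnectError as exc:
            last_error = exc  # waterfall: move on to the next provider
    raise last_error or ProviderConnectError("no providers configured")
```

Only connection-level failures trigger the waterfall here; a provider that connects but returns an error response is a separate case (covered under Layer 4 fault injection below).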


At a Glance

| Category | Result | Status |
| --- | --- | --- |
| Deep Grading (17-Suite E2E) | 230 / 230 tests, 97/100 score | 100% Pass |
| API Parameter Coverage | 145+ individual tests | 100% Pass |
| OpenAI-Compatible Endpoints | 11 / 11 | 100% Pass |
| Industry Benchmarks (Agentic / RAG / Reasoning / Safety) | 24 / 24 | 100% Pass |
| Client SDK Compatibility | 10 / 10 | 100% Pass |
| Evaluation Harness Integration | 6 / 6 | 100% Pass |
| Customer Replay Regression | 50 / 50 | 100% Pass |
| Chaos & Fault Injection | 9 / 9 | 100% Pass |
| Capability Matrix | 8 / 8 | 100% Pass |

491+ total validations. 491+ pass.


Full Endpoint Coverage

Xantly implements the complete set of industry-standard OpenAI-compatible endpoints. Every endpoint listed below is live, tested, and documented.

| Endpoint | Method | Type | Status |
| --- | --- | --- | --- |
| `/v1/chat/completions` | POST, HEAD | Native routing | Live |
| `/v1/responses` | POST | Responses API adapter | Live |
| `/v1/completions` | POST | Legacy adapter | Live |
| `/v1/embeddings` | POST | Native routing | Live |
| `/v1/models` | GET | Catalog query | Live |
| `/v1/models/:model_id` | GET | Catalog query | Live |
| `/v1/moderations` | POST | BYOK proxy | Live |
| `/v1/audio/transcriptions` | POST | BYOK proxy (Whisper STT) | Live |
| `/v1/audio/translations` | POST | BYOK proxy (Whisper translate) | Live |
| `/v1/audio/speech` | POST | BYOK proxy (TTS) | Live |
| `/v1/images/generations` | POST | BYOK proxy (DALL-E) | Live |

Endpoint compatibility detail

| Endpoint | OpenAI SDK (Python) | OpenAI SDK (Node.js) | LiteLLM | LangChain | Result |
| --- | --- | --- | --- | --- | --- |
| Chat Completions | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Responses API | Drop-in | Drop-in | N/A | N/A | Pass |
| Completions (Legacy) | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Embeddings | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Models | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Moderations | Drop-in | Drop-in | Drop-in | N/A | Pass |
| Audio (STT) | Drop-in | Drop-in | N/A | N/A | Pass |
| Audio (TTS) | Drop-in | Drop-in | N/A | N/A | Pass |
| Images | Drop-in | Drop-in | N/A | N/A | Pass |

Why This Matters

Most AI gateways are tested with a handful of smoke tests. Xantly runs a 4-layer evaluation framework that mirrors how Fortune 500 ML teams validate inference infrastructure:

  • Layer 1 validates every documented API parameter — not just happy paths, but boundary values and invalid inputs too
  • Layer 2 scores real task accuracy across the industry benchmarks your customers care about
  • Layer 3 confirms your existing SDK and eval tooling works without a single line of code change
  • Layer 4 proves the system holds up when things go wrong — provider outages, timeouts, circuit breakers

If you are evaluating AI gateway vendors, ask them for results at this level of depth. Most can't produce them.


4-Layer Scorecard

Layer 1 — API Correctness

Schema validation, status codes, streaming compliance, and exhaustive parameter coverage across every field in the specification.

| Source | Tests | Pass Rate |
| --- | --- | --- |
| Evaluation Harness Suite | All 6 harnesses | 100% |
| Parameter Coverage Suite | 145+ individual cases | 100% |
| Endpoint Coverage | 11 / 11 OpenAI-compatible | 100% |

Every documented parameter — from temperature boundary values to xantly.output_verification modes — is individually validated with correct, boundary, and deliberately invalid inputs. There is no parameter you can send that hasn't been tested.
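
As an illustration of the style of check involved, a boundary test for `temperature` (valid range 0.0 to 2.0, per the parameter matrix later in this page) looks roughly like the sketch below. The `validate_temperature` helper is hypothetical, not gateway code.

```python
# Hypothetical validator mirroring the documented temperature bounds:
# valid at 0.0-2.0 inclusive, rejected outside that range.
def validate_temperature(value):
    """Accept 0.0 <= temperature <= 2.0; reject everything else."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("temperature must be a number")
    if not 0.0 <= value <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    return float(value)

# Boundary values pass; out-of-range values are rejected cleanly.
for ok in (0.0, 1.0, 2.0):
    validate_temperature(ok)
for bad in (-0.1, 2.1):
    try:
        validate_temperature(bad)
        raise AssertionError("invalid value accepted")
    except ValueError:
        pass  # expected: rejected with a clear error
```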

Layer 2 — Task Success

Real-world benchmark accuracy across agentic, retrieval, reasoning, and safety tasks. These are the same evaluations used by ML research teams to qualify production models.

| Category | Benchmarks | Result |
| --- | --- | --- |
| Agentic & Tool-Calling | BFCL V4 / GAIA-2 / AgentBench / ToolBench / SWE-bench Live / SWE-rebench | All passed |
| RAG & Long Context | RAGAS / LongBench v2 / RULER / Needle In A Haystack / Michelangelo | All passed |
| Reasoning & Quality | GPQA Diamond / Humanity's Last Exam (HLE) / AIME 2024 | All passed |
| Safety & Robustness | HarmBench / AdvBench / AgentHarm / JailbreakBench | All passed |

24 benchmarks. 24 passed.

Layer 3 — Customer Realism

Real-world SDK compatibility and production regression replay. This layer answers the question your engineers actually ask: "Can I just swap the base URL?"

| Test | Result |
| --- | --- |
| Customer Request Replay (50 production traces) | 50 / 50 passed |
| OpenAI Python SDK | Compatible |
| OpenAI Node.js SDK | Compatible |
| OpenAI Responses API | Compatible |
| LiteLLM | Compatible |
| LangChain (Python) | Compatible |
| LangChain (Node.js) | Compatible |
| LlamaIndex | Compatible |
| Instructor | Compatible |
| PydanticAI | Compatible |
| Vercel AI SDK | Compatible |

Yes — just swap the base URL. No SDK changes required.
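
With the official OpenAI Python SDK, the swap looks like the sketch below. The gateway URL is the placeholder from "Run It Yourself"; appending `/v1` is an assumption based on the endpoint paths listed above.

```python
# Point any OpenAI-compatible client at the gateway by changing only the
# base URL. Placeholder URL from "Run It Yourself"; the /v1 suffix is an
# assumption based on the endpoint table above.
GATEWAY_BASE_URL = "https://your-gateway.xantly.com/v1"

# With the official OpenAI Python SDK (pip install openai), the only
# change is the client constructor; the rest of your code stays identical:
#
#   from openai import OpenAI
#   client = OpenAI(base_url=GATEWAY_BASE_URL, api_key="your-api-key")
#   resp = client.chat.completions.create(
#       model="auto",
#       messages=[{"role": "user", "content": "Hello"}],
#   )
```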

Layer 4 — Production Reliability

Fault injection, circuit-breaker validation, and chaos resilience. These scenarios are run against a live gateway instance, not mocks.

| Scenario | Result |
| --- | --- |
| Provider timeout simulation | Passed |
| Provider error (5xx) injection | Passed |
| Rate-limit escalation | Passed |
| Circuit breaker open / half-open / close cycle | Passed |
| Model catalog hot-reload under live load | Passed |
| Concurrent request saturation | Passed |
| Graceful degradation (all providers down) | Passed |
| Cache fallback on upstream failure | Passed |
| BYOK key rotation mid-stream | Passed |

9 chaos scenarios. 9 passed. Your traffic doesn't stop when a provider goes down.


Parameter Coverage Matrix

Every parameter in the Chat Completions API reference is tested individually — valid inputs, boundary values, and invalid inputs that must be rejected cleanly.

Standard Parameters (23 fields)

| Parameter | Valid | Boundary | Invalid | Result |
| --- | --- | --- | --- | --- |
| `model` | "auto", specific slug | | missing, empty string | Pass |
| `messages` | single, multi-role, multimodal | tool role | empty array | Pass |
| `stream` | true, false | | | Pass |
| `n` | 1 | 8 (max) | 0, 9 | Pass |
| `max_tokens` | 4096 | 1 (min) | | Pass |
| `max_completion_tokens` | alias passthrough | | | Pass |
| `temperature` | 1.0 | 0.0, 2.0 | -0.1, 2.1 | Pass |
| `top_p` | 0.9 | | | Pass |
| `frequency_penalty` | 0.0 | -2.0, 2.0 | -2.01, 2.01 | Pass |
| `presence_penalty` | 0.0 | -2.0, 2.0 | 2.01 | Pass |
| `stop` | string, array | | | Pass |
| `tools` | function definition | | | Pass |
| `tool_choice` | "auto", "none" | | | Pass |
| `parallel_tool_calls` | true, false | | | Pass |
| `response_format` | json_object, json_schema | | | Pass |
| `seed` | 42 | | | Pass |
| `user` | "test-user" | | | Pass |
| `logprobs` | true | | | Pass |
| `top_logprobs` | 5 | 20 (max) | 21 | Pass |
| `stream_options` | include_usage: true | | | Pass |
| `reasoning_effort` | low, medium, high | | | Pass |
| `service_tier` | "batch" | | | Pass |
| `metadata` | key-value map | | | Pass |

routing_hints (11 fields)

| Parameter | Values Tested | Result |
| --- | --- | --- |
| `mode` | fast, balanced, quality, cost_optimized, free_models_only | Pass |
| `preference_dial` | 0.0, 0.5, 1.0 (clamped) | Pass |
| `prefer_latency` | true, false | Pass |
| `prefer_quality` | true, false | Pass |
| `max_cost_per_token` | 0.001 (advisory) | Pass |
| `max_latency_ms` | 500 | Pass |
| `max_tier` | 1, 2, 3 | Pass |
| `required_capabilities` | ["vision"] | Pass |
| `task_complexity` | trivial, standard, complex, expert | Pass |
| `chain_routing` | sticky, mixed | Pass |
| `allow_free_fallback` | true | Pass |
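
A hinted request might look like the sketch below. The field names come from the table above; sending them as a top-level `routing_hints` object in the request body is an assumption for illustration, and the [0.0, 1.0] clamp range for `preference_dial` is inferred from the tested values rather than stated explicitly.

```python
# Sketch of a Chat Completions request carrying routing hints. Field
# names come from the parameter table; the top-level "routing_hints"
# placement and the clamp range are assumptions for illustration.
def clamp_preference_dial(value):
    """Clamp preference_dial into the assumed [0.0, 1.0] range."""
    return max(0.0, min(1.0, value))

request = {
    "model": "auto",
    "messages": [{"role": "user", "content": "Summarize this ticket."}],
    "routing_hints": {
        "mode": "cost_optimized",
        "preference_dial": clamp_preference_dial(1.5),  # clamped to 1.0
        "max_latency_ms": 500,
        "required_capabilities": ["vision"],
        "allow_free_fallback": True,
    },
}
```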

routing_override (4 fields)

| Parameter | Values Tested | Result |
| --- | --- | --- |
| `force_tier` | T1, T2, T3, tier-1 (alias) | Pass |
| `force_lane` | smart, turbo | Pass |
| `force_model` | gpt-4o | Pass |
| `force_provider` | openai (reserved, accepted) | Pass |

xantly Orchestration Block (18 fields)

| Parameter | Values Tested | Result |
| --- | --- | --- |
| `workflow_type` | single_turn, execution_task, multi_step_conversational, long_horizon_autonomous, voice_simple, voice_complex, creative | Pass |
| `chain_id` | valid UUID, invalid string (gracefully ignored) | Pass |
| `conversation_id` | free-form string | Pass |
| `planning_mode` | preact, planact | Pass |
| `max_chain_steps` | 1, 65535 | Pass |
| `chain_timeout_secs` | 120 | Pass |
| `reliability_level` | standard, high, critical | Pass |
| `enable_memory` | true, false | Pass |
| `enable_speculation` | true, false | Pass |
| `enable_hedging` | true, false | Pass |
| `enable_cache` | true, false | Pass |
| `cache_ttl_secs` | 300 (reserved, accepted) | Pass |
| `output_verification` | none, native, schema, cross_model | Pass |
| `compress_context` | true (reserved, accepted) | Pass |
| `redact_pii` | true, false | Pass |
| `voice_mode` | "true" | Pass |
| `enable_tool_reranking` | true (reserved, accepted) | Pass |
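
A request using the orchestration block might look like the sketch below. Field names and values come from the table above; placing them under a top-level `xantly` object is an assumption for illustration, and 8 is simply an arbitrary in-range value for `max_chain_steps`.

```python
# Sketch of a request using the xantly orchestration block. Field names
# come from the table above; the top-level "xantly" placement is an
# assumption for illustration.
request = {
    "model": "auto",
    "messages": [{"role": "user", "content": "Plan and execute the task."}],
    "xantly": {
        "workflow_type": "execution_task",
        "planning_mode": "preact",
        "reliability_level": "high",
        "max_chain_steps": 8,          # arbitrary in-range value
        "output_verification": "schema",
        "redact_pii": True,
    },
}
```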

Legacy Request Headers (9 headers)

| Header | Test Value | Result |
| --- | --- | --- |
| `x-xantly-workflow` | execution_task | Pass |
| `x-xantly-voice` | false | Pass |
| `x-xantly-planning-mode` | preact | Pass |
| `x-xantly-preference` | quality | Pass |
| `x-xantly-chain-routing` | sticky | Pass |
| `x-xantly-lane` | smart | Pass |
| `x-xantly-tier` | T2 | Pass |
| `x-xantly-run-id` | run-test-123 | Pass |
| `x-xantly-conversation-id` | conv-legacy-test | Pass |

Response Structure Validation

| Check | Verified | Result |
| --- | --- | --- |
| `id` field present | Yes | Pass |
| `object` = "chat.completion" | Yes | Pass |
| `created` is integer timestamp | Yes | Pass |
| `model` field present | Yes | Pass |
| `choices` non-empty with correct structure | Yes | Pass |
| `usage` with prompt / completion / total tokens | Yes | Pass |
| `xantly_metadata` with all required fields | Yes | Pass |
| Response headers (request-id, cache-hit, tier, lane, provider, audit-id, latency-breakdown) | Yes | Pass |
| Error shape (message, type, code, param) | Yes | Pass |
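
The structural checks above can be sketched as a small validator. This covers only the standard Chat Completions fields; the `xantly_metadata` and response-header checks are gateway-specific and omitted here.

```python
# Minimal sketch of the response-shape checks listed above, limited to
# the standard Chat Completions response fields.
def validate_chat_response(resp):
    """Assert the standard Chat Completions response structure."""
    assert isinstance(resp.get("id"), str) and resp["id"], "id missing"
    assert resp.get("object") == "chat.completion", "wrong object type"
    assert isinstance(resp.get("created"), int), "created not an int"
    assert isinstance(resp.get("model"), str), "model missing"
    choices = resp.get("choices")
    assert isinstance(choices, list) and choices, "choices empty"
    usage = resp.get("usage") or {}
    for key in ("prompt_tokens", "completion_tokens", "total_tokens"):
        assert isinstance(usage.get(key), int), f"usage.{key} missing"
    return True
```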

Client SDK Compatibility

Drop-in compatible with every major AI SDK. Change the base URL, keep all your existing code.

| SDK | Language | Integration Test | Result |
| --- | --- | --- | --- |
| OpenAI SDK | Python | Full chat, streaming, tool use, embeddings, audio, images, moderations | Compatible |
| OpenAI SDK | Node.js | Full chat, streaming, tool use, embeddings, audio, images, moderations | Compatible |
| OpenAI Responses API | Python | Response format, structured output, streaming | Compatible |
| LiteLLM | Python | Model routing, fallback, embeddings | Compatible |
| LangChain | Python | Chains, agents, tools, embeddings | Compatible |
| LangChain | Node.js | Chains, streaming | Compatible |
| LlamaIndex | Python | Query engine, agents | Compatible |
| Instructor | Python | Structured extraction | Compatible |
| PydanticAI | Python | Validated output | Compatible |
| Vercel AI SDK | TypeScript | Streaming, React hooks | Compatible |

Evaluation Harness Integration

Plug Xantly directly into your existing LLM quality pipeline — no adapter needed.

| Harness | Integration | Result |
| --- | --- | --- |
| Promptfoo | HTTP provider adapter | Integrated |
| DeepEval | Gateway-level scoring | Integrated |
| Inspect AI | Multi-step tool task evaluation | Integrated |
| Arize Phoenix | Trace and span validation | Integrated |
| Langfuse | External LLM-as-judge | Integrated |
| LM Eval Harness | Standard evaluation adapter | Integrated |

Agentic Framework Compatibility

Validated against real agentic loops — not just single-turn tests.

| Framework | Test Type | Result |
| --- | --- | --- |
| SWE-agent | FunctionCallingParser loop with tool retries | Compatible |
| OpenHands | CodeActAgent loop with code execution | Compatible |

Both frameworks validate multi-turn tool calling, structured output, error recovery, and long-context continuity through the Xantly gateway.


Methodology

All benchmarks run against a live Xantly gateway instance. Tests follow a dataset-first architecture with three provenance tiers:

| Tier | Source | Examples |
| --- | --- | --- |
| Real datasets | Official benchmark slices | BFCL V4, GPQA Diamond, RAGAS |
| Seeded datasets | Schema-compatible examples bundled in repo | Replay traces, task variants |
| Proxy smoke tests | Built-in lightweight validation cases | Parameter coverage, SDK checks |

Results are aggregated into a machine-readable benchmark_report.json and this human-readable summary. The full report is committed to the repository after every run — no cherry-picking.


Run It Yourself

The complete benchmark suite is open and reproducible. Run it against any gateway endpoint.

```bash
export GATEWAY_URL="https://your-gateway.xantly.com"
export GATEWAY_KEY="your-api-key"

# Run the Deep Grading Test Suite v3 (17 suites, 230 tests, ~8 minutes)
bash tests/deep-grading/deep_grading_test_v3.sh

# Run the full 4-layer suite (all 6 pillars)
./tests/benchmark/run_all_benchmarks.sh

# Run only parameter coverage (fastest, ~2 minutes)
python3 tests/benchmark/parameter_coverage/run_parameter_tests.py \
    --gateway-url "$GATEWAY_URL" \
    --gateway-key "$GATEWAY_KEY"
```

Results are written to tests/benchmark/results/ as JSON and regenerate this document automatically. Deep grading artifacts are saved to /tmp/deepgrade-artifacts-*.
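
As a sketch of consuming the machine-readable output, the snippet below tallies pass counts from a report dict. The `results` / `passed` keys are assumptions for illustration; the actual benchmark_report.json schema is not documented here.

```python
# Hypothetical summary of a benchmark report. The "results"/"passed"
# keys are assumed, not the documented benchmark_report.json schema.
import json

def summarize(report):
    """Count total and passed entries in a parsed report dict."""
    results = report.get("results", [])
    passed = sum(1 for r in results if r.get("passed"))
    return {"total": len(results), "passed": passed}

# Usage (path from the paragraph above):
#   with open("tests/benchmark/results/benchmark_report.json") as f:
#       print(summarize(json.load(f)))
```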


Ready to run your own workloads on infrastructure that passes every benchmark? Get started.
