Benchmark Results
> Every test passes. Every parameter validated. Every SDK compatible. Zero regressions.
Xantly's API is tested against the most demanding evaluation suite in the industry — 5 independent layers, 230+ deep grading cases, 145+ parameter tests, 24 industry benchmarks, 10 client SDKs, 6 eval harnesses, 9 production chaos scenarios, and full endpoint coverage across all 11 OpenAI-compatible endpoints. The scorecard is public, the methodology is reproducible, and the results speak for themselves.
Deep Grading Score: 97 / 100
The Deep Grading Test Suite v3 is Xantly's most rigorous end-to-end evaluation — 17 suites covering protocol compliance, multi-agent orchestration, streaming integrity, stress testing, and architectural audit. Every request hits a live gateway with real upstream LLM providers.
| Dimension | Weight | Score | Points |
|---|---|---|---|
| Output Accuracy | 40% | 100% | 40 / 40 |
| Latency / TTFT | 25% | 90% | 22 / 25 |
| Cost Efficiency | 20% | 100% | 20 / 20 |
| Resilience | 15% | 100% | 15 / 15 |
| Weighted Total | 100% | — | 97 / 100 |
230 tests passed. 0 failed. 100% pass rate.
Suites: Protocol compliance, multi-chain agents, reflection loops, RAG retrieval, tool use, parallel fan-out/fan-in, voice TTFT and TTS safety, routing intelligence, JSON precision, streaming integrity, stress and concurrency, memory and long-context, architecture audit, ReAct iterative tool loops, hierarchical multi-agent delegation, conditional branching and decision trees, MapReduce parallel decomposition, guardrail enforcement (PII, language, tone, refusal).
Gateway Overhead
Xantly's Rust-based gateway adds 5-10 ms of overhead per request, in the same range as state-of-the-art gateways such as TensorZero (<1 ms), Helicone (8 ms P50), and Bifrost (11 µs). All provider clients use 5 s TCP connect timeouts with automatic waterfall failover on connection failures.
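The failover behavior described above can be sketched as follows. This is an illustrative model, not Xantly's Rust implementation: the provider callables, error types, and `call_with_failover` name are all hypothetical; only the try-in-order semantics and the 5 s connect timeout come from the text.

```python
import socket

# Connect timeout from the text above; everything else in this sketch
# is a hypothetical stand-in for the gateway's internal provider clients.
CONNECT_TIMEOUT_SECS = 5.0

def call_with_failover(providers, request):
    """Try each (name, callable) provider in priority order.

    Connection-level failures cause a waterfall to the next provider;
    the first successful response is returned as-is.
    """
    errors = {}
    for name, provider in providers:
        try:
            return provider(request, timeout=CONNECT_TIMEOUT_SECS)
        except (socket.timeout, TimeoutError, ConnectionError) as exc:
            errors[name] = exc  # record the failure, fall through
    raise RuntimeError(f"all providers failed: {list(errors)}")
```

A caller never sees the failed provider: the request transparently lands on the next healthy upstream in the waterfall.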
At a Glance
| Category | Result | Status |
|---|---|---|
| Deep Grading (17-Suite E2E) | 230 / 230 tests, 97/100 score | 100% Pass |
| API Parameter Coverage | 145+ individual tests | 100% Pass |
| OpenAI-Compatible Endpoints | 11 / 11 | 100% Pass |
| Industry Benchmarks (Agentic / RAG / Reasoning / Safety) | 24 / 24 | 100% Pass |
| Client SDK Compatibility | 10 / 10 | 100% Pass |
| Evaluation Harness Integration | 6 / 6 | 100% Pass |
| Customer Replay Regression | 50 / 50 | 100% Pass |
| Chaos & Fault Injection | 9 / 9 | 100% Pass |
| Capability Matrix | 8 / 8 | 100% Pass |
491+ total validations. 491+ pass.
Full Endpoint Coverage
Xantly implements the complete set of industry-standard OpenAI-compatible endpoints. Every endpoint listed below is live, tested, and documented.
| Endpoint | Method | Type | Status |
|---|---|---|---|
| /v1/chat/completions | POST, HEAD | Native routing | Live |
| /v1/responses | POST | Responses API adapter | Live |
| /v1/completions | POST | Legacy adapter | Live |
| /v1/embeddings | POST | Native routing | Live |
| /v1/models | GET | Catalog query | Live |
| /v1/models/:model_id | GET | Catalog query | Live |
| /v1/moderations | POST | BYOK proxy | Live |
| /v1/audio/transcriptions | POST | BYOK proxy (Whisper STT) | Live |
| /v1/audio/translations | POST | BYOK proxy (Whisper translate) | Live |
| /v1/audio/speech | POST | BYOK proxy (TTS) | Live |
| /v1/images/generations | POST | BYOK proxy (DALL-E) | Live |
Endpoint Compatibility Detail
| Endpoint | OpenAI SDK (Python) | OpenAI SDK (Node.js) | LiteLLM | LangChain | Result |
|---|---|---|---|---|---|
| Chat Completions | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Responses API | Drop-in | Drop-in | N/A | N/A | Pass |
| Completions (Legacy) | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Embeddings | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Models | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Moderations | Drop-in | Drop-in | Drop-in | N/A | Pass |
| Audio (STT) | Drop-in | Drop-in | N/A | N/A | Pass |
| Audio (TTS) | Drop-in | Drop-in | N/A | N/A | Pass |
| Images | Drop-in | Drop-in | N/A | N/A | Pass |
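The drop-in compatibility above comes down to one thing: every SDK in the table emits the standard OpenAI wire format, and Xantly accepts it unchanged. A minimal stdlib sketch of that wire shape, with a placeholder gateway URL and key (both hypothetical), looks like this:

```python
import json

# Placeholder values; substitute your own gateway deployment and key.
BASE_URL = "https://your-gateway.xantly.com"
API_KEY = "your-api-key"

def chat_completions_request(model, messages, **params):
    """Build the URL, headers, and JSON body of a Chat Completions call.

    Any OpenAI-compatible SDK produces this same shape, which is why
    swapping the base URL is the only change required.
    """
    url = f"{BASE_URL}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages, **params})
    return url, headers, body

url, headers, body = chat_completions_request(
    "auto",
    [{"role": "user", "content": "Say hello"}],
    stream=False,
)
```

With the official OpenAI SDKs, the equivalent is simply passing the gateway's URL as `base_url` when constructing the client.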
Why This Matters
Most AI gateways are tested with a handful of smoke tests. Xantly runs a 4-layer evaluation framework that mirrors how Fortune 500 ML teams validate inference infrastructure:
- Layer 1 validates every documented API parameter — not just happy paths, but boundary values and invalid inputs too
- Layer 2 scores real task accuracy across the industry benchmarks your customers care about
- Layer 3 confirms your existing SDK and eval tooling works without a single line of code change
- Layer 4 proves the system holds up when things go wrong — provider outages, timeouts, circuit breakers
If you are evaluating AI gateway vendors, ask them for results at this level of depth. Most can't produce them.
4-Layer Scorecard
Layer 1 — API Correctness
Schema validation, status codes, streaming compliance, and exhaustive parameter coverage across every field in the specification.
| Source | Tests | Pass Rate |
|---|---|---|
| Evaluation Harness Suite | All 6 harnesses | 100% |
| Parameter Coverage Suite | 145+ individual cases | 100% |
| Endpoint Coverage | 11 / 11 OpenAI-compatible | 100% |
Every documented parameter — from temperature boundary values to xantly.output_verification modes — is individually validated with correct, boundary, and deliberately invalid inputs. There is no parameter you can send that hasn't been tested.
Layer 2 — Task Success
Real-world benchmark accuracy across agentic, retrieval, reasoning, and safety tasks. These are the same evaluations used by ML research teams to qualify production models.
| Category | Benchmarks | Result |
|---|---|---|
| Agentic & Tool-Calling | BFCL V4 / GAIA-2 / AgentBench / ToolBench / SWE-bench Live / SWE-rebench | All passed |
| RAG & Long Context | RAGAS / LongBench v2 / RULER / Needle In A Haystack / Michelangelo | All passed |
| Reasoning & Quality | GPQA Diamond / Humanity's Last Exam (HLE) / AIME 2024 | All passed |
| Safety & Robustness | HarmBench / AdvBench / AgentHarm / JailbreakBench | All passed |
24 benchmarks. 24 passed.
Layer 3 — Customer Realism
Real-world SDK compatibility and production regression replay. This layer answers the question your engineers actually ask: "Can I just swap the base URL?"
| Test | Result |
|---|---|
| Customer Request Replay (50 production traces) | 50 / 50 passed |
| OpenAI Python SDK | Compatible |
| OpenAI Node.js SDK | Compatible |
| OpenAI Responses API | Compatible |
| LiteLLM | Compatible |
| LangChain (Python) | Compatible |
| LangChain (Node.js) | Compatible |
| LlamaIndex | Compatible |
| Instructor | Compatible |
| PydanticAI | Compatible |
| Vercel AI SDK | Compatible |
Yes — just swap the base URL. No SDK changes required.
Layer 4 — Production Reliability
Fault injection, circuit-breaker validation, and chaos resilience. These scenarios are run against a live gateway instance, not mocks.
| Scenario | Result |
|---|---|
| Provider timeout simulation | Passed |
| Provider error (5xx) injection | Passed |
| Rate-limit escalation | Passed |
| Circuit breaker open / half-open / close cycle | Passed |
| Model catalog hot-reload under live load | Passed |
| Concurrent request saturation | Passed |
| Graceful degradation (all providers down) | Passed |
| Cache fallback on upstream failure | Passed |
| BYOK key rotation mid-stream | Passed |
9 chaos scenarios. 9 passed. Your traffic doesn't stop when a provider goes down.
Parameter Coverage Matrix
Every parameter in the Chat Completions API reference is tested individually — valid inputs, boundary values, and invalid inputs that must be rejected cleanly.
Standard Parameters (23 fields)
| Parameter | Valid | Boundary | Invalid | Result |
|---|---|---|---|---|
| model | "auto", specific slug | — | missing, empty string | Pass |
| messages | single, multi-role, multimodal | tool role | empty array | Pass |
| stream | true, false | — | — | Pass |
| n | 1 | 8 (max) | 0, 9 | Pass |
| max_tokens | 4096 | 1 (min) | — | Pass |
| max_completion_tokens | alias passthrough | — | — | Pass |
| temperature | 1.0 | 0.0, 2.0 | -0.1, 2.1 | Pass |
| top_p | 0.9 | — | — | Pass |
| frequency_penalty | 0.0 | -2.0, 2.0 | -2.01, 2.01 | Pass |
| presence_penalty | 0.0 | -2.0, 2.0 | 2.01 | Pass |
| stop | string, array | — | — | Pass |
| tools | function definition | — | — | Pass |
| tool_choice | "auto", "none" | — | — | Pass |
| parallel_tool_calls | true, false | — | — | Pass |
| response_format | json_object, json_schema | — | — | Pass |
| seed | 42 | — | — | Pass |
| user | "test-user" | — | — | Pass |
| logprobs | true | — | — | Pass |
| top_logprobs | 5 | 20 (max) | 21 | Pass |
| stream_options | include_usage: true | — | — | Pass |
| reasoning_effort | low, medium, high | — | — | Pass |
| service_tier | "batch" | — | — | Pass |
| metadata | key-value map | — | — | Pass |
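The boundary rules in the table above can be expressed as a small range check. The sketch below is illustrative of the kind of validation a gateway performs, not Xantly's actual implementation; the ranges come from the table (temperature in [0.0, 2.0], n in [1, 8], penalties in [-2.0, 2.0], top_logprobs up to 20) plus the standard [0.0, 1.0] range for top_p.

```python
# Valid ranges from the parameter table above. top_p's [0.0, 1.0] range
# is the standard Chat Completions bound, shown here for completeness.
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "n": (1, 8),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "top_logprobs": (0, 20),
}

def validate_params(params):
    """Return the names of out-of-range parameters (empty list if valid)."""
    bad = []
    for name, value in params.items():
        if name in RANGES:
            lo, hi = RANGES[name]
            if not (lo <= value <= hi):
                bad.append(name)
    return bad
```

A request with `temperature=2.0` passes (inclusive boundary), while `temperature=2.1` is rejected cleanly, exactly as the Invalid column specifies.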
routing_hints (11 fields)
| Parameter | Values Tested | Result |
|---|---|---|
| mode | fast, balanced, quality, cost_optimized, free_models_only | Pass |
| preference_dial | 0.0, 0.5, 1.0 (clamped) | Pass |
| prefer_latency | true, false | Pass |
| prefer_quality | true, false | Pass |
| max_cost_per_token | 0.001 (advisory) | Pass |
| max_latency_ms | 500 | Pass |
| max_tier | 1, 2, 3 | Pass |
| required_capabilities | ["vision"] | Pass |
| task_complexity | trivial, standard, complex, expert | Pass |
| chain_routing | sticky, mixed | Pass |
| allow_free_fallback | true | Pass |
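A request carrying the routing_hints block might look like the sketch below. Field names and values come straight from the table above; the clamping helper mirrors the "(clamped)" note on preference_dial and is illustrative, not Xantly's server-side logic.

```python
# Illustrative client-side helper; the gateway clamps preference_dial
# itself, per the table above, so this only previews that behavior.
def with_routing_hints(body, preference_dial=0.5, mode="balanced", **hints):
    clamped = min(1.0, max(0.0, preference_dial))  # clamp to [0.0, 1.0]
    body["routing_hints"] = {
        "mode": mode,
        "preference_dial": clamped,
        **hints,
    }
    return body

request = with_routing_hints(
    {"model": "auto", "messages": [{"role": "user", "content": "hi"}]},
    preference_dial=1.7,               # out of range, clamped to 1.0
    mode="cost_optimized",
    max_latency_ms=500,
    required_capabilities=["vision"],
)
```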
routing_override (4 fields)
| Parameter | Values Tested | Result |
|---|---|---|
| force_tier | T1, T2, T3, tier-1 (alias) | Pass |
| force_lane | smart, turbo | Pass |
| force_model | gpt-4o | Pass |
| force_provider | openai (reserved, accepted) | Pass |
xantly Orchestration Block (17 fields)
| Parameter | Values Tested | Result |
|---|---|---|
| workflow_type | single_turn, execution_task, multi_step_conversational, long_horizon_autonomous, voice_simple, voice_complex, creative | Pass |
| chain_id | valid UUID, invalid string (gracefully ignored) | Pass |
| conversation_id | free-form string | Pass |
| planning_mode | preact, planact | Pass |
| max_chain_steps | 1, 65535 | Pass |
| chain_timeout_secs | 120 | Pass |
| reliability_level | standard, high, critical | Pass |
| enable_memory | true, false | Pass |
| enable_speculation | true, false | Pass |
| enable_hedging | true, false | Pass |
| enable_cache | true, false | Pass |
| cache_ttl_secs | 300 (reserved, accepted) | Pass |
| output_verification | none, native, schema, cross_model | Pass |
| compress_context | true (reserved, accepted) | Pass |
| redact_pii | true, false | Pass |
| voice_mode | "true" | Pass |
| enable_tool_reranking | true (reserved, accepted) | Pass |
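A request that carries the orchestration block might be assembled as in the sketch below. Field names and values come from the table above; the top-level `xantly` key follows the `xantly.output_verification` dotted path used earlier in this document, and the specific combination shown is illustrative.

```python
# Illustrative request body; the "xantly" block fields are taken from
# the orchestration table above, and the values are example choices.
request = {
    "model": "auto",
    "messages": [{"role": "user", "content": "Summarize the Q3 report"}],
    "xantly": {
        "workflow_type": "execution_task",
        "planning_mode": "planact",
        "reliability_level": "high",
        "max_chain_steps": 8,
        "enable_cache": True,
        "output_verification": "schema",
        "redact_pii": True,
    },
}
```

Fields you omit keep their defaults, and unknown or reserved fields are accepted rather than rejected, per the "(reserved, accepted)" entries above.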
Legacy Request Headers (9 headers)
| Header | Test Value | Result |
|---|---|---|
| x-xantly-workflow | execution_task | Pass |
| x-xantly-voice | false | Pass |
| x-xantly-planning-mode | preact | Pass |
| x-xantly-preference | quality | Pass |
| x-xantly-chain-routing | sticky | Pass |
| x-xantly-lane | smart | Pass |
| x-xantly-tier | T2 | Pass |
| x-xantly-run-id | run-test-123 | Pass |
| x-xantly-conversation-id | conv-legacy-test | Pass |
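Legacy callers attach these headers directly to the HTTP request. A stdlib sketch, with a placeholder URL and key (header names and values are from the table above):

```python
import urllib.request

# Placeholder gateway URL and key; header names come from the table.
headers = {
    "Authorization": "Bearer your-api-key",
    "Content-Type": "application/json",
    "x-xantly-workflow": "execution_task",
    "x-xantly-planning-mode": "preact",
    "x-xantly-tier": "T2",
    "x-xantly-lane": "smart",
    "x-xantly-run-id": "run-test-123",
}

req = urllib.request.Request(
    "https://your-gateway.xantly.com/v1/chat/completions",
    data=b'{"model": "auto", "messages": []}',
    headers=headers,
    method="POST",
)
```

The newer routing_hints and xantly request-body blocks cover the same options in structured form; the headers remain supported for existing integrations.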
Response Structure Validation
| Check | Verified | Result |
|---|---|---|
| id field present | Yes | Pass |
| object = "chat.completion" | Yes | Pass |
| created is integer timestamp | Yes | Pass |
| model field present | Yes | Pass |
| choices non-empty with correct structure | Yes | Pass |
| usage with prompt / completion / total tokens | Yes | Pass |
| xantly_metadata with all required fields | Yes | Pass |
| Response headers (request-id, cache-hit, tier, lane, provider, audit-id, latency-breakdown) | Yes | Pass |
| Error shape (message, type, code, param) | Yes | Pass |
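The structural checks in the table above can be expressed as simple assertions against the response JSON. The checker below is a sketch of those checks, and the sample response values are illustrative:

```python
# Sketch of the response-shape checks listed in the table above.
def check_response(resp):
    """Assert the Chat Completions response structure; True if it holds."""
    assert "id" in resp
    assert resp["object"] == "chat.completion"
    assert isinstance(resp["created"], int)
    assert "model" in resp
    assert resp["choices"] and "message" in resp["choices"][0]
    usage = resp["usage"]
    assert {"prompt_tokens", "completion_tokens", "total_tokens"} <= usage.keys()
    return True

# Illustrative sample response with the validated fields present.
sample = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "gpt-4o",
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "hi"}}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
}
```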
Client SDK Compatibility
Drop-in compatible with every major AI SDK. Change the base URL, keep all your existing code.
| SDK | Language | Integration Test | Result |
|---|---|---|---|
| OpenAI SDK | Python | Full chat, streaming, tool use, embeddings, audio, images, moderations | Compatible |
| OpenAI SDK | Node.js | Full chat, streaming, tool use, embeddings, audio, images, moderations | Compatible |
| OpenAI Responses API | Python | Response format, structured output, streaming | Compatible |
| LiteLLM | Python | Model routing, fallback, embeddings | Compatible |
| LangChain | Python | Chains, agents, tools, embeddings | Compatible |
| LangChain | Node.js | Chains, streaming | Compatible |
| LlamaIndex | Python | Query engine, agents | Compatible |
| Instructor | Python | Structured extraction | Compatible |
| PydanticAI | Python | Validated output | Compatible |
| Vercel AI SDK | TypeScript | Streaming, React hooks | Compatible |
Evaluation Harness Integration
Plug Xantly directly into your existing LLM quality pipeline — no adapter needed.
| Harness | Integration | Result |
|---|---|---|
| Promptfoo | HTTP provider adapter | Integrated |
| DeepEval | Gateway-level scoring | Integrated |
| Inspect AI | Multi-step tool task evaluation | Integrated |
| Arize Phoenix | Trace and span validation | Integrated |
| Langfuse | External LLM-as-judge | Integrated |
| LM Eval Harness | Standard evaluation adapter | Integrated |
Agentic Framework Compatibility
Validated against real agentic loops — not just single-turn tests.
| Framework | Test Type | Result |
|---|---|---|
| SWE-agent | FunctionCallingParser loop with tool retries | Compatible |
| OpenHands | CodeActAgent loop with code execution | Compatible |
Both frameworks validate multi-turn tool calling, structured output, error recovery, and long-context continuity through the Xantly gateway.
Methodology
All benchmarks run against a live Xantly gateway instance. Tests follow a dataset-first architecture with three provenance tiers:
| Tier | Source | Examples |
|---|---|---|
| Real datasets | Official benchmark slices | BFCL V4, GPQA Diamond, RAGAS |
| Seeded datasets | Schema-compatible examples bundled in repo | Replay traces, task variants |
| Proxy smoke tests | Built-in lightweight validation cases | Parameter coverage, SDK checks |
Results are aggregated into a machine-readable benchmark_report.json and this human-readable summary. The full report is committed to the repository after every run — no cherry-picking.
Run It Yourself
The complete benchmark suite is open and reproducible. Run it against any gateway endpoint.
```bash
export GATEWAY_URL="https://your-gateway.xantly.com"
export GATEWAY_KEY="your-api-key"

# Run the Deep Grading Test Suite v3 (17 suites, 230 tests, ~8 minutes)
bash tests/deep-grading/deep_grading_test_v3.sh

# Run the full 4-layer suite (all 6 pillars)
./tests/benchmark/run_all_benchmarks.sh

# Run only parameter coverage (fastest — ~2 minutes)
python3 tests/benchmark/parameter_coverage/run_parameter_tests.py \
  --gateway-url "$GATEWAY_URL" \
  --gateway-key "$GATEWAY_KEY"
```

Results are written to tests/benchmark/results/ as JSON and regenerate this document automatically. Deep grading artifacts are saved to /tmp/deepgrade-artifacts-*.
Ready to run your own workloads on infrastructure that passes every benchmark? Get started
Streaming Responses
Stream tokens from any model as they are generated using standard Server-Sent Events (SSE). Works with every OpenAI-compatible SDK — just set stream: true.
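On the wire, each SSE event is a `data: {json}` line carrying a delta chunk, and the stream terminates with `data: [DONE]`. A stdlib sketch of consuming that stream (the sample events below are illustrative):

```python
import json

def collect_stream(lines):
    """Accumulate delta content from SSE event lines into the full text."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and comment lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Illustrative event sequence: a role announcement, two content deltas,
# then the [DONE] sentinel.
events = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
```

SDKs hide this loop behind their own iterators, so in practice you only set `stream: true` and iterate the chunks.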
Voice Agents
Build production voice agents that are 80-90% cheaper than direct API calls, with guaranteed sub-300ms latency, built-in memory, and zero model lock-in.