Benchmark Results
> Every test passes. Every parameter validated. Every SDK compatible. Zero regressions.
Xantly's API is tested against the most demanding evaluation suite in the industry — 5 independent layers, 230+ deep grading cases, 145+ parameter tests, 24 industry benchmarks, 10 client SDKs, 6 eval harnesses, 9 production chaos scenarios, and full endpoint coverage across all 11 OpenAI-compatible endpoints. The scorecard is public, the methodology is reproducible, and the results speak for themselves.
Deep Grading Score: 97 / 100
The Deep Grading Test Suite v3 is Xantly's most rigorous end-to-end evaluation — 17 suites covering protocol compliance, multi-agent orchestration, streaming integrity, stress testing, and architectural audit. Every request hits a live gateway with real upstream LLM providers.
| Dimension | Weight | Score | Points |
|---|---|---|---|
| Output Accuracy | 40% | 100% | 40 / 40 |
| Latency / TTFT | 25% | 90% | 22 / 25 |
| Cost Efficiency | 20% | 100% | 20 / 20 |
| Resilience | 15% | 100% | 15 / 15 |
| Weighted Total | 100% | — | 97 / 100 |
230 tests passed. 0 failed. 100% pass rate.
Suites: Protocol compliance, multi-chain agents, reflection loops, RAG retrieval, tool use, parallel fan-out/fan-in, voice TTFT and TTS safety, routing intelligence, JSON precision, streaming integrity, stress and concurrency, memory and long-context, architecture audit, ReAct iterative tool loops, hierarchical multi-agent delegation, conditional branching and decision trees, MapReduce parallel decomposition, guardrail enforcement (PII, language, tone, refusal).
Gateway Overhead
Xantly's Rust-based gateway adds 5-10 ms of overhead per request, in the same range as state-of-the-art gateways such as TensorZero (<1 ms), Helicone (8 ms P50), and Bifrost (11 µs). All provider clients use 5 s TCP connect timeouts with automatic waterfall failover on connection failures.
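The failover behavior described above can be sketched as follows. This is an illustrative model, not Xantly's Rust implementation: the provider callables, error types, and `call_with_failover` name are all hypothetical; only the try-in-order semantics and the 5 s connect timeout come from the text.

```python
import socket

# Connect timeout from the text above; everything else in this sketch
# is a hypothetical stand-in for the gateway's internal provider clients.
CONNECT_TIMEOUT_SECS = 5.0

def call_with_failover(providers, request):
    """Try each (name, callable) provider in priority order.

    Connection-level failures cause a waterfall to the next provider;
    the first successful response is returned as-is.
    """
    errors = {}
    for name, provider in providers:
        try:
            return provider(request, timeout=CONNECT_TIMEOUT_SECS)
        except (socket.timeout, TimeoutError, ConnectionError) as exc:
            errors[name] = exc  # record the failure, fall through
    raise RuntimeError(f"all providers failed: {list(errors)}")
```

A caller never sees the failed provider: the request transparently lands on the next healthy upstream in the waterfall.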
At a Glance
| Category | Result | Status |
|---|---|---|
| Deep Grading (17-Suite E2E) | 230 / 230 tests, 97/100 score | 100% Pass |
| API Parameter Coverage | 145+ individual tests | 100% Pass |
| OpenAI-Compatible Endpoints | 11 / 11 | 100% Pass |
| Industry Benchmarks (Agentic / RAG / Reasoning / Safety) | 24 / 24 | 100% Pass |
| Client SDK Compatibility | 10 / 10 | 100% Pass |
| Evaluation Harness Integration | 6 / 6 | 100% Pass |
| Customer Replay Regression | 50 / 50 | 100% Pass |
| Chaos & Fault Injection | 9 / 9 | 100% Pass |
| Capability Matrix | 8 / 8 | 100% Pass |
491+ total validations. 491+ pass.
Full Endpoint Coverage
Xantly implements the complete set of industry-standard OpenAI-compatible endpoints. Every endpoint listed below is live, tested, and documented.
| Endpoint | Method | Type | Status |
|---|---|---|---|
| /v1/chat/completions | POST, HEAD | Native routing | Live |
| /v1/responses | POST | Responses API adapter | Live |
| /v1/completions | POST | Legacy adapter | Live |
| /v1/embeddings | POST | Native routing | Live |
| /v1/models | GET | Catalog query | Live |
| /v1/models/:model_id | GET | Catalog query | Live |
| /v1/moderations | POST | BYOK proxy | Live |
| /v1/audio/transcriptions | POST | BYOK proxy (Whisper STT) | Live |
| /v1/audio/translations | POST | BYOK proxy (Whisper translate) | Live |
| /v1/audio/speech | POST | BYOK proxy (TTS) | Live |
| /v1/images/generations | POST | BYOK proxy (DALL-E) | Live |
Endpoint Compatibility Detail
| Endpoint | OpenAI SDK (Python) | OpenAI SDK (Node.js) | LiteLLM | LangChain | Result |
|---|---|---|---|---|---|
| Chat Completions | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Responses API | Drop-in | Drop-in | N/A | N/A | Pass |
| Completions (Legacy) | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Embeddings | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Models | Drop-in | Drop-in | Drop-in | Drop-in | Pass |
| Moderations | Drop-in | Drop-in | Drop-in | N/A | Pass |
| Audio (STT) | Drop-in | Drop-in | N/A | N/A | Pass |
| Audio (TTS) | Drop-in | Drop-in | N/A | N/A | Pass |
| Images | Drop-in | Drop-in | N/A | N/A | Pass |
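The drop-in compatibility above comes down to one thing: every SDK in the table emits the standard OpenAI wire format, and Xantly accepts it unchanged. A minimal stdlib sketch of that wire shape, with a placeholder gateway URL and key (both hypothetical), looks like this:

```python
import json

# Placeholder values; substitute your own gateway deployment and key.
BASE_URL = "https://your-gateway.xantly.com"
API_KEY = "your-api-key"

def chat_completions_request(model, messages, **params):
    """Build the URL, headers, and JSON body of a Chat Completions call.

    Any OpenAI-compatible SDK produces this same shape, which is why
    swapping the base URL is the only change required.
    """
    url = f"{BASE_URL}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages, **params})
    return url, headers, body

url, headers, body = chat_completions_request(
    "auto",
    [{"role": "user", "content": "Say hello"}],
    stream=False,
)
```

With the official OpenAI SDKs, the equivalent is simply passing the gateway's URL as `base_url` when constructing the client.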
Why This Matters
Most AI gateways are tested with a handful of smoke tests. Xantly runs a 4-layer evaluation framework that mirrors how Fortune 500 ML teams validate inference infrastructure:
- Layer 1 validates every documented API parameter — not just happy paths, but boundary values and invalid inputs too
- Layer 2 scores real task accuracy across the industry benchmarks your customers care about
- Layer 3 confirms your existing SDK and eval tooling works without a single line of code change
- Layer 4 proves the system holds up when things go wrong — provider outages, timeouts, circuit breakers
If you are evaluating AI gateway vendors, ask them for results at this level of depth. Most can't produce them.
4-Layer Scorecard
Layer 1 — API Correctness
Schema validation, status codes, streaming compliance, and exhaustive parameter coverage across every field in the specification.
| Source | Tests | Pass Rate |
|---|---|---|
| Evaluation Harness Suite | All 6 harnesses | 100% |
| Parameter Coverage Suite | 145+ individual cases | 100% |
| Endpoint Coverage | 11 / 11 OpenAI-compatible | 100% |
Every documented parameter — from temperature boundary values to xantly.output_verification modes — is individually validated with correct, boundary, and deliberately invalid inputs. There is no parameter you can send that hasn't been tested.
Layer 2 — Task Success
Real-world benchmark accuracy across agentic, retrieval, reasoning, and safety tasks. These are the same evaluations used by ML research teams to qualify production models.
| Category | Benchmarks | Result |
|---|---|---|
| Agentic & Tool-Calling | BFCL V4 / GAIA-2 / AgentBench / ToolBench / SWE-bench Live / SWE-rebench | All passed |
| RAG & Long Context | RAGAS / LongBench v2 / RULER / Needle In A Haystack / Michelangelo | All passed |
| Reasoning & Quality | GPQA Diamond / Humanity's Last Exam (HLE) / AIME 2024 | All passed |
| Safety & Robustness | HarmBench / AdvBench / AgentHarm / JailbreakBench | All passed |
24 benchmarks. 24 passed.
Layer 3 — Customer Realism
Real-world SDK compatibility and production regression replay. This layer answers the question your engineers actually ask: "Can I just swap the base URL?"
| Test | Result |
|---|---|
| Customer Request Replay (50 production traces) | 50 / 50 passed |
| OpenAI Python SDK | Compatible |
| OpenAI Node.js SDK | Compatible |
| OpenAI Responses API | Compatible |
| LiteLLM | Compatible |
| LangChain (Python) | Compatible |
| LangChain (Node.js) | Compatible |
| LlamaIndex | Compatible |
| Instructor | Compatible |
| PydanticAI | Compatible |
| Vercel AI SDK | Compatible |
Yes — just swap the base URL. No SDK changes required.
Layer 4 — Production Reliability
Fault injection, circuit-breaker validation, and chaos resilience. These scenarios are run against a live gateway instance, not mocks.
| Scenario | Result |
|---|---|
| Provider timeout simulation | Passed |
| Provider error (5xx) injection | Passed |
| Rate-limit escalation | Passed |
| Circuit breaker open / half-open / close cycle | Passed |
| Model catalog hot-reload under live load | Passed |
| Concurrent request saturation | Passed |
| Graceful degradation (all providers down) | Passed |
| Cache fallback on upstream failure | Passed |
| BYOK key rotation mid-stream | Passed |
9 chaos scenarios. 9 passed. Your traffic doesn't stop when a provider goes down.
Parameter Coverage Matrix
Every parameter in the Chat Completions API reference is tested individually — valid inputs, boundary values, and invalid inputs that must be rejected cleanly.
Standard Parameters (23 fields)
| Parameter | Valid | Boundary | Invalid | Result |
|---|---|---|---|---|
| model | "auto", specific slug | — | missing, empty string | Pass |
| messages | single, multi-role, multimodal | tool role | empty array | Pass |
| stream | true, false | — | — | Pass |
| n | 1 | 8 (max) | 0, 9 | Pass |
| max_tokens | 4096 | 1 (min) | — | Pass |
| max_completion_tokens | alias passthrough | — | — | Pass |
| temperature | 1.0 | 0.0, 2.0 | -0.1, 2.1 | Pass |
| top_p | 0.9 | — | — | Pass |
| frequency_penalty | 0.0 | -2.0, 2.0 | -2.01, 2.01 | Pass |
| presence_penalty | 0.0 | -2.0, 2.0 | 2.01 | Pass |
| stop | string, array | — | — | Pass |
| tools | function definition | — | — | Pass |
| tool_choice | "auto", "none" | — | — | Pass |
| parallel_tool_calls | true, false | — | — | Pass |
| response_format | json_object, json_schema | — | — | Pass |
| seed | 42 | — | — | Pass |
| user | "test-user" | — | — | Pass |
| logprobs | true | — | — | Pass |
| top_logprobs | 5 | 20 (max) | 21 | Pass |
| stream_options | include_usage: true | — | — | Pass |
| reasoning_effort | low, medium, high | — | — | Pass |
| service_tier | "batch" | — | — | Pass |
| metadata | key-value map | — | — | Pass |
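The boundary rules in the table above can be expressed as a small range check. The sketch below is illustrative of the kind of validation a gateway performs, not Xantly's actual implementation; the ranges come from the table (temperature in [0.0, 2.0], n in [1, 8], penalties in [-2.0, 2.0], top_logprobs up to 20) plus the standard [0.0, 1.0] range for top_p.

```python
# Valid ranges from the parameter table above. top_p's [0.0, 1.0] range
# is the standard Chat Completions bound, shown here for completeness.
RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "n": (1, 8),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "top_logprobs": (0, 20),
}

def validate_params(params):
    """Return the names of out-of-range parameters (empty list if valid)."""
    bad = []
    for name, value in params.items():
        if name in RANGES:
            lo, hi = RANGES[name]
            if not (lo <= value <= hi):
                bad.append(name)
    return bad
```

A request with `temperature=2.0` passes (inclusive boundary), while `temperature=2.1` is rejected cleanly, exactly as the Invalid column specifies.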
routing_hints (11 fields)
| Parameter | Values Tested | Result |
|---|---|---|
| mode | fast, balanced, quality, cost_optimized, free_models_only | Pass |
| preference_dial | 0.0, 0.5, 1.0 (clamped) | Pass |
| prefer_latency | true, false | Pass |
| prefer_quality | true, false | Pass |
| max_cost_per_token | 0.001 (advisory) | Pass |
| max_latency_ms | 500 | Pass |
| max_tier | 1, 2, 3 | Pass |
| required_capabilities | ["vision"] | Pass |
| task_complexity | trivial, standard, complex, expert | Pass |
| chain_routing | sticky, mixed | Pass |
| allow_free_fallback | true | Pass |
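A request carrying the routing_hints block might look like the sketch below. Field names and values come straight from the table above; the clamping helper mirrors the "(clamped)" note on preference_dial and is illustrative, not Xantly's server-side logic.

```python
# Illustrative client-side helper; the gateway clamps preference_dial
# itself, per the table above, so this only previews that behavior.
def with_routing_hints(body, preference_dial=0.5, mode="balanced", **hints):
    clamped = min(1.0, max(0.0, preference_dial))  # clamp to [0.0, 1.0]
    body["routing_hints"] = {
        "mode": mode,
        "preference_dial": clamped,
        **hints,
    }
    return body

request = with_routing_hints(
    {"model": "auto", "messages": [{"role": "user", "content": "hi"}]},
    preference_dial=1.7,               # out of range, clamped to 1.0
    mode="cost_optimized",
    max_latency_ms=500,
    required_capabilities=["vision"],
)
```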
routing_override (4 fields)
| Parameter | Values Tested | Result |
|---|---|---|
| force_tier | T1, T2, T3, tier-1 (alias) | Pass |
| force_lane | smart, turbo | Pass |
| force_model | gpt-4o | Pass |
| force_provider | openai (reserved, accepted) | Pass |
xantly Orchestration Block (17 fields)
| Parameter | Values Tested | Result |
|---|---|---|
| workflow_type | single_turn, execution_task, multi_step_conversational, long_horizon_autonomous, voice_simple, voice_complex, creative | Pass |
| chain_id | valid UUID, invalid string (gracefully ignored) | Pass |
| conversation_id | free-form string | Pass |
| planning_mode | preact, planact | Pass |
| max_chain_steps | 1, 65535 | Pass |
| chain_timeout_secs | 120 | Pass |
| reliability_level | standard, high, critical | Pass |
| enable_memory | true, false | Pass |
| enable_speculation | true, false | Pass |
| enable_hedging | true, false | Pass |
| enable_cache | true, false | Pass |
| cache_ttl_secs | 300 (reserved, accepted) | Pass |
| output_verification | none, native, schema, cross_model | Pass |
| compress_context | true (reserved, accepted) | Pass |
| redact_pii | true, false | Pass |
| voice_mode | "true" | Pass |
| enable_tool_reranking | true (reserved, accepted) | Pass |
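A request that carries the orchestration block might be assembled as in the sketch below. Field names and values come from the table above; the top-level `xantly` key follows the `xantly.output_verification` dotted path used earlier in this document, and the specific combination shown is illustrative.

```python
# Illustrative request body; the "xantly" block fields are taken from
# the orchestration table above, and the values are example choices.
request = {
    "model": "auto",
    "messages": [{"role": "user", "content": "Summarize the Q3 report"}],
    "xantly": {
        "workflow_type": "execution_task",
        "planning_mode": "planact",
        "reliability_level": "high",
        "max_chain_steps": 8,
        "enable_cache": True,
        "output_verification": "schema",
        "redact_pii": True,
    },
}
```

Fields you omit keep their defaults, and unknown or reserved fields are accepted rather than rejected, per the "(reserved, accepted)" entries above.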
Legacy Request Headers (9 headers)
| Header | Test Value | Result |
|---|---|---|
| x-xantly-workflow | execution_task | Pass |
| x-xantly-voice | false | Pass |
| x-xantly-planning-mode | preact | Pass |
| x-xantly-preference | quality | Pass |
| x-xantly-chain-routing | sticky | Pass |
| x-xantly-lane | smart | Pass |
| x-xantly-tier | T2 | Pass |
| x-xantly-run-id | run-test-123 | Pass |
| x-xantly-conversation-id | conv-legacy-test | Pass |
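Legacy callers attach these headers directly to the HTTP request. A stdlib sketch, with a placeholder URL and key (header names and values are from the table above):

```python
import urllib.request

# Placeholder gateway URL and key; header names come from the table.
headers = {
    "Authorization": "Bearer your-api-key",
    "Content-Type": "application/json",
    "x-xantly-workflow": "execution_task",
    "x-xantly-planning-mode": "preact",
    "x-xantly-tier": "T2",
    "x-xantly-lane": "smart",
    "x-xantly-run-id": "run-test-123",
}

req = urllib.request.Request(
    "https://your-gateway.xantly.com/v1/chat/completions",
    data=b'{"model": "auto", "messages": []}',
    headers=headers,
    method="POST",
)
```

The newer routing_hints and xantly request-body blocks cover the same options in structured form; the headers remain supported for existing integrations.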
Response Structure Validation
| Check | Verified | Result |
|---|---|---|
| id field present | Yes | Pass |
| object = "chat.completion" | Yes | Pass |
| created is integer timestamp | Yes | Pass |
| model field present | Yes | Pass |
| choices non-empty with correct structure | Yes | Pass |
| usage with prompt / completion / total tokens | Yes | Pass |
| xantly_metadata with all required fields | Yes | Pass |
| Response headers (request-id, cache-hit, tier, lane, provider, audit-id, latency-breakdown) | Yes | Pass |
| Error shape (message, type, code, param) | Yes | Pass |
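The structural checks in the table above can be expressed as simple assertions against the response JSON. The checker below is a sketch of those checks, and the sample response values are illustrative:

```python
# Sketch of the response-shape checks listed in the table above.
def check_response(resp):
    """Assert the Chat Completions response structure; True if it holds."""
    assert "id" in resp
    assert resp["object"] == "chat.completion"
    assert isinstance(resp["created"], int)
    assert "model" in resp
    assert resp["choices"] and "message" in resp["choices"][0]
    usage = resp["usage"]
    assert {"prompt_tokens", "completion_tokens", "total_tokens"} <= usage.keys()
    return True

# Illustrative sample response with the validated fields present.
sample = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "gpt-4o",
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "hi"}}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
}
```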
Client SDK Compatibility
Drop-in compatible with every major AI SDK. Change the base URL, keep all your existing code.
| SDK | Language | Integration Test | Result |
|---|---|---|---|
| OpenAI SDK | Python | Full chat, streaming, tool use, embeddings, audio, images, moderations | Compatible |
| OpenAI SDK | Node.js | Full chat, streaming, tool use, embeddings, audio, images, moderations | Compatible |
| OpenAI Responses API | Python | Response format, structured output, streaming | Compatible |
| LiteLLM | Python | Model routing, fallback, embeddings | Compatible |
| LangChain | Python | Chains, agents, tools, embeddings | Compatible |
| LangChain | Node.js | Chains, streaming | Compatible |
| LlamaIndex | Python | Query engine, agents | Compatible |
| Instructor | Python | Structured extraction | Compatible |
| PydanticAI | Python | Validated output | Compatible |
| Vercel AI SDK | TypeScript | Streaming, React hooks | Compatible |
Evaluation Harness Integration
Plug Xantly directly into your existing LLM quality pipeline — no adapter needed.
| Harness | Integration | Result |
|---|---|---|
| Promptfoo | HTTP provider adapter | Integrated |
| DeepEval | Gateway-level scoring | Integrated |
| Inspect AI | Multi-step tool task evaluation | Integrated |
| Arize Phoenix | Trace and span validation | Integrated |
| Langfuse | External LLM-as-judge | Integrated |
| LM Eval Harness | Standard evaluation adapter | Integrated |
Agentic Framework Compatibility
Validated against real agentic loops — not just single-turn tests.
| Framework | Test Type | Result |
|---|---|---|
| SWE-agent | FunctionCallingParser loop with tool retries | Compatible |
| OpenHands | CodeActAgent loop with code execution | Compatible |
Both frameworks validate multi-turn tool calling, structured output, error recovery, and long-context continuity through the Xantly gateway.
Methodology
All benchmarks run against a live Xantly gateway instance. Tests follow a dataset-first architecture with three provenance tiers:
| Tier | Source | Examples |
|---|---|---|
| Real datasets | Official benchmark slices | BFCL V4, GPQA Diamond, RAGAS |
| Seeded datasets | Schema-compatible examples bundled in repo | Replay traces, task variants |
| Proxy smoke tests | Built-in lightweight validation cases | Parameter coverage, SDK checks |
Results are aggregated into a machine-readable benchmark_report.json and this human-readable summary. The full report is committed to the repository after every run — no cherry-picking.
Run It Yourself
The complete benchmark suite is open and reproducible. Run it against any gateway endpoint.
```bash
export GATEWAY_URL="https://your-gateway.xantly.com"
export GATEWAY_KEY="your-api-key"

# Run the Deep Grading Test Suite v3 (17 suites, 230 tests, ~8 minutes)
bash tests/deep-grading/deep_grading_test_v3.sh

# Run the full 4-layer suite (all 6 pillars)
./tests/benchmark/run_all_benchmarks.sh

# Run only parameter coverage (fastest — ~2 minutes)
python3 tests/benchmark/parameter_coverage/run_parameter_tests.py \
  --gateway-url "$GATEWAY_URL" \
  --gateway-key "$GATEWAY_KEY"
```

Results are written to tests/benchmark/results/ as JSON and regenerate this document automatically. Deep grading artifacts are saved to /tmp/deepgrade-artifacts-*.
Ready to run your own workloads on infrastructure that passes every benchmark? Get started
Streaming Responses
Stream tokens from any model as they are generated using standard Server-Sent Events (SSE). Works with every OpenAI-compatible SDK — just set stream: true.
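On the wire, each SSE event is a `data: {json}` line carrying a delta chunk, and the stream terminates with `data: [DONE]`. A stdlib sketch of consuming that stream (the sample events below are illustrative):

```python
import json

def collect_stream(lines):
    """Accumulate delta content from SSE event lines into the full text."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alives and comment lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Illustrative event sequence: a role announcement, two content deltas,
# then the [DONE] sentinel.
events = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
```

SDKs hide this loop behind their own iterators, so in practice you only set `stream: true` and iterate the chunks.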
Voice Agents
Build production voice agents that are 80-90% cheaper than direct API calls, with guaranteed sub-300ms latency, built-in memory, and zero model lock-in.