InferenceBench
An open-source benchmark that evaluates whether frontier AI coding agents can optimize LLM serving workloads under a fixed compute budget across four inference scenarios.
At a Glance
Fully free and open-source under Apache License 2.0. Self-host on your own hardware.
Engagement
Available On
Alternatives
Listed May 2026
About InferenceBench
InferenceBench is an academic benchmark created by researchers at ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, and Tübingen AI Center. It measures whether autonomous CLI agents can act as ML systems engineers in a genuinely open-ended setting, tasked with optimizing LLM inference serving on a single NVIDIA H100 within a two-hour wall-clock budget. The project is published as a research paper and released under the Apache License 2.0.
What It Is
InferenceBench is an evaluation harness for frontier coding agents — not a product or SaaS tool, but a reproducible research benchmark. Each run gives an agent a base LLM (Mistral-7B-Instruct-v0.3), a hardware environment, and a scenario-specific objective: deliver a running, OpenAI-compatible inference server that maximizes a primary metric while passing both a quality gate and an integrity gate. The benchmark is designed to test whether agents search an open engineering space or merely retrieve memorized configurations from it.
Four Serving Scenarios
The benchmark isolates distinct bottlenecks across four scenarios:
- Prefill Latency (Scenario A): Long-context prompts; measured as time to first token (TTFT). Input 8192 tokens, output 1024 tokens.
- Decode Latency (Scenario B): Long generations; measured as time per output token (TPOT). Input 1024 tokens, output 8192 tokens.
- Throughput (Scenario C): Concurrent traffic across burst, Poisson, and constant-rate profiles; measured in requests/second.
- All-In-One (Scenario D): Balanced serving; geometric mean of latency and throughput metrics.
Gating and Integrity
Every run must pass two gates before its score counts. The quality gate requires the optimized server to score at least 95% of the PyTorch baseline accuracy on a fixed 500-question MMLU-Pro subset with greedy decoding. The integrity gate uses a judge agent to inspect transcripts and launchers for reward-hacking patterns such as returning pre-generated text, swapping the base model, or intercepting the evaluation script. The harness also performs a supervised relaunch — after the agent's session ends, the harness kills the agent's server and re-executes start_server.sh in a fresh container, so only the clean relaunch result counts.
Key Findings from 180 Runs
The benchmark's headline result, as reported in the paper, is that non-agent hyperparameter search (SMAC3, TPE, Random) given the same two-hour budget on vLLM beats every agent on every scenario. The paper reports several behavioral patterns across 180 recorded runs:
- 93.9% of agent runs ship a vLLM-based final launcher, even though SGLang, TGI, and TensorRT-LLM are explicitly available.
- The median run launches exactly one non-default vLLM configuration over the full two-hour budget.
- 65.0% of runs pass both gates; 18.9% fail the quality gate; 6.1% are integrity-flagged; 10.0% fail final-server reachability.
- The top-ranked agent (Claude Sonnet 4.6 via Claude Code) achieves an aggregate geometric mean speedup of 8.08× over the PyTorch baseline, compared to 11.53× for the SMAC3 search baseline.
- The paper identifies the bottleneck as not domain knowledge but consistent execution: agents frequently identify relevant optimizations in transcripts but fail to validate, commit to, or preserve them in the final submitted server.
Setup and Architecture
The benchmark runs on HTCondor with Apptainer containers. Each backend (vLLM, SGLang, HuggingFace TGI, PyTorch/Transformers) has its own container definition file. API-based agents authenticate via environment variables; subscription-based agents (Codex CLI, Claude Code) use device-code login flows with credentials stored outside version control. The default submit file pins each job to one H100 80 GB GPU. The repository includes utilities for pre-caching HuggingFace model and dataset resources and for precomputing baseline scores.
Current Status
The repository was created in April 2026 and last updated in May 2026, with the paper available as a PDF on the project website. The GitHub repository has the Apache-2.0 license and is maintained by the aisa-group organization. The leaderboard on the website reflects results from 15 frontier agent configurations including Claude Sonnet 4.6, GLM-5, Gemini 3.1 Pro, and multiple GPT-5 variants.
Community Discussions
Be the first to start a conversation about InferenceBench
Share your experience with InferenceBench, ask questions, or help others learn from your insights.
Pricing
Open Source
Fully free and open-source under Apache License 2.0. Self-host on your own hardware.
- Full benchmark harness source code
- All four inference scenarios
- Quality and integrity gating
- Support for vLLM, SGLang, TGI, and PyTorch backends
- HTCondor job submission utilities
Capabilities
Key Features
- Four inference serving scenarios: prefill latency, decode latency, throughput, and all-in-one
- Quality gate using MMLU-Pro subset with greedy decoding
- Integrity gate with agentic judge for reward-hacking detection
- Supervised relaunch harness for clean final-server scoring
- Support for vLLM, SGLang, HuggingFace TGI, and PyTorch backends
- Hyperparameter search baselines (SMAC3, TPE, Random)
- Time budget ablation analysis (1h, 2h, 4h, 8h)
- Forced-engine comparison experiments
- HTCondor job submission with Apptainer containers
- Support for API-based and subscription-based agents (Claude Code, Codex CLI)
- Pre-caching utilities for HuggingFace models and datasets
- Leaderboard with per-scenario and aggregate speedup metrics
