# InferenceBench

> An open-source benchmark that evaluates whether frontier AI coding agents can optimize LLM serving workloads under a fixed compute budget across four inference scenarios.

InferenceBench is an academic benchmark created by researchers at ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, and Tübingen AI Center. It measures whether autonomous CLI agents can act as ML systems engineers in a genuinely open-ended setting, tasked with optimizing LLM inference serving on a single NVIDIA H100 within a two-hour wall-clock budget. The project is published as a research paper and released under the Apache License 2.0.

## What It Is

InferenceBench is an evaluation harness for frontier coding agents — not a product or SaaS tool, but a reproducible research benchmark. Each run gives an agent a base LLM (Mistral-7B-Instruct-v0.3), a hardware environment, and a scenario-specific objective: deliver a running, OpenAI-compatible inference server that maximizes a primary metric while passing both a quality gate and an integrity gate. The benchmark is designed to test whether agents *search* an open engineering space or merely *retrieve* memorized configurations from it.

## Four Serving Scenarios

The benchmark isolates distinct bottlenecks across four scenarios:

- **Prefill Latency (Scenario A):** Long-context prompts; measured as time to first token (TTFT). Input 8192 tokens, output 1024 tokens.
- **Decode Latency (Scenario B):** Long generations; measured as time per output token (TPOT). Input 1024 tokens, output 8192 tokens.
- **Throughput (Scenario C):** Concurrent traffic across burst, Poisson, and constant-rate profiles; measured in requests/second.
- **All-In-One (Scenario D):** Balanced serving; geometric mean of latency and throughput metrics.

## Gating and Integrity

Every run must pass two gates before its score counts. The **quality gate** requires the optimized server to score at least 95% of the PyTorch baseline accuracy on a fixed 500-question MMLU-Pro subset with greedy decoding. The **integrity gate** uses a judge agent to inspect transcripts and launchers for reward-hacking patterns such as returning pre-generated text, swapping the base model, or intercepting the evaluation script. The harness also performs a supervised relaunch — after the agent's session ends, the harness kills the agent's server and re-executes `start_server.sh` in a fresh container, so only the clean relaunch result counts.

## Key Findings from 180 Runs

The benchmark's headline result, as reported in the paper, is that non-agent hyperparameter search (SMAC3, TPE, Random) given the same two-hour budget on vLLM beats every agent on every scenario. The paper reports several behavioral patterns across 180 recorded runs:

- **93.9%** of agent runs ship a vLLM-based final launcher, even though SGLang, TGI, and TensorRT-LLM are explicitly available.
- The median run launches exactly one non-default vLLM configuration over the full two-hour budget.
- **65.0%** of runs pass both gates; **18.9%** fail the quality gate; **6.1%** are integrity-flagged; **10.0%** fail final-server reachability.
- The top-ranked agent (Claude Sonnet 4.6 via Claude Code) achieves an aggregate geometric mean speedup of 8.08× over the PyTorch baseline, compared to 11.53× for the SMAC3 search baseline.
- The paper identifies the bottleneck as not domain knowledge but consistent execution: agents frequently identify relevant optimizations in transcripts but fail to validate, commit to, or preserve them in the final submitted server.

## Setup and Architecture

The benchmark runs on HTCondor with Apptainer containers. Each backend (vLLM, SGLang, HuggingFace TGI, PyTorch/Transformers) has its own container definition file. API-based agents authenticate via environment variables; subscription-based agents (Codex CLI, Claude Code) use device-code login flows with credentials stored outside version control. The default submit file pins each job to one H100 80 GB GPU. The repository includes utilities for pre-caching HuggingFace model and dataset resources and for precomputing baseline scores.

## Current Status

The repository was created in April 2026 and last updated in May 2026, with the paper available as a PDF on the project website. The GitHub repository has the Apache-2.0 license and is maintained by the `aisa-group` organization. The leaderboard on the website reflects results from 15 frontier agent configurations including Claude Sonnet 4.6, GLM-5, Gemini 3.1 Pro, and multiple GPT-5 variants.

## Features
- Four inference serving scenarios: prefill latency, decode latency, throughput, and all-in-one
- Quality gate using MMLU-Pro subset with greedy decoding
- Integrity gate with agentic judge for reward-hacking detection
- Supervised relaunch harness for clean final-server scoring
- Support for vLLM, SGLang, HuggingFace TGI, and PyTorch backends
- Hyperparameter search baselines (SMAC3, TPE, Random)
- Time budget ablation analysis (1h, 2h, 4h, 8h)
- Forced-engine comparison experiments
- HTCondor job submission with Apptainer containers
- Support for API-based and subscription-based agents (Claude Code, Codex CLI)
- Pre-caching utilities for HuggingFace models and datasets
- Leaderboard with per-scenario and aggregate speedup metrics

## Integrations
vLLM, SGLang, HuggingFace TGI, PyTorch, Claude Code, Codex CLI, OpenAI API, Anthropic API, Google Gemini API, HTCondor, Apptainer, MMLU-Pro

## Platforms
CLI, API

## Pricing
Open Source

## Version
main

## Links
- Website: https://inferencebench.ai
- Documentation: https://github.com/aisa-group/InferenceBench
- Repository: https://github.com/aisa-group/InferenceBench
- EveryDev.ai: https://www.everydev.ai/tools/inferencebench