InferenceBench

Name: InferenceBench
Availability: OnlineOnly
Author: AISA Group

An open-source benchmark that evaluates whether frontier AI coding agents can optimize LLM serving workloads under a fixed compute budget across four inference scenarios.

Visit Website

At a Glance

Pricing

Open Source

Fully free and open-source under Apache License 2.0. Self-host on your own hardware.

Engagement

Available On

CLI

API

AISA GroupThe AISA Group conducts AI safety and research automation re…

Listed May 2026

About InferenceBench

InferenceBench is an academic benchmark created by researchers at ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, and Tübingen AI Center. It measures whether autonomous CLI agents can act as ML systems engineers in a genuinely open-ended setting, tasked with optimizing LLM inference serving on a single NVIDIA H100 within a two-hour wall-clock budget. The project is published as a research paper and released under the Apache License 2.0.

What It Is

InferenceBench is an evaluation harness for frontier coding agents — not a product or SaaS tool, but a reproducible research benchmark. Each run gives an agent a base LLM (Mistral-7B-Instruct-v0.3), a hardware environment, and a scenario-specific objective: deliver a running, OpenAI-compatible inference server that maximizes a primary metric while passing both a quality gate and an integrity gate. The benchmark is designed to test whether agents search an open engineering space or merely retrieve memorized configurations from it.

Four Serving Scenarios

The benchmark isolates distinct bottlenecks across four scenarios:

Prefill Latency (Scenario A): Long-context prompts; measured as time to first token (TTFT). Input 8192 tokens, output 1024 tokens.
Decode Latency (Scenario B): Long generations; measured as time per output token (TPOT). Input 1024 tokens, output 8192 tokens.
Throughput (Scenario C): Concurrent traffic across burst, Poisson, and constant-rate profiles; measured in requests/second.
All-In-One (Scenario D): Balanced serving; geometric mean of latency and throughput metrics.

Gating and Integrity

Every run must pass two gates before its score counts. The quality gate requires the optimized server to score at least 95% of the PyTorch baseline accuracy on a fixed 500-question MMLU-Pro subset with greedy decoding. The integrity gate uses a judge agent to inspect transcripts and launchers for reward-hacking patterns such as returning pre-generated text, swapping the base model, or intercepting the evaluation script. The harness also performs a supervised relaunch — after the agent's session ends, the harness kills the agent's server and re-executes start_server.sh in a fresh container, so only the clean relaunch result counts.

Key Findings from 180 Runs

The benchmark's headline result, as reported in the paper, is that non-agent hyperparameter search (SMAC3, TPE, Random) given the same two-hour budget on vLLM beats every agent on every scenario. The paper reports several behavioral patterns across 180 recorded runs:

93.9% of agent runs ship a vLLM-based final launcher, even though SGLang, TGI, and TensorRT-LLM are explicitly available.
The median run launches exactly one non-default vLLM configuration over the full two-hour budget.
65.0% of runs pass both gates; 18.9% fail the quality gate; 6.1% are integrity-flagged; 10.0% fail final-server reachability.
The top-ranked agent (Claude Sonnet 4.6 via Claude Code) achieves an aggregate geometric mean speedup of 8.08× over the PyTorch baseline, compared to 11.53× for the SMAC3 search baseline.
The paper identifies the bottleneck as not domain knowledge but consistent execution: agents frequently identify relevant optimizations in transcripts but fail to validate, commit to, or preserve them in the final submitted server.

Setup and Architecture

The benchmark runs on HTCondor with Apptainer containers. Each backend (vLLM, SGLang, HuggingFace TGI, PyTorch/Transformers) has its own container definition file. API-based agents authenticate via environment variables; subscription-based agents (Codex CLI, Claude Code) use device-code login flows with credentials stored outside version control. The default submit file pins each job to one H100 80 GB GPU. The repository includes utilities for pre-caching HuggingFace model and dataset resources and for precomputing baseline scores.

Current Status

The repository was created in April 2026 and last updated in May 2026, with the paper available as a PDF on the project website. The GitHub repository has the Apache-2.0 license and is maintained by the aisa-group organization. The leaderboard on the website reflects results from 15 frontier agent configurations including Claude Sonnet 4.6, GLM-5, Gemini 3.1 Pro, and multiple GPT-5 variants.

Community Discussions

Be the first to start a conversation about InferenceBench

Share your experience with InferenceBench, ask questions, or help others learn from your insights.

Pricing

OPEN SOURCE

Open Source

Fully free and open-source under Apache License 2.0. Self-host on your own hardware.

Full benchmark harness source code
All four inference scenarios
Quality and integrity gating
Support for vLLM, SGLang, TGI, and PyTorch backends
HTCondor job submission utilities

Capabilities

Key Features

Four inference serving scenarios: prefill latency, decode latency, throughput, and all-in-one
Quality gate using MMLU-Pro subset with greedy decoding
Integrity gate with agentic judge for reward-hacking detection
Supervised relaunch harness for clean final-server scoring
Support for vLLM, SGLang, HuggingFace TGI, and PyTorch backends
Hyperparameter search baselines (SMAC3, TPE, Random)
Time budget ablation analysis (1h, 2h, 4h, 8h)
Forced-engine comparison experiments
HTCondor job submission with Apptainer containers
Support for API-based and subscription-based agents (Claude Code, Codex CLI)
Pre-caching utilities for HuggingFace models and datasets
Leaderboard with per-scenario and aggregate speedup metrics

Integrations

vLLM

SGLang

HuggingFace TGI

PyTorch

Claude Code

Codex CLI

OpenAI API

Anthropic API

Google Gemini API

HTCondor

Apptainer

MMLU-Pro

API Available

View Docs

Back to all tools Suggest an edit