EveryDev.ai
    InferenceBench

    LLM Evaluations

    An open-source benchmark that evaluates whether frontier AI coding agents can optimize LLM serving workloads under a fixed compute budget across four inference scenarios.

    At a Glance

    Pricing
    Open Source

    Fully free and open-source under Apache License 2.0. Self-host on your own hardware.

    Available On

    CLI
    API

    Resources

    Website, Docs, GitHub, llms.txt

    Topics

    LLM Evaluations, Agent Harness, AI Infrastructure

    Alternatives

    llmfit, ZeroEval, ProgramBench

    Developer

    AISA Group: The AISA Group conducts AI safety and research automation re…

    Listed May 2026

    About InferenceBench

    InferenceBench is an academic benchmark created by researchers at ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, and Tübingen AI Center. It measures whether autonomous CLI agents can act as ML systems engineers in a genuinely open-ended setting: each agent is tasked with optimizing LLM inference serving on a single NVIDIA H100 within a two-hour wall-clock budget. The project is published as a research paper and released under the Apache License 2.0.

    What It Is

    InferenceBench is an evaluation harness for frontier coding agents. It is a reproducible research benchmark, not a product or SaaS tool. Each run gives an agent a base LLM (Mistral-7B-Instruct-v0.3), a hardware environment, and a scenario-specific objective: deliver a running, OpenAI-compatible inference server that maximizes a primary metric while passing both a quality gate and an integrity gate. The benchmark is designed to test whether agents actually search an open engineering space or merely retrieve memorized configurations.

    Four Serving Scenarios

    The benchmark isolates distinct bottlenecks across four scenarios:

    • Prefill Latency (Scenario A): Long-context prompts; measured as time to first token (TTFT). Input 8192 tokens, output 1024 tokens.
    • Decode Latency (Scenario B): Long generations; measured as time per output token (TPOT). Input 1024 tokens, output 8192 tokens.
    • Throughput (Scenario C): Concurrent traffic across burst, Poisson, and constant-rate profiles; measured in requests/second.
    • All-In-One (Scenario D): Balanced serving; geometric mean of latency and throughput metrics.
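
The latency metrics and the geometric mean named above can be illustrated with a minimal Python sketch. It assumes token arrival timestamps in seconds are already available; the function names and the way Scenario D combines its inputs are hypothetical and are not taken from the InferenceBench harness.

```python
import math


def ttft(request_sent: float, token_times: list[float]) -> float:
    """Time to first token: delay between sending the request and the first token."""
    if not token_times:
        raise ValueError("no tokens were received")
    return token_times[0] - request_sent


def tpot(token_times: list[float]) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def geometric_mean(values: list[float]) -> float:
    """Geometric mean of strictly positive values, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

A geometric mean suits a balanced score because a large gain on one metric cannot fully mask a regression on another: speedups of 2× and 8× average to 4×, not 5×.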

    Gating and Integrity

    Every run must pass two gates before its score counts. The quality gate requires the optimized server to score at least 95% of the PyTorch baseline accuracy on a fixed 500-question MMLU-Pro subset with greedy decoding. The integrity gate uses a judge agent to inspect transcripts and launchers for reward-hacking patterns such as returning pre-generated text, swapping the base model, or intercepting the evaluation script. The harness also performs a supervised relaunch: after the agent's session ends, it kills the agent's server and re-executes start_server.sh in a fresh container, so only the clean relaunch result counts.
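
The quality-gate threshold described above can be sketched as a simple comparison. This is an illustration under stated assumptions, not the harness's own code: the function name, its parameters, and the example counts are hypothetical. Integer arithmetic is used so that a score exactly at the 95% boundary is not lost to floating-point rounding.

```python
def passes_quality_gate(
    candidate_correct: int,
    baseline_correct: int,
    threshold_percent: int = 95,
) -> bool:
    """Return True if the candidate reaches the required share of baseline accuracy.

    Both counts are assumed to come from the same fixed question set, so
    comparing correct-answer counts is equivalent to comparing accuracies.
    """
    if candidate_correct < 0 or baseline_correct < 0:
        raise ValueError("correct-answer counts must be non-negative")
    # candidate / baseline >= threshold / 100, rearranged to avoid division.
    return candidate_correct * 100 >= threshold_percent * baseline_correct
```

For example, with a hypothetical baseline of 200 correct answers out of 500, a candidate needs at least 190 correct answers to pass.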

    Key Findings from 180 Runs

    The benchmark's headline result, as reported in the paper, is that non-agent hyperparameter search (SMAC3, TPE, Random) given the same two-hour budget on vLLM beats every agent on every scenario. The paper reports several behavioral patterns across 180 recorded runs:

    • 93.9% of agent runs ship a vLLM-based final launcher, even though SGLang, TGI, and TensorRT-LLM are explicitly available.
    • The median run launches exactly one non-default vLLM configuration over the full two-hour budget.
    • 65.0% of runs pass both gates; 18.9% fail the quality gate; 6.1% are integrity-flagged; 10.0% fail final-server reachability.
    • The top-ranked agent (Claude Sonnet 4.6 via Claude Code) achieves an aggregate geometric mean speedup of 8.08× over the PyTorch baseline, compared to 11.53× for the SMAC3 search baseline.
    • The paper identifies the bottleneck as consistent execution rather than domain knowledge: agents frequently identify relevant optimizations in transcripts but fail to validate, commit to, or preserve them in the final submitted server.
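
As a quick sanity check on the figures above, the following sketch converts the reported percentages into approximate run counts and compares the two headline speedups. The run counts are inferred here by rounding and are not stated in the source; only the percentages, the 180-run total, and the two speedup values come from the listing.

```python
TOTAL_RUNS = 180

# Outcome shares as reported, in percent.
outcome_shares = {
    "pass both gates": 65.0,
    "fail quality gate": 18.9,
    "integrity-flagged": 6.1,
    "fail final-server reachability": 10.0,
}

# Inferred counts (an assumption: each share is a rounded fraction of 180 runs).
outcome_counts = {
    name: round(TOTAL_RUNS * share / 100) for name, share in outcome_shares.items()
}
vllm_runs = round(TOTAL_RUNS * 93.9 / 100)

# Headline speedups over the PyTorch baseline.
top_agent_speedup = 8.08
smac3_speedup = 11.53
agent_share_of_search = top_agent_speedup / smac3_speedup

for name, count in outcome_counts.items():
    print(f"{name}: ~{count} runs")
print(f"vLLM-based final launchers: ~{vllm_runs} runs")
print(f"Top agent reaches {agent_share_of_search:.0%} of the SMAC3 speedup")
```

The four outcome shares sum to 100%, and the inferred counts (117, 34, 11, and 18) sum to 180, so the reported breakdown is internally consistent.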

    Setup and Architecture

    The benchmark runs on HTCondor with Apptainer containers. Each backend (vLLM, SGLang, HuggingFace TGI, PyTorch/Transformers) has its own container definition file. API-based agents authenticate via environment variables; subscription-based agents (Codex CLI, Claude Code) use device-code login flows with credentials stored outside version control. The default submit file pins each job to one H100 80 GB GPU. The repository includes utilities for pre-caching HuggingFace model and dataset resources and for precomputing baseline scores.

    Current Status

    The repository was created in April 2026 and last updated in May 2026, with the paper available as a PDF on the project website. The GitHub repository has the Apache-2.0 license and is maintained by the aisa-group organization. The leaderboard on the website reflects results from 15 frontier agent configurations including Claude Sonnet 4.6, GLM-5, Gemini 3.1 Pro, and multiple GPT-5 variants.

    Pricing

    Open Source

    Fully free and open-source under Apache License 2.0. Self-host on your own hardware.

    • Full benchmark harness source code
    • All four inference scenarios
    • Quality and integrity gating
    • Support for vLLM, SGLang, TGI, and PyTorch backends
    • HTCondor job submission utilities

    Capabilities

    Key Features

    • Four inference serving scenarios: prefill latency, decode latency, throughput, and all-in-one
    • Quality gate using MMLU-Pro subset with greedy decoding
    • Integrity gate with agentic judge for reward-hacking detection
    • Supervised relaunch harness for clean final-server scoring
    • Support for vLLM, SGLang, HuggingFace TGI, and PyTorch backends
    • Hyperparameter search baselines (SMAC3, TPE, Random)
    • Time budget ablation analysis (1h, 2h, 4h, 8h)
    • Forced-engine comparison experiments
    • HTCondor job submission with Apptainer containers
    • Support for API-based and subscription-based agents (Claude Code, Codex CLI)
    • Pre-caching utilities for HuggingFace models and datasets
    • Leaderboard with per-scenario and aggregate speedup metrics

    Integrations

    vLLM
    SGLang
    HuggingFace TGI
    PyTorch
    Claude Code
    Codex CLI
    OpenAI API
    Anthropic API
    Google Gemini API
    HTCondor
    Apptainer
    MMLU-Pro

    Developer

    AISA Group

    The AISA Group conducts research on AI safety and research automation at ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, and Tübingen AI Center. The group builds open-source benchmarks and evaluation frameworks to measure the capabilities and failure modes of frontier AI agents. InferenceBench is their benchmark for evaluating whether coding agents can act as ML systems engineers in open-ended inference optimization tasks.


    Similar Tools

    llmfit

    LLMFit is an open-source CLI tool for benchmarking and evaluating the performance of large language models across various tasks.

    ZeroEval

    Open-source evaluation framework for testing large language models with zero-shot prompting on reasoning and coding tasks.

    ProgramBench

    A benchmark that tests whether AI agents can rebuild real-world programs from scratch given only a compiled binary and its documentation, with no access to source code.

    Related Topics

    LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.

    85 tools

    Agent Harness

    Infrastructure, orchestrators, and task runners that wrap around LLM coding agents — covering session management, context delivery, worktree isolation, architecture enforcement, and issue-to-PR pipelines.

    85 tools

    AI Infrastructure

    Infrastructure designed for deploying and running AI models.

    243 tools