VitaBench
An open-source benchmark for evaluating LLM agents on versatile interactive tasks grounded in real-world life-serving applications like food delivery, in-store consumption, and online travel.
At a Glance
Fully free and open-source under the MIT License. Clone, use, modify, and distribute freely.
Listed May 2026
About VitaBench
VitaBench is an open-source benchmark developed by the Meituan LongCat Team that evaluates LLM-based agents on complex, multi-turn interactive tasks drawn from real-world daily service scenarios. It was accepted to ICLR 2026 and is freely available on GitHub under the MIT License, with the dataset hosted on Hugging Face.
What It Is
VitaBench (where "Vita" derives from the Latin word for "Life") is a research benchmark designed to stress-test LLM agents in realistic, life-serving simulation environments. It draws on three real-world application domains: food delivery, in-store consumption, and online travel agency (OTA) services. Across these domains it presents agents with 66 tools, 100 cross-scenario tasks, and 300 single-scenario tasks. Each task requires agents to reason across temporal and spatial dimensions, handle complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent across multi-turn conversations.
Benchmark Architecture
VitaBench is built through a two-stage pipeline:
- Stage I (Framework Design): Real life-serving scenarios are abstracted into a directed graph of simplified API tools with explicit pre- and post-conditions and inter-tool dependencies. Domain rules are encoded directly into tool structures, enabling cross-domain composition.
- Stage II (Task Creation): Tasks are constructed from anonymized real user profiles, composite instructions, and realistic environments augmented with curated distractors and transaction histories. Each task is iteratively validated with human checks to ensure clarity while preserving multiple valid solutions.
The benchmark includes databases covering 1,324 service providers, 6,942 products, and 334 transactions across all domains, with 27 write-type API tools, 33 read-type tools, and 6 general tools.
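The Stage I idea of tools forming a directed graph through their pre- and post-conditions can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Tool class, the tool names, and the condition labels are hypothetical and are not taken from the VitaBench codebase. A tool is treated as depending on another tool when that other tool provides a fact it requires.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Tool:
    """Hypothetical simplified API tool with explicit pre- and post-conditions."""
    name: str
    kind: str                           # "read", "write", or "general"
    requires: frozenset = frozenset()   # facts that must hold before the call
    provides: frozenset = frozenset()   # facts that hold after the call


def dependency_graph(tools):
    """Map each tool name to the names of the tools it depends on.

    Tool B depends on tool A when A provides at least one fact B requires.
    """
    return {
        tool.name: {
            other.name
            for other in tools
            if other.name != tool.name and other.provides & tool.requires
        }
        for tool in tools
    }


def can_call(tool, state):
    """A tool is callable once every one of its pre-conditions is in the state."""
    return tool.requires <= state


# Illustrative tools only; these names are not from the benchmark.
TOOLS = [
    Tool("search_restaurants", "read",
         provides=frozenset({"restaurant_id"})),
    Tool("get_menu", "read",
         requires=frozenset({"restaurant_id"}),
         provides=frozenset({"dish_id"})),
    Tool("create_order", "write",
         requires=frozenset({"restaurant_id", "dish_id"}),
         provides=frozenset({"order_id"})),
    Tool("pay_order", "write",
         requires=frozenset({"order_id"}),
         provides=frozenset({"paid"})),
]

graph = dependency_graph(TOOLS)
# graphlib expects node -> predecessors, which is exactly what we built.
call_order = list(TopologicalSorter(graph).static_order())
print(call_order)
```

Under these assumptions, a topological order of the graph gives one valid calling sequence, and can_call shows how a pre-condition check could reject a write call made before the facts it needs are available.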
Evaluation Methodology
VitaBench introduces a rubric-based sliding window evaluator designed to assess the many valid solution paths that arise in complex environments with stochastic interactions. Evaluation supports both single-domain and cross-domain configurations. For cross-domain evaluation, multiple domain names are joined with commas, and the corresponding domain environments are merged into one unified environment. Configurable parameters include the number of trials, concurrency, maximum steps, and language (Chinese or English).
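As a rough illustration of the configuration described above, the following Python sketch parses a comma-separated domain specification and merges per-domain tool lists into one unified list. It is not the VitaBench implementation: the domain identifiers, the EvalConfig fields and their defaults, and the tool names are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical domain identifiers and language codes, for illustration only.
KNOWN_DOMAINS = ("delivery", "instore", "ota")
LANGUAGES = ("chinese", "english")


def parse_domains(spec):
    """Split 'a,b' into ('a', 'b'), rejecting empty, unknown, or repeated names."""
    names = tuple(part.strip() for part in spec.split(",") if part.strip())
    if not names:
        raise ValueError("at least one domain is required")
    unknown = [name for name in names if name not in KNOWN_DOMAINS]
    if unknown:
        raise ValueError(f"unknown domains: {unknown}")
    if len(set(names)) != len(names):
        raise ValueError("each domain may appear only once")
    return names


@dataclass(frozen=True)
class EvalConfig:
    """Assumed shape of an evaluation run; defaults are placeholders."""
    domains: tuple
    num_trials: int = 1
    max_concurrency: int = 1
    max_steps: int = 100
    language: str = "english"

    def __post_init__(self):
        if self.language not in LANGUAGES:
            raise ValueError(f"language must be one of {LANGUAGES}")
        if min(self.num_trials, self.max_concurrency, self.max_steps) < 1:
            raise ValueError("trials, concurrency, and steps must be positive")

    @property
    def is_cross_domain(self):
        return len(self.domains) > 1


def merge_toolsets(domains, toolsets):
    """Union the tool lists of the chosen domains, keeping first occurrences."""
    merged, seen = [], set()
    for domain in domains:
        for tool in toolsets[domain]:
            if tool not in seen:
                seen.add(tool)
                merged.append(tool)
    return merged


# Illustrative tool names; shared general tools appear in several domains.
TOOLSETS = {
    "delivery": ["get_user_info", "search_restaurants", "create_order"],
    "instore": ["get_user_info", "search_shops", "book_table"],
    "ota": ["get_user_info", "search_hotels", "book_hotel"],
}

config = EvalConfig(domains=parse_domains("delivery,ota"), num_trials=4)
unified_tools = merge_toolsets(config.domains, TOOLSETS)
print(config.is_cross_domain, unified_tools)
```

Deduplicating while merging reflects one plausible design choice: tools shared across domains should appear only once in the unified environment, so the agent sees a single, larger tool set rather than repeated entries.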
Performance Results
According to the paper's evaluation, even the most advanced models achieve a success rate of only 32.5% on cross-scenario tasks and below 62% on single-scenario tasks. The leaderboard (last updated January 2026) has separate thinking and non-thinking model categories and includes models from Google, Anthropic, OpenAI, DeepSeek, Qwen, and others. The Meituan LongCat Team's own LongCat-Flash-Thinking-2601 model ranks third among thinking models on cross-scenario tasks.
Update: ICLR 2026 Acceptance and Growing Adoption
VitaBench was accepted to ICLR 2026 in January 2026. An updated version was released the same month with rectified datasets and tools, upgraded evaluation models, and updated metrics for both proprietary and open language models. The English version of the dataset was released in November 2025, enabling broader international use. The Meituan LongCat Team reports that VitaBench has been cited by Qwen3.5 and Seed2.0, and that the Qwen Team used it to evaluate Qwen3-Max-Thinking. The GitHub repository was most recently updated in February 2026 and has accumulated 133 stars and 13 forks.
Pricing
Open Source
- Full benchmark codebase
- Dataset access via Hugging Face
- CLI evaluation pipeline
- English and Chinese language support
- Public leaderboard
Capabilities
Key Features
- 66 API tools across food delivery, in-store, and OTA domains
- 100 cross-scenario tasks and 300 single-scenario tasks
- Rubric-based sliding window evaluator
- Multi-turn conversation support with dynamic user intent tracking
- Cross-domain environment composition
- English and Chinese language support
- Configurable evaluation pipeline (trials, concurrency, max steps)
- Re-evaluation of existing simulations
- Public leaderboard with thinking and non-thinking model categories
- Hugging Face dataset integration
