VitaBench

Name: VitaBench
Availability: OnlineOnly
Author: Meituan LongCat Team

An open-source benchmark for evaluating LLM agents on versatile interactive tasks grounded in real-world life-serving applications like food delivery, in-store consumption, and online travel.


At a Glance

Pricing

Open Source

Fully free and open-source under the MIT License. Clone, use, modify, and distribute freely.

Available On

CLI

API

Meituan LongCat Team: The Meituan LongCat Team builds AI research tools and benchm…

Listed May 2026

About VitaBench

VitaBench is an open-source benchmark developed by the Meituan LongCat Team that evaluates LLM-based agents on complex, multi-turn interactive tasks drawn from real-world daily service scenarios. It was accepted to ICLR 2026 and is freely available on GitHub under the MIT License, with the dataset hosted on Hugging Face.

What It Is

VitaBench (where "Vita" derives from the Latin word for "life") is a research benchmark designed to stress-test LLM agents in realistic, life-serving simulation environments. It draws on three real-world application domains: food delivery, in-store consumption, and online travel agency (OTA) services. Across these domains it presents agents with 66 tools, 100 cross-scenario tasks, and 300 single-scenario tasks. Each task requires agents to reason across temporal and spatial dimensions, handle complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent across multi-turn conversations.

Benchmark Architecture

VitaBench is built through a two-stage pipeline:

Stage I (Framework Design): Real life-serving scenarios are abstracted into a directed graph of simplified API tools with explicit pre- and post-conditions and inter-tool dependencies. Domain rules are encoded directly into tool structures, enabling cross-domain composition.
Stage II (Task Creation): Tasks are constructed from anonymized real user profiles, composite instructions, and realistic environments augmented with curated distractors and transaction histories. Each task is iteratively validated with human checks to ensure clarity while preserving multiple valid solutions.
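
The Stage I idea of tools linked by pre- and post-conditions can be illustrated with a minimal sketch. It is not taken from the VitaBench codebase: the `Tool` class, the fact names, and the three food-delivery tools are hypothetical, chosen only to show how inter-tool dependencies form a directed graph and constrain valid call orders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    """A simplified API tool with explicit pre- and post-conditions (hypothetical model)."""
    name: str
    requires: frozenset = field(default_factory=frozenset)  # facts that must hold before the call
    provides: frozenset = field(default_factory=frozenset)  # facts that hold after the call


def dependency_edges(tools):
    """Return sorted directed edges (a, b) where tool a provides a fact that tool b requires."""
    return sorted(
        (a.name, b.name)
        for a in tools
        for b in tools
        if a.name != b.name and a.provides & b.requires
    )


def is_valid_call_sequence(tools, sequence, initial_facts=()):
    """Check that every call in the sequence has its pre-conditions met when it runs."""
    by_name = {t.name: t for t in tools}
    facts = set(initial_facts)
    for name in sequence:
        tool = by_name[name]
        if not tool.requires <= facts:
            return False
        facts |= tool.provides
    return True


# Hypothetical tools for a food-delivery flow; the names are illustrative only.
TOOLS = [
    Tool("search_restaurants", provides=frozenset({"restaurant_id"})),
    Tool("get_menu", requires=frozenset({"restaurant_id"}), provides=frozenset({"item_id"})),
    Tool("create_order", requires=frozenset({"restaurant_id", "item_id"}),
         provides=frozenset({"order_id"})),
]

if __name__ == "__main__":
    print(dependency_edges(TOOLS))
    print(is_valid_call_sequence(TOOLS, ["search_restaurants", "get_menu", "create_order"]))
    print(is_valid_call_sequence(TOOLS, ["create_order"]))
```

In this toy model an agent cannot place an order before it has found a restaurant and an item, which is the kind of domain rule the text describes as encoded in the tool structure.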

The benchmark includes databases covering 1,324 service providers, 6,942 products, and 334 transactions across all domains, with 27 write-type API tools, 33 read-type tools, and 6 general tools.

Evaluation Methodology

VitaBench introduces a rubric-based sliding window evaluator that enables robust assessment of diverse solution pathways in complex environments and stochastic interactions. Evaluation supports both single-domain and cross-domain configurations. For cross-domain evaluation, the domain names are joined with commas and the corresponding domain environments are merged into a single unified environment. The framework supports configurable parameters including number of trials, concurrency, maximum steps, and language (Chinese or English).
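
The listing does not describe how the sliding window evaluator works internally, so the following is only a rough sketch of the general technique. It assumes the trajectory is split into overlapping windows of consecutive turns, that rubric state is carried across windows, and that a task passes only when every rubric item is met. The keyword-matching `keyword_judge` is a stand-in for the model-based judge a real evaluator would use; the function names, window size, and stride are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def sliding_windows(turns: Sequence[str], size: int, stride: int) -> List[Sequence[str]]:
    """Split a trajectory into overlapping windows of consecutive turns."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(turns) <= size:
        return [turns]
    starts = list(range(0, len(turns) - size + 1, stride))
    if starts[-1] != len(turns) - size:
        starts.append(len(turns) - size)  # make sure the final turns are covered
    return [turns[s:s + size] for s in starts]


def evaluate(
    turns: Sequence[str],
    rubrics: Sequence[str],
    judge: Callable[[str, Sequence[str]], bool],
    size: int = 4,
    stride: int = 2,
) -> Dict[str, bool]:
    """Carry rubric state across windows: an item stays satisfied once any window meets it."""
    state = {rubric: False for rubric in rubrics}
    for window in sliding_windows(turns, size, stride):
        for rubric in rubrics:
            if not state[rubric]:
                state[rubric] = judge(rubric, window)
    return state


def keyword_judge(rubric: str, window: Sequence[str]) -> bool:
    """Stand-in for an LLM judge: a rubric is met if its text appears in some turn."""
    return any(rubric.lower() in turn.lower() for turn in window)


if __name__ == "__main__":
    trajectory = [
        "user: book a table for two",
        "agent: searching restaurants",
        "agent: reserved table at 7pm",
        "user: also order delivery",
        "agent: order placed",
        "user: thanks",
    ]
    state = evaluate(trajectory, ["reserved table", "order placed", "refund"], keyword_judge)
    print(state)
    print("task passed:", all(state.values()))  # assumed all-or-nothing scoring
```

Because each rubric item is checked per window rather than against one fixed reference trajectory, different solution paths that reach the same outcomes can receive the same verdict.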

Performance Results

According to the paper's comprehensive evaluation, even the most advanced models achieve a success rate of only 32.5% on cross-scenario tasks and below 62% on single-scenario tasks. The leaderboard (last updated January 2026) covers both thinking and non-thinking model categories, with models from Google, Anthropic, OpenAI, DeepSeek, Qwen, and others evaluated. The Meituan LongCat Team's own LongCat-Flash-Thinking-2601 model ranks third among thinking models on cross-scenario tasks.

Update: ICLR 2026 Acceptance and Growing Adoption

VitaBench was accepted to ICLR 2026 in January 2026. An updated version was released the same month with rectified datasets and tools, upgraded evaluation models, and updated metrics for both proprietary and open language models. The English version of the dataset was released in November 2025, enabling broader international use. The Meituan LongCat Team reports that VitaBench has been cited by Qwen3.5 and Seed2.0, and that the Qwen Team used it to evaluate Qwen3-Max-Thinking. The most recent push to the repository was in February 2026, and the repository has accumulated 133 stars and 13 forks on GitHub.

Pricing

Open Source

Full benchmark codebase
Dataset access via Hugging Face
CLI evaluation pipeline
English and Chinese language support
Public leaderboard

Capabilities

Key Features

66 API tools across food delivery, in-store, and OTA domains
100 cross-scenario tasks and 300 single-scenario tasks
Rubric-based sliding window evaluator
Multi-turn conversation support with dynamic user intent tracking
Cross-domain environment composition
English and Chinese language support
Configurable evaluation pipeline (trials, concurrency, max steps)
Re-evaluation of existing simulations
Public leaderboard with thinking and non-thinking model categories
Hugging Face dataset integration
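
The configurable pipeline and cross-domain composition listed above can be pictured with a small sketch. It does not reproduce the VitaBench CLI or its internal classes: `EvalConfig`, `parse_domains`, `merged_tools`, the default values, the domain keys, and the tool names are all hypothetical. The only details taken from the listing are that domains are joined with commas, that trials, concurrency, and maximum steps are configurable, and that the language is Chinese or English.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Hypothetical per-domain tool registries; the real tool names are not given in this listing.
DOMAIN_TOOLS: Dict[str, List[str]] = {
    "delivery": ["search_restaurants", "create_delivery_order"],
    "instore": ["search_shops", "book_table"],
    "ota": ["search_hotels", "book_hotel"],
}


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the evaluation parameters named in the listing."""
    domains: Tuple[str, ...]
    num_trials: int = 1
    max_concurrency: int = 1
    max_steps: int = 100
    language: str = "english"

    def __post_init__(self):
        if self.language not in ("chinese", "english"):
            raise ValueError(f"unsupported language: {self.language}")
        if min(self.num_trials, self.max_concurrency, self.max_steps) < 1:
            raise ValueError("trials, concurrency, and max steps must be >= 1")


def parse_domains(spec: str) -> Tuple[str, ...]:
    """Split a comma-joined domain spec such as 'delivery,ota' and validate each name."""
    names = tuple(part.strip() for part in spec.split(",") if part.strip())
    unknown = [name for name in names if name not in DOMAIN_TOOLS]
    if not names or unknown:
        raise ValueError(f"invalid domain spec {spec!r}; unknown: {unknown}")
    return names


def merged_tools(domains: Sequence[str]) -> List[str]:
    """Build one unified tool list from several domains, keeping first-seen order."""
    seen: Dict[str, str] = {}
    for domain in domains:
        for tool in DOMAIN_TOOLS[domain]:
            seen.setdefault(tool, domain)
    return list(seen)


if __name__ == "__main__":
    config = EvalConfig(domains=parse_domains("delivery,ota"), num_trials=4)
    print(config)
    print(merged_tools(config.domains))
```

A single-domain run is the special case where the spec contains one name, so the same code path covers both configurations.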

Integrations

Hugging Face Datasets

OpenAI API

Anthropic Claude API

Google Gemini API

DeepSeek API

Qwen API

Custom LLM endpoints via models.yaml
