EveryDev.ai

    VitaBench

    LLM Evaluations

    An open-source benchmark for evaluating LLM agents on versatile interactive tasks grounded in real-world life-serving applications like food delivery, in-store consumption, and online travel.


    At a Glance

    Pricing
    Open Source

    Fully free and open-source under the MIT License. Clone, use, modify, and distribute freely.

    Available On

    CLI
    API

    Resources

    Website, Docs, GitHub, llms.txt

    Topics

    LLM Evaluations, Agent Frameworks, Academic Research

    Alternatives

    EnterpriseRAG-Bench, Amplifying, Arize AI
    Developer
    Meituan LongCat Team
    The Meituan LongCat Team builds AI research tools and benchm…

    Listed May 2026

    About VitaBench

    VitaBench is an open-source benchmark developed by the Meituan LongCat Team that evaluates LLM-based agents on complex, multi-turn interactive tasks drawn from real-world daily service scenarios. It was accepted to ICLR 2026 and is freely available on GitHub under the MIT License, with the dataset hosted on Hugging Face.

    What It Is

    VitaBench ("Vita" derives from the Latin word for "life") is a research benchmark designed to stress-test LLM agents in realistic, life-serving simulation environments. Unlike simpler benchmarks, it draws on three real-world application domains: food delivery, in-store consumption, and online travel agency (OTA) services. Across these domains it presents agents with 66 tools, 100 cross-scenario tasks, and 300 single-scenario tasks. Each task requires agents to reason across temporal and spatial dimensions, handle complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent across multi-turn conversations.

    Benchmark Architecture

    VitaBench is built through a two-stage pipeline:

    • Stage I (Framework Design): Real life-serving scenarios are abstracted into a directed graph of simplified API tools with explicit pre- and post-conditions and inter-tool dependencies. Domain rules are encoded directly into tool structures, enabling cross-domain composition.
    • Stage II (Task Creation): Tasks are constructed from anonymized real user profiles, composite instructions, and realistic environments augmented with curated distractors and transaction histories. Each task is iteratively validated with human checks to ensure clarity while preserving multiple valid solutions.
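
    The Stage I idea of tools as a directed graph with pre- and post-conditions can be illustrated with a minimal sketch. The tool names, condition labels, and helper functions below are hypothetical and are not taken from the VitaBench codebase; the sketch only shows how dependencies could follow from conditions.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    """Hypothetical tool node; names and condition labels are illustrative."""
    name: str
    preconditions: set[str] = field(default_factory=set)   # facts required before the call
    postconditions: set[str] = field(default_factory=set)  # facts established by the call


def build_dependency_graph(tools: list[Tool]) -> dict[str, set[str]]:
    """Add an edge A -> B when a postcondition of A meets a precondition of B."""
    graph: dict[str, set[str]] = {t.name: set() for t in tools}
    for producer in tools:
        for consumer in tools:
            if producer is not consumer and producer.postconditions & consumer.preconditions:
                graph[producer.name].add(consumer.name)
    return graph


def apply_tool(tool: Tool, state: set[str]) -> set[str]:
    """Return the new state, or raise if the tool's preconditions are not met."""
    missing = tool.preconditions - state
    if missing:
        raise ValueError(f"{tool.name} is missing preconditions: {sorted(missing)}")
    return state | tool.postconditions


# Assumed example tools for a food-delivery flow (not the benchmark's real tools).
SEARCH = Tool("search_restaurants", set(), {"restaurant_selected"})
ORDER = Tool("create_order", {"restaurant_selected"}, {"order_created"})
PAY = Tool("pay_order", {"order_created"}, {"order_paid"})

GRAPH = build_dependency_graph([SEARCH, ORDER, PAY])
```

    In this sketch, calling `apply_tool(PAY, set())` raises because no order exists yet, which is the kind of domain rule the listing describes as being encoded in the tool structure itself.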

    The benchmark includes databases covering 1,324 service providers, 6,942 products, and 334 transactions across all domains, with 27 write-type API tools, 33 read-type tools, and 6 general tools.

    Evaluation Methodology

    VitaBench introduces a rubric-based sliding window evaluator for robust assessment of diverse solution pathways under complex environments and stochastic interactions. Evaluation supports both single-domain and cross-domain configurations; a cross-domain run merges multiple domain environments into one unified environment, specified by joining the domain names with commas. Configurable parameters include the number of trials, concurrency, maximum steps, and language (Chinese or English).
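
    To make the sliding window idea concrete, here is a minimal sketch under stated assumptions: the trajectory is a list of turn strings, overlapping windows are judged one at a time, and a rubric item stays satisfied once any window meets it. The function names, window size, stride, and the keyword-matching judge are all hypothetical stand-ins (a real evaluator would presumably use an LLM judge); this is not VitaBench's implementation.

```python
from typing import Callable

# A judge decides whether one rubric item is met within one window of turns.
Judge = Callable[[list[str], str], bool]


def keyword_judge(window: list[str], rubric: str) -> bool:
    """Stand-in for an LLM judge: met if the rubric text appears in any turn."""
    return any(rubric.lower() in turn.lower() for turn in window)


def sliding_window_evaluate(
    trajectory: list[str],
    rubrics: list[str],
    judge: Judge = keyword_judge,
    window_size: int = 4,
    stride: int = 2,
) -> dict[str, bool]:
    """Slide overlapping windows over the trajectory, carrying rubric state forward."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    satisfied = {rubric: False for rubric in rubrics}
    start = 0
    while True:
        window = trajectory[start:start + window_size]
        for rubric in rubrics:
            # Only re-check rubric items that earlier windows did not satisfy.
            if not satisfied[rubric] and judge(window, rubric):
                satisfied[rubric] = True
        if start + window_size >= len(trajectory):
            break  # the last window reached the end of the trajectory
        start += stride
    return satisfied


trajectory = [
    "user: order noodles for lunch",
    "agent: searching nearby restaurants",
    "tool: order placed",
    "agent: anything else?",
    "user: also book a hotel",
    "tool: hotel booked",
]
result = sliding_window_evaluate(
    trajectory, ["order placed", "hotel booked", "taxi reserved"]
)
task_success = all(result.values())
```

    Here the first two rubric items are met in different windows and the third is never met, so `task_success` is `False`. Judging per rubric item rather than against one reference trajectory is what lets several valid solution paths pass.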

    Performance Results

    According to the paper's evaluation, even the most advanced models achieve only a 32.5% success rate on cross-scenario tasks and less than 62% on single-scenario tasks. The leaderboard (last updated January 2026) covers thinking and non-thinking model categories and includes models from Google, Anthropic, OpenAI, DeepSeek, Qwen, and others. The Meituan LongCat Team's own LongCat-Flash-Thinking-2601 model ranks third among thinking models on cross-scenario tasks.
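
    The listing does not say how success rates are aggregated over repeated trials. As a hedged illustration only, the sketch below assumes each task is run several times and reports three common aggregates: the mean success rate, the share of tasks solved at least once, and the share solved in every trial. The function name and result keys are hypothetical.

```python
def aggregate_success(results: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate per-task trial outcomes into three summary rates.

    `results` maps a task id to one boolean per trial (True = task solved).
    """
    if not results or any(len(trials) == 0 for trials in results.values()):
        raise ValueError("every task needs at least one trial")
    n_tasks = len(results)
    return {
        # Mean over tasks of the per-task fraction of successful trials.
        "avg": sum(sum(t) / len(t) for t in results.values()) / n_tasks,
        # Fraction of tasks solved in at least one trial.
        "any": sum(any(t) for t in results.values()) / n_tasks,
        # Fraction of tasks solved in every trial.
        "all": sum(all(t) for t in results.values()) / n_tasks,
    }


# Made-up outcomes for four tasks with two trials each (not benchmark data).
example = {
    "task_1": [True, True],
    "task_2": [True, False],
    "task_3": [False, False],
    "task_4": [False, False],
}
summary = aggregate_success(example)
```

    The gap between the "any" and "all" rates in such a summary indicates how much an agent's success depends on stochastic interaction rather than consistent capability.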

    Update: ICLR 2026 Acceptance and Growing Adoption

    VitaBench was accepted to ICLR 2026 in January 2026. An updated version was released the same month with rectified datasets and tools, upgraded evaluation models, and updated metrics for both proprietary and open language models. The English version of the dataset was released in November 2025, enabling broader international use. The Meituan LongCat Team reports that VitaBench has been cited by Qwen3.5 and Seed2.0, and that the Qwen Team used it to evaluate Qwen3-Max-Thinking. The repository's most recent push was in February 2026, and it has accumulated 133 stars and 13 forks on GitHub.


    Pricing

    Open Source

    Fully free and open-source under the MIT License. Clone, use, modify, and distribute freely.

    • Full benchmark codebase
    • Dataset access via Hugging Face
    • CLI evaluation pipeline
    • English and Chinese language support
    • Public leaderboard

    Capabilities

    Key Features

    • 66 API tools across food delivery, in-store, and OTA domains
    • 100 cross-scenario tasks and 300 single-scenario tasks
    • Rubric-based sliding window evaluator
    • Multi-turn conversation support with dynamic user intent tracking
    • Cross-domain environment composition
    • English and Chinese language support
    • Configurable evaluation pipeline (trials, concurrency, max steps)
    • Re-evaluation of existing simulations
    • Public leaderboard with thinking and non-thinking model categories
    • Hugging Face dataset integration
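
    The configurable pipeline and cross-domain composition listed above could be modeled roughly as follows. This is a sketch under assumptions: the class name, field names, defaults, and domain identifiers are hypothetical and do not reflect VitaBench's actual CLI flags or configuration schema. Only the comma-joined domain convention and the parameter kinds (trials, concurrency, max steps, language) come from the listing.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the three domains; the real names may differ.
KNOWN_DOMAINS = {"delivery", "instore", "ota"}
SUPPORTED_LANGUAGES = {"chinese", "english"}


def parse_domains(spec: str) -> tuple[str, ...]:
    """Split a comma-joined domain spec, trimming blanks and dropping repeats."""
    names = [part.strip() for part in spec.split(",") if part.strip()]
    return tuple(dict.fromkeys(names))  # de-duplicate while keeping order


@dataclass(frozen=True)
class EvalConfig:
    """Illustrative evaluation settings (all defaults are assumptions)."""
    domains: tuple[str, ...]
    num_trials: int = 1
    max_concurrency: int = 1
    max_steps: int = 100
    language: str = "english"

    def __post_init__(self) -> None:
        unknown = set(self.domains) - KNOWN_DOMAINS
        if not self.domains or unknown:
            raise ValueError(f"invalid domains: {sorted(unknown) or 'none given'}")
        if self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {self.language}")
        if min(self.num_trials, self.max_concurrency, self.max_steps) < 1:
            raise ValueError("trials, concurrency, and max steps must be >= 1")

    @property
    def is_cross_domain(self) -> bool:
        """More than one domain means the environments are merged into one."""
        return len(self.domains) > 1


cross = EvalConfig(domains=parse_domains("delivery, ota"), num_trials=4)
single = EvalConfig(domains=parse_domains("instore"), language="chinese")
```

    Validating the configuration up front means a misspelled domain or unsupported language fails before any simulation runs, which matters when trials are expensive.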

    Integrations

    Hugging Face Datasets
    OpenAI API
    Anthropic Claude API
    Google Gemini API
    DeepSeek API
    Qwen API
    Custom LLM endpoints via models.yaml


    Developer

    Meituan LongCat Team

    The Meituan LongCat Team builds AI research tools and benchmarks focused on real-world life-serving applications. Operating out of Meituan, one of China's largest on-demand service platforms, the team develops evaluation frameworks and language models grounded in practical daily-service scenarios. Their work spans agent benchmarking, LLM development, and open-source contributions to the AI research community.


    Similar Tools

    EnterpriseRAG-Bench

    An open-source benchmark dataset of 500,000+ enterprise documents and 500 questions for evaluating RAG systems on realistic company internal data.

    Amplifying

    AI benchmarking research studio that systematically measures the subjective choices AI systems make, such as tool recommendations, product picks, and build decisions.

    Arize AI

    Arize AI is an enterprise AI and agent engineering platform for development, observability, and evaluation of LLM applications, AI agents, and ML models in production.

