# DeepEval

> DeepEval is an open-source LLM evaluation framework that enables developers to build reliable evaluation pipelines and test any AI system with 50+ research-backed metrics.

DeepEval is a comprehensive LLM evaluation framework used by leading AI companies including OpenAI, Google, Adobe, and Walmart. It provides a native Pytest integration that fits directly into CI/CD workflows, enabling unit testing for LLMs with over 50 research-backed metrics. The framework supports single- and multi-turn evaluations, multi-modal test cases (text, images, audio), synthetic data generation, and automatic prompt optimization.

- **Unit-Testing for LLMs** — *Install via `pip install deepeval` and integrate natively with Pytest to run evaluations in your CI/CD pipeline (see the Pytest sketch under Examples below).*
- **LLM-as-a-Judge Metrics** — *Access 50+ research-backed metrics, including G-Eval (chain-of-thought criteria scoring), DAG (directed acyclic graph for multi-step scoring), and QAG (question-answer generation scoring).*
- **Single and Multi-Turn Evaluations** — *Evaluate any use case and system architecture, including multi-turn conversational agents.*
- **Native Multi-Modal Support** — *Evaluate text, images, and audio with built-in multi-modal test cases.*
- **Synthetic Data Generation** — *Generate synthetic test datasets and simulate conversations when no test data is available (see the synthesizer sketch under Examples below).*
- **Auto-Optimize Prompts** — *Automatically optimize prompts without manual tweaking using DeepEval's built-in prompt optimization.*
- **Confident AI Cloud Platform** — *Use DeepEval on Confident AI for team-wide collaborative AI testing, regression testing, dataset management, observability, tracing, online monitoring, and human annotations.*
- **Wide Framework Integrations** — *Integrates with OpenAI, LangChain, LlamaIndex, LangGraph, Pydantic AI, CrewAI, Anthropic, and OpenAI Agents.*

## Features

- 50+ research-backed LLM evaluation metrics
- G-Eval chain-of-thought scoring
- DAG (directed acyclic graph) evaluation
- QAG (question-answer generation) scoring
- Native Pytest integration
- CI/CD pipeline support
- Single- and multi-turn evaluations
- Multi-modal test cases (text, images, audio)
- Synthetic data generation
- Conversation simulation
- Automatic prompt optimization
- LLM-as-a-Judge
- Regression testing
- Dataset management
- Observability and tracing
- Online monitoring
- Human annotations

## Integrations

OpenAI, LangChain, LlamaIndex, LangGraph, Pydantic AI, CrewAI, Anthropic, OpenAI Agents, Pytest, Confident AI

## Platforms

Web, API, Developer SDK

## Pricing

Open source; free tier available

## Links

- Website: https://deepeval.com
- Documentation: https://deepeval.com/docs/getting-started
- Repository: https://github.com/confident-ai/deepeval
- EveryDev.ai: https://www.everydev.ai/tools/deepeval
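## Examples

A minimal Pytest-style unit test using G-Eval, following the quickstart pattern in DeepEval's documentation. The judge model defaults to OpenAI, so an `OPENAI_API_KEY` must be set; the question and criteria below are illustrative, and exact class signatures may vary across versions.

```python
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def test_correctness():
    # A single-turn test case: the model's actual output vs. a reference answer.
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris",
    )
    # G-Eval scores the output against natural-language criteria
    # using chain-of-thought LLM-as-a-judge scoring.
    correctness = GEval(
        name="Correctness",
        criteria="Check whether the actual output agrees factually with the expected output.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.7,  # the test fails if the judge scores below 0.7
    )
    assert_test(test_case, [correctness])
```

Run it with `deepeval test run test_correctness.py` (or plain `pytest`) to surface pass/fail results in CI.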
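For bulk scoring outside a test runner, the same metrics can be passed to the `evaluate()` entry point. A sketch using the built-in `AnswerRelevancyMetric`; the test case content is illustrative.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Score a batch of test cases in one call instead of one Pytest test each.
test_cases = [
    LLMTestCase(
        input="How do I reset my password?",
        actual_output="Go to Settings > Security and click 'Reset password'.",
    ),
]
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(threshold=0.5)])
```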
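When no test data exists, goldens (input/expected-output pairs) can be generated from source documents. This sketch assumes the `Synthesizer` API with a `generate_goldens_from_docs` method as described in DeepEval's docs; the file path is a placeholder and parameters may differ by version.

```python
from deepeval.synthesizer import Synthesizer

# Generate goldens from your own documents to bootstrap an evaluation dataset.
synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base/handbook.pdf"],  # placeholder path
)
# Generated goldens are collected on the synthesizer instance.
print(synthesizer.synthetic_goldens)
```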