DeepEval
DeepEval is an open-source LLM evaluation framework that enables developers to build reliable evaluation pipelines and test any AI system with 50+ research-backed metrics.
At a Glance
Pricing
Free open-source LLM evaluation framework installable via pip.
Listed Mar 2026
About DeepEval
DeepEval is a comprehensive LLM evaluation framework used by leading AI companies including OpenAI, Google, Adobe, and Walmart. It provides a native Pytest integration that fits directly into CI/CD workflows, enabling unit-testing for LLMs with over 50 research-backed metrics. The framework supports single and multi-turn evaluations, multi-modal test cases (text, images, audio), synthetic data generation, and automatic prompt optimization.
- Unit-Testing for LLMs — Install via pip install deepeval and integrate natively with Pytest to run evaluations in your CI/CD pipeline (see the sketch after this list).
- LLM-as-a-Judge Metrics — Access 50+ research-backed metrics including G-Eval (chain-of-thought criteria scoring), DAG (directed acyclic graph for multi-step scoring), and QAG (question-answer generation scoring).
- Single and Multi-Turn Evaluations — Evaluate any use case and system architecture, including multi-turn conversational agents.
- Native Multi-Modal Support — Evaluate text, images, and audio with built-in multi-modal test cases.
- Synthetic Data Generation — Generate synthetic test datasets and simulate conversations when no test data is available.
- Auto-Optimize Prompts — Automatically optimize prompts without manual tweaking using DeepEval's built-in prompt optimization.
- Confident AI Cloud Platform — Use DeepEval on Confident AI for team-wide collaborative AI testing, regression testing, dataset management, observability, tracing, online monitoring, and human annotations.
- Wide Framework Integrations — Integrates with OpenAI, LangChain, LlamaIndex, LangGraph, Pydantic AI, CrewAI, Anthropic, and OpenAI Agents.
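To make the Pytest workflow above concrete, here is a minimal sketch of a DeepEval unit test that combines a built-in answer-relevancy metric with a custom G-Eval criterion. The input/output strings, threshold, and criteria text are illustrative placeholders, and the LLM-as-a-judge metrics assume a judge model is configured (for example via an OpenAI API key).

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def test_refund_policy_answer():
    # A single-turn test case: the prompt your system received and the output it produced.
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )

    # Built-in LLM-as-a-judge metric with a pass/fail threshold.
    relevancy = AnswerRelevancyMetric(threshold=0.7)

    # Custom G-Eval metric: plain-language criteria scored with chain-of-thought reasoning.
    correctness = GEval(
        name="Correctness",
        criteria="Check that the answer addresses the question without contradicting it.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )

    # Fails the Pytest test if any metric scores below its threshold.
    assert_test(test_case, [relevancy, correctness])
```

Because this is an ordinary Pytest test (the file name here is arbitrary), running it with deepeval test run or plain pytest is what lets the same check act as a gating step in a CI/CD pipeline.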
Pricing
Free Plan Available
Free open-source LLM evaluation framework installable via pip.
- 50+ evaluation metrics
- Pytest integration
- CI/CD support
- Multi-modal test cases
- Synthetic data generation
Confident AI Cloud
Cloud platform for team-wide collaborative AI testing built on top of DeepEval.
- Regression testing
- AI experiments
- Dataset management
- Observability and tracing
- Online monitoring
- Human annotations
- Team collaboration
Capabilities
Key Features
- 50+ research-backed LLM evaluation metrics
- G-Eval chain-of-thought scoring
- DAG directed acyclic graph evaluation
- QAG question-answer generation scoring
- Native Pytest integration
- CI/CD pipeline support
- Single and multi-turn evaluations
- Multi-modal test cases (text, images, audio)
- Synthetic data generation (see the sketch after this list)
- Conversation simulation
- Automatic prompt optimization
- LLM-as-a-Judge
- Regression testing
- Dataset management
- Observability and tracing
- Online monitoring
- Human annotations
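As a rough sketch of the synthetic data generation capability listed above, the snippet below uses DeepEval's Synthesizer to produce goldens from existing documents; the document path is a placeholder, and the call assumes a generation model is configured (for example an OpenAI API key) and that the generated goldens are returned by the call.

```python
from deepeval.synthesizer import Synthesizer

# Requires a generation model to be configured (e.g. OPENAI_API_KEY in the environment).
synthesizer = Synthesizer()

# Generate synthetic goldens (inputs plus expected outputs) grounded in your own documents.
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/faq.pdf"],  # placeholder path
)

for golden in goldens:
    print(golden.input)
```

Each golden carries an input (and, where applicable, an expected output and context) that can be wrapped into LLMTestCase objects and scored with the metrics shown earlier when no real test data is available.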
