DeepEval

Type: full-code · Vendor: Confident AI · Language: Python · License: Apache-2.0 · Status: active · Status in practice: mature · First released: 2023

DeepEval is an open-source Python framework for unit-testing the outputs of an LLM application: test cases are scored by metrics, and the tests run under pytest.

Description. DeepEval defines test cases over an LLM application's inputs and outputs and runs metrics against them, returning pass or fail like a software test. Most of its metrics, including the G-Eval custom-criteria metric, are LLM-as-a-judge metrics that call a configurable model to score the output. It plugs into pytest so generation outputs are asserted in the same workflow as code tests.

Agent loop shape. DeepEval does not run an agent loop of its own. It is invoked as a test harness: a developer constructs test cases that pair an input with the actual LLM output and, optionally, an expected output or supporting context, attaches one or more metrics, and runs them under pytest. Each metric scores the case, in most cases by calling a judge model, and the case passes or fails against that metric's threshold.
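The harness flow described above can be sketched in plain Python. This is a minimal illustration of the pattern only, not DeepEval's API: `Case`, `ThresholdMetric`, `assert_case`, and `stub_judge` are hypothetical names (DeepEval's own classes, for example `LLMTestCase` and `assert_test`, play comparable roles), and the judge is a deterministic stand-in for what would normally be a call to a configurable model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Case:
    """Hypothetical test case: one input/output pair plus optional references."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    retrieval_context: Optional[Sequence[str]] = None


# A judge maps a case to a score in [0, 1]. In a real setup this would call
# a judge model; here it is any callable, so the sketch runs offline.
Judge = Callable[[Case], float]


@dataclass
class ThresholdMetric:
    """Hypothetical metric: a named judge plus a pass threshold."""
    name: str
    judge: Judge
    threshold: float = 0.5

    def measure(self, case: Case) -> float:
        return self.judge(case)


def assert_case(case: Case, metrics: Sequence[ThresholdMetric]) -> None:
    """Score the case with every metric; raise AssertionError on any failure."""
    failures = []
    for metric in metrics:
        score = metric.measure(case)
        if score < metric.threshold:
            failures.append(
                f"{metric.name}: {score:.2f} < {metric.threshold:.2f}"
            )
    if failures:
        raise AssertionError("; ".join(failures))


def stub_judge(case: Case) -> float:
    """Deterministic stand-in for an LLM judge (assumption, for illustration)."""
    return 1.0 if "Paris" in case.actual_output else 0.0


def test_capital_question() -> None:
    """A pytest-style test: passes because the stub judge scores 1.0."""
    case = Case(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
    )
    metric = ThresholdMetric(name="relevancy", judge=stub_judge, threshold=0.7)
    assert_case(case, [metric])
```

Because `assert_case` raises `AssertionError`, pytest reports a failing metric like any other failed assertion, which is what lets output checks sit beside ordinary code tests.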

Primary use cases

metric-based testing of LLM application outputs
custom-criteria evaluation with LLM judges
regression testing of prompts and model versions in CI

flowchart TD
    fw["DeepEval"]
    fw --> p1["Agent Evaluator (core)"]
    fw --> p2["LLM-as-Judge (core)"]
    fw --> p3["Eval Harness (first-class)"]
    fw --> p4["Eval as Contract (first-class)"]
    fw --> p5["Dual Evaluation (Offline + Online) (supported)"]

Key concepts

Metric → llm-as-judge (docs) — A scoring rule (such as AnswerRelevancy, Faithfulness, Hallucination, or the custom G-Eval) applied to a test case; most metrics are LLM-as-a-judge and return a score against a pass threshold.
G-Eval → llm-as-judge (docs) — A research-backed metric where the user states free-form custom criteria and an LLM judge scores the output against them, used when no exact-match or built-in metric fits.
Synthesizer → eval-harness (docs) — A generator that produces synthetic test cases (goldens) from documents, prepared contexts, or from scratch, then evolves them to be more complex, used to bootstrap an evaluation dataset when none exists.
Faithfulness metric (docs) — A RAG metric that uses an LLM judge to check whether the generated output factually aligns with the retrieval context, flagging unsupported claims (hallucination) against retrieved evidence.
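To make the faithfulness idea concrete, here is a minimal sketch of one way such a score could be computed. It rests on assumptions not stated in this entry: the output has already been split into individual claims, the score is the fraction of claims supported by the retrieval context, and the judge is a naive substring check standing in for an LLM judge. `faithfulness_score` and `substring_judge` are hypothetical names, not DeepEval functions.

```python
from typing import Callable, Sequence

# A claim judge decides whether one claim is supported by the retrieved
# chunks. In practice this would be an LLM call; here it is any callable.
ClaimJudge = Callable[[str, Sequence[str]], bool]


def faithfulness_score(
    claims: Sequence[str],
    retrieval_context: Sequence[str],
    is_supported: ClaimJudge,
) -> float:
    """Fraction of claims the judge considers supported (assumed rule)."""
    if not claims:
        # No claims means nothing can be unsupported.
        return 1.0
    supported = sum(
        1 for claim in claims if is_supported(claim, retrieval_context)
    )
    return supported / len(claims)


def substring_judge(claim: str, context: Sequence[str]) -> bool:
    """Naive stand-in judge: supported if the claim appears in any chunk."""
    return any(claim.lower() in chunk.lower() for chunk in context)


context = ["The Eiffel Tower is in Paris and opened in 1889."]
claims = ["the eiffel tower is in paris", "it was built in 1999"]
score = faithfulness_score(claims, context, substring_judge)
```

With the stub judge, the first claim is found in the context and the second is not, so the score falls below a threshold such as 0.7 and the case would fail on an unsupported claim.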
