DeepEval
Type: full-code · Vendor: Confident AI · Language: Python · License: Apache-2.0 · Status: active · Status in practice: mature · First released: 2023
DeepEval is an open-source Python framework that unit-tests the outputs of an LLM application against metric-based test cases, integrating with pytest.
Description. DeepEval defines test cases over an LLM application's inputs and outputs and runs metrics against them, returning pass or fail like a software test. Most of its metrics, including the G-Eval custom-criteria metric, are LLM-as-a-judge metrics that call a configurable model to score the output. It plugs into pytest so generation outputs are asserted in the same workflow as code tests.
Agent loop shape. DeepEval does not run an agent loop of its own. It is invoked as a test harness: a developer constructs test cases that pair an input with the actual LLM output and, optionally, an expected output or retrieval context, attaches one or more metrics, and runs them under pytest. Each metric scores the case, in most cases by calling a judge model, and the case passes or fails depending on whether that score meets the metric's threshold.
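The harness shape described above can be illustrated with a minimal, stdlib-only sketch. It does not use DeepEval's API: `OutputCase`, `ThresholdMetric`, `keyword_overlap_judge`, and `run_case` are hypothetical names invented for illustration, and the stub judge stands in for the model call a real LLM-as-a-judge metric would make.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class OutputCase:
    """Hypothetical stand-in for a test case: input, actual output, optional context."""
    input: str
    actual_output: str
    retrieval_context: Optional[List[str]] = None


@dataclass
class ThresholdMetric:
    """Hypothetical metric: a judge returns a score in [0, 1] and the case
    passes when the score is at or above the threshold."""
    name: str
    judge: Callable[[OutputCase], float]
    threshold: float = 0.5

    def passes(self, case: OutputCase) -> bool:
        return self.judge(case) >= self.threshold


def keyword_overlap_judge(case: OutputCase) -> float:
    """Stub judge: fraction of input words that reappear in the output.
    A real judge would call a model here instead."""
    asked = set(case.input.lower().split())
    answered = set(case.actual_output.lower().split())
    return len(asked & answered) / len(asked) if asked else 0.0


def run_case(case: OutputCase, metrics: List[ThresholdMetric]) -> List[str]:
    """Return the names of the metrics that failed; an empty list means the case passed."""
    return [m.name for m in metrics if not m.passes(case)]


def test_refund_answer():
    # pytest collects this function like any other test in the suite.
    case = OutputCase(
        input="what is the refund window",
        actual_output="the refund window is 30 days",
    )
    metric = ThresholdMetric("relevancy", keyword_overlap_judge, threshold=0.7)
    assert run_case(case, [metric]) == []
```

Because the check is an ordinary `assert` inside a `test_` function, pytest reports it next to the project's code tests. DeepEval's documentation names the corresponding pieces `LLMTestCase`, metric classes such as `AnswerRelevancyMetric`, and `assert_test`; the sketch deliberately avoids calling them so that it runs without the library or a judge-model API key.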
Primary use cases
- metric-based testing of LLM application outputs
- custom-criteria evaluation with LLM judges
- regression testing of prompts and model versions in CI
Key concepts
- Metric → llm-as-judge (docs) — A scoring rule (such as AnswerRelevancy, Faithfulness, Hallucination, or the custom G-Eval) applied to a test case; most metrics are LLM-as-a-judge and return a score against a pass threshold.
- G-Eval → llm-as-judge (docs) — A research-backed metric where the user states free-form custom criteria and an LLM judge scores the output against them, used when no exact-match or built-in metric fits.
- Synthesizer → eval-harness (docs) — A generator that produces synthetic test cases (goldens) from documents, prepared contexts, or from scratch, then evolves them to be more complex, used to bootstrap an evaluation dataset when none exists.
- Faithfulness metric (docs) — A RAG metric that uses an LLM judge to check whether the generated output factually aligns with the retrieval context, flagging unsupported claims (hallucination) against retrieved evidence.
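The faithfulness idea in the last item can be sketched as a ratio of supported claims to total claims. This is a toy illustration under stated assumptions, not DeepEval's implementation: claim extraction is a naive sentence split, the verdict is a verbatim substring check where a real metric would ask a judge model, and all function names are hypothetical.

```python
from typing import Callable, List


def split_claims(output: str) -> List[str]:
    """Naive claim extraction: one claim per sentence (assumption for illustration)."""
    return [s.strip() for s in output.split(".") if s.strip()]


def substring_judge(claim: str, context: List[str]) -> bool:
    """Stub verdict: a claim is supported only if it appears verbatim in a context chunk."""
    return any(claim.lower() in chunk.lower() for chunk in context)


def unsupported_claims(
    output: str,
    context: List[str],
    judge: Callable[[str, List[str]], bool] = substring_judge,
) -> List[str]:
    """Claims in the output that the retrieval context does not back up."""
    return [c for c in split_claims(output) if not judge(c, context)]


def faithfulness_score(
    output: str,
    context: List[str],
    judge: Callable[[str, List[str]], bool] = substring_judge,
) -> float:
    """Supported claims divided by total claims; 1.0 when the output makes no claims."""
    claims = split_claims(output)
    if not claims:
        return 1.0
    supported = len(claims) - len(unsupported_claims(output, context, judge))
    return supported / len(claims)


context = ["The warranty lasts two years. Returns are accepted within 30 days."]
output = "The warranty lasts two years. Shipping is free worldwide."

score = faithfulness_score(output, context)    # one of two claims is supported
flagged = unsupported_claims(output, context)  # the shipping claim has no evidence
```

Swapping `substring_judge` for a function that queries a model would turn this into an LLM-as-a-judge check while leaving the scoring arithmetic unchanged.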
Patterns this full-code implements
- ★Agent Evaluator
DeepEval is a dedicated testing harness whose sole job is running metric-based test cases against another LLM app's outputs, integrating with pytest so agent outputs are unit-tested like code; note i…
- ★★LLM-as-Judge
Most DeepEval metrics, including the G-Eval custom-criteria metric, are LLM-as-a-judge metrics that call a configurable judge model to score the output against rubric criteria rather than against an…
- ★★Eval Harness
DeepEval evaluates an LLM application end-to-end against a dataset of test cases (goldens) with deepeval test run, and its Synthesizer can build that held-out dataset from documents, so successive ap…
- ★★Eval as Contract
DeepEval evals are written as pytest tests and run with deepeval test run in the pipeline; a failing metric fails the build, so a release is gated on the eval suite passing.
- ★Dual Evaluation (Offline + Online)
Beyond the offline test-run track, DeepEval can run online evals that monitor production traces, spans, and threads, so the same metrics gate before deploy and observe live traffic after; the online…
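The gating behaviour described under Eval as Contract can be sketched as a function that turns metric scores into a process exit code, since a non-zero exit is what makes a CI step fail. This is a minimal sketch with hypothetical names (`failing_metrics`, `gate`) and made-up scores; it does not reproduce how `deepeval test run` reports results.

```python
from typing import Dict, List


def failing_metrics(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of metrics whose score falls below the required threshold, sorted for stable output."""
    return sorted(name for name, score in scores.items() if score < thresholds[name])


def gate(scores: Dict[str, float], thresholds: Dict[str, float]) -> int:
    """Return 0 when every metric meets its threshold and 1 otherwise."""
    failures = failing_metrics(scores, thresholds)
    for name in failures:
        print(f"FAIL {name}: {scores[name]:.2f} < {thresholds[name]:.2f}")
    return 1 if failures else 0


# Illustrative values only; real scores would come from an eval run.
thresholds = {"faithfulness": 0.7, "relevancy": 0.7}
scores = {"faithfulness": 0.4, "relevancy": 0.8}

exit_code = gate(scores, thresholds)
```

A pipeline script would pass `exit_code` to `sys.exit`, so the build stops when any metric regresses. The same `failing_metrics` check could be applied to scores computed on production traffic, where it would raise an alert instead of blocking a release.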