DeepEval

Type: full-code · Vendor: Confident AI · Language: Python · License: Apache-2.0 · Status: active · Status in practice: mature · First released: 2023

DeepEval is an open-source Python framework that unit-tests the outputs of an LLM application against metric-based test cases, integrating with pytest.

Description. DeepEval defines test cases over an LLM application's inputs and outputs and runs metrics against them, returning pass or fail like a software test. Most of its metrics, including the G-Eval custom-criteria metric, are LLM-as-a-judge metrics that call a configurable model to score the output. It plugs into pytest so generation outputs are asserted in the same workflow as code tests.

Agent loop shape. DeepEval does not run an agent loop of its own. It is invoked as a test harness: a developer constructs test cases that pair an input with the actual LLM output and, optionally, an expected output or context, attaches one or more metrics, and runs them under pytest. Each metric scores the case, in most cases by calling a judge model, and the case passes or fails depending on whether the score meets the metric's threshold.
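The flow described above (test case, metrics, judge score, threshold) can be sketched in plain Python. This is a minimal illustration of the shape of the harness and not DeepEval's API: the names EvalCase, Metric, run_case, assert_case, and word_overlap_judge are hypothetical, and the judge is a deterministic stub standing in for the model call that a real LLM-as-a-judge metric would make.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class EvalCase:
    """Hypothetical test case: an input, the actual output, optional context."""
    input: str
    actual_output: str
    context: Optional[List[str]] = None


@dataclass
class Metric:
    """Hypothetical metric: a judge function plus a pass threshold."""
    name: str
    judge: Callable[[EvalCase], float]
    threshold: float = 0.5


def word_overlap_judge(case: EvalCase) -> float:
    """Stub judge (assumption): fraction of input words echoed in the output.

    A real LLM-as-a-judge metric would call a model here instead.
    """
    wanted = set(case.input.lower().split())
    got = set(case.actual_output.lower().split())
    return len(wanted & got) / len(wanted) if wanted else 0.0


def run_case(case: EvalCase, metrics: List[Metric]) -> List[Tuple[str, float, bool]]:
    """Score the case with every metric and compare each score to its threshold."""
    results = []
    for metric in metrics:
        score = metric.judge(case)
        results.append((metric.name, score, score >= metric.threshold))
    return results


def assert_case(case: EvalCase, metrics: List[Metric]) -> None:
    """Raise AssertionError if any metric fails, so pytest reports a failed test."""
    failed = [name for name, _, passed in run_case(case, metrics) if not passed]
    assert not failed, f"metrics below threshold: {failed}"


def test_capital_question():
    # Collected by pytest like any other test function.
    case = EvalCase(
        input="capital of france",
        actual_output="the capital of france is paris",
    )
    assert_case(case, [Metric("relevancy", word_overlap_judge, threshold=0.7)])
```

Saved in a file such as test_llm_outputs.py (a hypothetical name), this runs under a plain pytest invocation. Because the judge is just a callable, the stub could be swapped for a model-backed scorer without touching the pass/fail logic.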

Primary use cases

  • metric-based testing of LLM application outputs
  • custom-criteria evaluation with LLM judges
  • regression testing of prompts and model versions in CI

Key concepts

  • Metric (llm-as-judge): A scoring rule (such as AnswerRelevancy, Faithfulness, Hallucination, or the custom G-Eval) applied to a test case; most metrics are LLM-as-a-judge and return a score against a pass threshold.
  • G-Eval (llm-as-judge): A research-backed metric where the user states free-form custom criteria and an LLM judge scores the output against them, used when no exact-match or built-in metric fits.
  • Synthesizer (eval-harness): A generator that produces synthetic test cases (goldens) from documents, prepared contexts, or from scratch, then evolves them to be more complex, used to bootstrap an evaluation dataset when none exists.
  • Faithfulness (metric): A RAG metric that uses an LLM judge to check whether the generated output factually aligns with the retrieval context, flagging unsupported claims (hallucination) against retrieved evidence.
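As a rough illustration of the Faithfulness idea, the sketch below scores an output as the share of its claims that the retrieval context supports. It is written under stated assumptions and is not DeepEval's implementation: split_claims, substring_support, and faithfulness_score are hypothetical names, claims are approximated by sentences, and a case-insensitive substring check stands in for the LLM judge that would normally extract and verify claims.

```python
from typing import Callable, List


def split_claims(output: str) -> List[str]:
    """Stub claim extraction (assumption): treat each sentence as one claim."""
    return [s.strip() for s in output.split(".") if s.strip()]


def substring_support(claim: str, retrieval_context: List[str]) -> bool:
    """Stub judge (assumption): a claim counts as supported if some chunk contains it."""
    return any(claim.lower() in chunk.lower() for chunk in retrieval_context)


def faithfulness_score(
    output: str,
    retrieval_context: List[str],
    is_supported: Callable[[str, List[str]], bool] = substring_support,
) -> float:
    """Return supported claims divided by total claims.

    An output with no claims scores 1.0 here, since nothing in it is unsupported;
    that convention is a choice made for this sketch.
    """
    claims = split_claims(output)
    if not claims:
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, retrieval_context))
    return supported / len(claims)


context = ["Paris is the capital of France and sits on the Seine."]
answer = "Paris is the capital of France. Paris has ten million residents."
score = faithfulness_score(answer, context)  # one of two claims is supported
```

A score like this would then be compared against the metric's pass threshold, so an answer that adds claims the retrieved evidence does not back up fails the test case.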
