Framework · Enterprise Platforms

DeepEval

DeepEval is an open-source Python framework that unit-tests the outputs of an LLM application against metric-based test cases, integrating with pytest.

Description

DeepEval defines test cases over an LLM application's inputs and outputs and runs metrics against them, returning pass or fail like a software test. Most of its metrics, including the G-Eval custom-criteria metric, are LLM-as-a-judge metrics that call a configurable model to score the output. It plugs into pytest so generation outputs are asserted in the same workflow as code tests.

Solution

DeepEval does not run an agent loop of its own. It is invoked as a test harness: a developer constructs test cases pairing an input, the actual LLM output, and optional expected context, attaches one or more metrics, and runs them under pytest. Each metric scores the case, and most metrics call a judge model to produce that score, after which the case passes or fails against a threshold.

Primary use cases

  • metric-based testing of LLM application outputs
  • custom-criteria evaluation with LLM judges
  • regression testing of prompts and model versions in CI

Open the full interactive page

Diagram, neighbourhood map, code examples, related patterns and full provenance.