Inspect AI

Inspect AI is an evaluation framework that runs labelled datasets through solvers and scorers to measure language-model and agent performance.

Description

Inspect is an open-source framework for large language model evaluations from the UK AI Security Institute. An Inspect evaluation is a Task that combines a Dataset of labelled samples, a Solver that produces an answer for each sample, and a Scorer that evaluates the output. Solvers range from a single generate() call to a full tool-using agent, and scorers range from text comparison to model grading. Its model_graded_qa() scorer runs a separate grader model that sees only the question, answer, criterion, and instructions.

Solution

An evaluation is defined as a Task that binds a Dataset, a Solver, and a Scorer. For each sample, the solver produces an answer, either through a single generate() call or through a full agent that uses tools over many turns. The scorer then evaluates that output by text comparison or by model grading. With model grading, model_graded_qa() invokes a separate grader model that is given only the question, answer, criterion, and instructions; several graders can optionally vote on the grade.
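The flow described above can be sketched in plain Python. This is a conceptual illustration only and does not use Inspect's API: Sample, run_task, exact_match, model_graded, and the stub solver and graders are hypothetical names invented for this sketch. The graders are simple functions standing in for grader models, and a simple majority vote is assumed as the voting rule.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    input: str   # the question shown to the solver
    target: str  # the label, used here as the grading criterion


# Hypothetical type aliases for this sketch (not Inspect's types).
Solver = Callable[[str], str]
Grader = Callable[[str, str, str, str], str]  # returns "C" (correct) or "I" (incorrect)
Scorer = Callable[[Sample, str], str]


def exact_match(sample: Sample, answer: str) -> str:
    """Text-comparison scorer: case-insensitive equality with the target."""
    return "C" if answer.strip().lower() == sample.target.strip().lower() else "I"


def model_graded(graders: List[Grader], instructions: str) -> Scorer:
    """Model-graded scorer: each grader sees only question, answer,
    criterion, and instructions; the majority grade wins (assumed rule)."""
    def score(sample: Sample, answer: str) -> str:
        votes = [g(sample.input, answer, sample.target, instructions) for g in graders]
        return Counter(votes).most_common(1)[0][0]
    return score


def run_task(dataset: List[Sample], solver: Solver, scorer: Scorer) -> float:
    """Run every sample through the solver, score it, and return accuracy."""
    grades = [scorer(sample, solver(sample.input)) for sample in dataset]
    return sum(grade == "C" for grade in grades) / len(grades)


# Stub solver standing in for a model call or an agent loop.
def stub_solver(question: str) -> str:
    canned = {
        "What is the capital of France?": "The capital is Paris.",
        "What is 2 + 2?": "5",
    }
    return canned[question]


# Stub graders standing in for separate grader models.
def contains_grader(question: str, answer: str, criterion: str, instructions: str) -> str:
    return "C" if criterion.lower() in answer.lower() else "I"


def harsh_grader(question: str, answer: str, criterion: str, instructions: str) -> str:
    return "I"


dataset = [
    Sample("What is the capital of France?", "Paris"),
    Sample("What is 2 + 2?", "4"),
]

# An odd number of graders avoids ties in the majority vote.
voting_scorer = model_graded(
    [contains_grader, contains_grader, harsh_grader],
    instructions="Grade C if the answer meets the criterion, otherwise I.",
)

text_accuracy = run_task(dataset, stub_solver, exact_match)
graded_accuracy = run_task(dataset, stub_solver, voting_scorer)
```

The two scorers disagree on the first sample: exact text comparison rejects "The capital is Paris." while the graders accept it, which is the kind of open-ended answer model grading is meant to handle.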

Primary use cases

  • running held-out evaluation datasets against models and agents
  • model-graded scoring of open-ended answers
  • agentic and reasoning task evaluation
  • safety and capability benchmarking
