Inspect AI

Type: full-code · Vendor: UK AI Security Institute (AISI) · Language: Python · License: MIT · Status: active · Status in practice: mature · First released: 2024-05-01

Inspect AI is an evaluation framework that runs labelled datasets through solvers and scorers to measure language-model and agent performance.

Description. Inspect is an open-source framework for large language model evaluations from the UK AI Security Institute. An Inspect evaluation is a Task that combines a Dataset of labelled samples, a Solver that produces an answer for each sample, and a Scorer that evaluates the output. Solvers range from a single generate() call to a full tool-using agent, and scorers range from text comparison to model grading. Its model_graded_qa() scorer runs a separate grader model that sees only the question, answer, criterion, and instructions.

Agent loop shape. For each sample in a Task's Dataset, the Solver produces an answer, either through a single generate() call or through a full agent that uses tools over many turns. The Scorer then evaluates that output by text comparison or by model grading. With model_graded_qa(), a separate grader model receives only the question, answer, criterion, and instructions; several graders can optionally vote on the result.
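The loop described above can be sketched in plain Python. This is a toy model under stated assumptions, not Inspect's API: ToyTask, Sample, lookup_solver, the grader functions, and run are hypothetical names introduced only for illustration, and the "graders" are simple string checks standing in for grader models.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type aliases for illustration; not Inspect's real interfaces.
Solver = Callable[[str], str]              # question -> answer
Grader = Callable[[str, str, str], bool]   # (question, answer, criterion) -> verdict


@dataclass(frozen=True)
class Sample:
    question: str  # input shown to the solver
    target: str    # labelled answer, used here as the grading criterion


@dataclass(frozen=True)
class ToyTask:
    dataset: Sequence[Sample]
    solver: Solver
    graders: Sequence[Grader]


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True only if strictly more graders voted correct than incorrect."""
    counts = Counter(votes)
    return counts[True] > counts[False]


def run(task: ToyTask) -> float:
    """Solve and score every sample; return accuracy in [0, 1]."""
    if not task.dataset:
        raise ValueError("dataset must contain at least one sample")
    correct = 0
    for sample in task.dataset:
        answer = task.solver(sample.question)
        # Each grader sees only the question, the answer, and the criterion,
        # never the solver's intermediate steps.
        votes = [g(sample.question, answer, sample.target) for g in task.graders]
        correct += majority_vote(votes)
    return correct / len(task.dataset)


def lookup_solver(question: str) -> str:
    """Stand-in for a model call: answers from a fixed table."""
    return {"Capital of France?": "Paris", "2 + 2?": "5"}.get(question, "")


def exact_grader(question: str, answer: str, criterion: str) -> bool:
    return answer.strip() == criterion


def lenient_grader(question: str, answer: str, criterion: str) -> bool:
    return criterion.lower() in answer.lower()


demo_task = ToyTask(
    dataset=[Sample("Capital of France?", "Paris"), Sample("2 + 2?", "4")],
    solver=lookup_solver,
    graders=[exact_grader, lenient_grader, exact_grader],
)
accuracy = run(demo_task)  # one of the two samples is answered correctly
```

Passing the graders only the question, answer, and criterion mirrors the isolation described for model_graded_qa(); in a real evaluation the solver and graders would be model calls rather than table lookups and string checks.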

Primary use cases

  • running held-out evaluation datasets against models and agents
  • model-graded scoring of open-ended answers
  • agentic and reasoning task evaluation
  • safety and capability benchmarking

Key concepts

  • Task (eval-harness): The unit of an Inspect evaluation, binding a Dataset, a Solver, and a Scorer into a runnable, repeatable evaluation.
  • Solver (react): The component that produces an answer for each sample, ranging from a single generate() call to a full tool-using agent run over many turns.
  • Scorer (blind-grader-with-isolated-context): The component that evaluates a solver's output, from text comparison (match, pattern) to model grading (model_graded_qa) against a target rubric.
  • Sandbox environment (sandbox-isolation): A provisioned isolated runtime (for example, a Docker container) in which tools execute model-generated code, keeping untrusted execution out of the main evaluation process.
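The idea behind the sandbox environment, running model-generated code outside the main evaluation process, can be illustrated with a minimal standard-library sketch. It assumes a plain child process is an acceptable stand-in for demonstration; run_untrusted is a hypothetical helper, not part of Inspect, and a child process alone is not a security boundary the way a container is.

```python
import subprocess
import sys


def run_untrusted(code: str, timeout_s: float = 5.0) -> tuple[int, str]:
    """Run code in a separate Python process; return (exit code, stdout).

    Hypothetical helper for illustration only. It separates execution from
    the calling process but does NOT restrict file or network access.
    """
    try:
        result = subprocess.run(
            # -I starts Python in isolated mode (ignores environment
            # variables and the user site-packages directory).
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed on timeout; report a sentinel exit code.
        return (-1, "")
    return (result.returncode, result.stdout)


exit_code, output = run_untrusted("print(2 + 2)")
```

A crash, a non-zero exit, or an infinite loop in the generated code affects only the child process, so the evaluation itself can record the failure and move on to the next sample.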
