
Inspect AI

Type: full-code · Vendor: UK AI Security Institute (AISI) · Language: Python · License: MIT · Status: active · Status in practice: mature · First released: 2024-05-01


Inspect AI is an evaluation framework that runs labelled datasets through solvers and scorers to measure language-model and agent performance.

Description. Inspect is an open-source framework for large language model evaluations from the UK AI Security Institute. An Inspect evaluation is a Task that combines a Dataset of labelled samples, a Solver that produces an answer for each sample, and a Scorer that evaluates the output. Solvers range from a single generate() call to a full tool-using agent, and scorers range from text comparison to model grading. Its model_graded_qa() scorer runs a separate grader model that sees only the question, answer, criterion, and instructions.
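The Task, Dataset, Solver, and Scorer relationship described above can be sketched in plain Python. This is a minimal illustration of the shape only: the Sample, Task, run, toy_solver, and exact_match names below are hypothetical stand-ins written for this example, not Inspect's own classes, and the solver is a toy function rather than a model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    """One labelled sample (hypothetical stand-in, not Inspect's class)."""
    input: str   # prompt handed to the solver
    target: str  # labelled reference answer


Solver = Callable[[str], str]        # sample input -> answer
Scorer = Callable[[str, str], bool]  # (answer, target) -> correct?


@dataclass(frozen=True)
class Task:
    """Binds a dataset, a solver, and a scorer into one runnable unit."""
    dataset: Sequence[Sample]
    solver: Solver
    scorer: Scorer


def run(task: Task) -> float:
    """Return the fraction of samples the solver answers correctly."""
    if not task.dataset:
        return 0.0
    correct = sum(
        task.scorer(task.solver(sample.input), sample.target)
        for sample in task.dataset
    )
    return correct / len(task.dataset)


def toy_solver(prompt: str) -> str:
    # Stand-in for a model call: handles prompts of the form "a + b".
    left, right = prompt.split("+")
    return str(int(left) + int(right))


def exact_match(answer: str, target: str) -> bool:
    # Simplest text-comparison scorer: equality after trimming whitespace.
    return answer.strip() == target.strip()


task = Task(
    dataset=[Sample("2 + 2", "4"), Sample("3 + 5", "8"), Sample("1 + 1", "3")],
    solver=toy_solver,
    scorer=exact_match,
)
accuracy = run(task)  # the third sample carries a deliberately wrong label
```

Keeping the solver and scorer as independent callables is what lets the same dataset be rerun with a different solver (a single call or a multi-turn agent) or a different scorer without touching the rest of the task.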

Agent loop shape. An evaluation is defined as a Task binding a Dataset, a Solver, and a Scorer. For each sample, the solver produces an answer, which may be a single generate() call or a full agent that uses tools over many turns. The scorer then evaluates the output by text comparison or by model grading, where model_graded_qa() invokes a separate grader model given only question, answer, criterion, and instructions, optionally with several graders voting.
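The isolated-context grading step can be sketched as follows, assuming a majority vote across graders. Everything here is hypothetical and written for illustration: GraderInput, grade, the three toy graders, and the "C"/"I" verdict labels are not Inspect's API, and the graders are simple string checks standing in for model calls. The point is that the grader view is built from four fields only, so anything else in the transcript (such as a scratchpad) never reaches a grader.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GraderInput:
    """The only fields a grader is allowed to see (hypothetical type)."""
    question: str
    answer: str
    criterion: str
    instructions: str


Grader = Callable[[GraderInput], str]  # returns "C" (correct) or "I" (incorrect)


def grade(transcript: dict, criterion: str, instructions: str,
          graders: Sequence[Grader]) -> str:
    """Grade one answer by majority vote over graders with an isolated view."""
    # Copy across only the question and the final answer; tool calls,
    # scratchpads, and other transcript keys are deliberately dropped.
    view = GraderInput(
        question=transcript["question"],
        answer=transcript["answer"],
        criterion=criterion,
        instructions=instructions,
    )
    votes = Counter(grader(view) for grader in graders)
    # With an odd number of graders and two labels there is no tie.
    return votes.most_common(1)[0][0]


def strict(view: GraderInput) -> str:
    # Case-sensitive containment check.
    return "C" if view.criterion in view.answer else "I"


def lenient(view: GraderInput) -> str:
    # Case-insensitive containment check.
    return "C" if view.criterion.lower() in view.answer.lower() else "I"


def word_match(view: GraderInput) -> str:
    # Case-insensitive whole-word check, ignoring trailing punctuation.
    words = {w.strip(".,!?").lower() for w in view.answer.split()}
    return "C" if view.criterion.lower() in words else "I"


transcript = {
    "question": "What is the capital of France?",
    "answer": "It is paris.",
    "scratchpad": "solver-internal notes the graders must not see",
}
verdict = grade(
    transcript,
    criterion="Paris",
    instructions="Reply C if the answer meets the criterion, otherwise I.",
    graders=[strict, lenient, word_match],
)
```

Here the strict grader votes "I" because of the lowercase "paris", while the other two vote "C", so the majority verdict is "C".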

Primary use cases

running held-out evaluation datasets against models and agents
model-graded scoring of open-ended answers
agentic and reasoning task evaluation
safety and capability benchmarking

Patterns implemented by Inspect AI
Blind Grader with Isolated Context (core)
Eval Harness (core)
ReAct (supported)
Sandbox Isolation (supported)

Key concepts

Task → eval-harness: The unit of an Inspect evaluation, binding a Dataset, a Solver, and a Scorer into a runnable, repeatable evaluation.
Solver → react: The component that produces an answer for each sample, ranging from a single generate() call to a full tool-using agent run over many turns.
Scorer → blind-grader-with-isolated-context: The component that evaluates a solver's output, from text comparison (match, pattern) to model grading (model_graded_qa) against a target rubric.
Sandbox environment → sandbox-isolation: A provisioned isolated runtime (for example, a Docker container) in which tools execute model-generated code, keeping untrusted execution out of the main evaluation process.
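The idea behind the sandbox concept, keeping model-generated code out of the evaluation process, can be sketched with a child process. This is an assumption-laden illustration only: run_untrusted is a hypothetical helper, and a plain subprocess provides process separation and a timeout but none of the filesystem or network isolation a container-based sandbox gives. It is not how Inspect provisions its sandboxes.

```python
import subprocess
import sys


def run_untrusted(code: str, timeout_s: float = 5.0) -> tuple[int, str]:
    """Run code in a separate Python process; return (exit code, stdout).

    Hypothetical helper: a crash or hang in the child cannot take down
    the calling process, but this is NOT a security boundary.
    """
    try:
        proc = subprocess.run(
            # -I starts the child in isolated mode (ignores user site
            # packages and PYTHON* environment variables).
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Report a hang as a failed run with no output.
        return (-1, "")
    return (proc.returncode, proc.stdout)


exit_code, output = run_untrusted("print(sum(range(5)))")
crash_code, crash_output = run_untrusted("raise SystemExit(3)")
```

The caller only ever sees an exit code and captured output, which mirrors the design choice in the concept above: tool results flow back to the evaluation, while the execution itself stays elsewhere.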
