Inspect AI
Type: full-code · Vendor: UK AI Security Institute (AISI) · Language: Python · License: MIT · Status: active · Status in practice: mature · First released: 2024-05-01
Inspect AI is an evaluation framework that runs labelled datasets through solvers and scorers to measure language-model and agent performance.
Description. Inspect is an open-source framework for large language model evaluations from the UK AI Security Institute. An Inspect evaluation is a Task that combines a Dataset of labelled samples, a Solver that produces an answer for each sample, and a Scorer that evaluates the output. Solvers range from a single generate() call to a full tool-using agent, and scorers range from text comparison to model grading. Its model_graded_qa() scorer runs a separate grader model that sees only the question, answer, criterion, and instructions.
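The Task, Dataset, Solver, Scorer composition described above can be sketched in plain Python. This is a dependency-free illustration of the shape only, not Inspect's API: the classes `Sample` and `Task`, the `run` function, and the toy solver and scorer are hypothetical stand-ins that borrow the vocabulary of the description.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    input: str   # the prompt shown to the solver
    target: str  # the labelled answer the scorer compares against


# A solver maps a sample's input to an answer;
# a scorer maps (answer, target) to pass or fail.
Solver = Callable[[str], str]
Scorer = Callable[[str, str], bool]


@dataclass
class Task:
    dataset: List[Sample]
    solver: Solver
    scorer: Scorer


def run(task: Task) -> float:
    """Run every sample through the solver and the scorer; return accuracy."""
    if not task.dataset:
        return 0.0
    correct = sum(
        task.scorer(task.solver(sample.input), sample.target)
        for sample in task.dataset
    )
    return correct / len(task.dataset)


def toy_solver(prompt: str) -> str:
    # Stand-in for a model call: answers one question, gives up on the rest.
    return "4" if "2 + 2" in prompt else "unknown"


def exact_match(answer: str, target: str) -> bool:
    # Simplest form of text-comparison scoring.
    return answer.strip() == target.strip()


task = Task(
    dataset=[
        Sample("What is 2 + 2?", "4"),
        Sample("What is the capital of France?", "Paris"),
    ],
    solver=toy_solver,
    scorer=exact_match,
)
accuracy = run(task)
print(accuracy)
```

Keeping the solver and scorer as interchangeable callables is what lets the same dataset be rerun against a different model or a stricter scorer without touching the samples.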
Agent loop shape. An evaluation is defined as a Task binding a Dataset, a Solver, and a Scorer. For each sample, the solver produces an answer, either through a single generate() call or through a full agent that uses tools over many turns. The scorer then evaluates the output by text comparison or by model grading. In the model-graded case, model_graded_qa() invokes a separate grader model that is given only the question, answer, criterion, and instructions, optionally with several graders voting.
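The isolation property of model grading can be illustrated with a small sketch: the grader prompt is built from exactly four fields, so nothing else from the solver's run can reach the grader. The template text, the function names, the "C"/"I" verdict labels, and the majority rule with ties counted as incorrect are all assumptions made for this illustration, not a description of how model_graded_qa() is implemented.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical template: these four fields are all the grader ever sees.
GRADER_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Criterion: {criterion}\n"
    "{instructions}"
)


def build_grader_prompt(
    question: str, answer: str, criterion: str, instructions: str
) -> str:
    """Build the grader's whole context from the four permitted fields."""
    return GRADER_TEMPLATE.format(
        question=question,
        answer=answer,
        criterion=criterion,
        instructions=instructions,
    )


# A grader maps a prompt to a verdict: "C" (correct) or "I" (incorrect).
Grader = Callable[[str], str]


def grade_by_vote(prompt: str, graders: List[Grader]) -> str:
    """Majority vote across graders; a tie is treated as incorrect (assumed)."""
    votes = Counter(grader(prompt) for grader in graders)
    return "C" if votes["C"] > votes["I"] else "I"


# Toy graders standing in for separate grader models.
def lenient(prompt: str) -> str:
    return "C"


def strict(prompt: str) -> str:
    return "C" if "Answer: Paris" in prompt else "I"


def harsh(prompt: str) -> str:
    return "I"


prompt = build_grader_prompt(
    question="What is the capital of France?",
    answer="Paris",
    criterion="The answer names Paris.",
    instructions="Reply with C if the answer meets the criterion, otherwise I.",
)
verdict = grade_by_vote(prompt, [lenient, strict, harsh])
print(verdict)
```

Because the prompt builder accepts only the four fields, any intermediate reasoning or tool output from the solver would have to be passed in deliberately, which makes leaks into the grader's context easy to spot in review.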
Primary use cases
- running held-out evaluation datasets against models and agents
- model-graded scoring of open-ended answers
- agentic and reasoning task evaluation
- safety and capability benchmarking
Key concepts
- Task → eval-harness (docs) — The unit of an Inspect evaluation, binding a Dataset, a Solver, and a Scorer into a runnable, repeatable evaluation.
- Solver → react (docs) — The component that produces an answer for each sample, ranging from a single generate() call to a full tool-using agent run over many turns.
- Scorer → blind-grader-with-isolated-context (docs) — The component that evaluates a solver's output, from text comparison (match, pattern) to model grading (model_graded_qa) against a target rubric.
- Sandbox environment → sandbox-isolation (docs) — A provisioned isolated runtime (Docker by default) in which tools execute model-generated code, keeping untrusted execution out of the main evaluation process.
Patterns this full-code implements
- ★Blind Grader with Isolated Context
Its model_graded_qa() scorer runs a separate grader model whose prompt template exposes only question, answer, criterion and instructions, so the grader judges the answer against the rubric without e…
- ★★Eval Harness
An Inspect evaluation is a Task binding a Dataset of labelled samples, a Solver that produces an answer per sample, and a Scorer that evaluates the output, so a held-out dataset is run against models…
- ★★ReAct
The agents module ships a general-purpose react agent solver that runs a tool loop, interleaving reasoning and tool calls until the model invokes the special submit() tool to signal completion.
- ★★Sandbox Isolation
When a solver's tools execute arbitrary model-generated code (shell or Python), Inspect provisions sandboxes — Docker containers by default — so untrusted code runs in an isolated environment rather…
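The ReAct pattern listed above, a tool loop that ends when the model calls submit(), can be sketched without any framework. Everything here is hypothetical and for illustration only: the `react_loop` function, the `(tool_name, argument)` action format, the scripted model, and the `add` tool are assumptions, and the tool runs in-process rather than in a sandbox.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed interface: the model reads the transcript so far and returns
# the next action as (tool_name, argument).
Action = Tuple[str, str]
Model = Callable[[List[str]], Action]
Tool = Callable[[str], str]


def react_loop(
    model: Model,
    tools: Dict[str, Tool],
    question: str,
    max_turns: int = 10,
) -> Optional[str]:
    """Alternate model actions and tool observations until submit is called."""
    transcript = [f"user: {question}"]
    for _ in range(max_turns):
        name, arg = model(transcript)
        if name == "submit":
            # The special submit action ends the loop and carries the answer.
            return arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        transcript.append(f"call: {name}({arg})")
        transcript.append(f"observation: {observation}")
    # Turn budget exhausted without a submission.
    return None


def scripted_model(transcript: List[str]) -> Action:
    # Stand-in policy: call the add tool once, then submit what it returned.
    last = transcript[-1]
    prefix = "observation: "
    if last.startswith(prefix):
        return ("submit", last[len(prefix):])
    return ("add", "2,3")


def add(arg: str) -> str:
    a, b = arg.split(",")
    return str(int(a) + int(b))


answer = react_loop(scripted_model, {"add": add}, "What is 2 + 3?")
print(answer)
```

The `max_turns` budget is a design choice of this sketch: it guarantees termination when a model never submits, at the cost of returning no answer for that sample.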