
Real-World Agent Trial

Find out what an agent really can and cannot do by watching it work through real, open-ended tasks under field conditions.

Description

Try a candidate agent on real, open-ended tasks before you give it to production users. The tasks are realistic: office work, strategic games, and problems that mix text, images, and other inputs. The trial is observational. People watch how the agent handles complete tasks under real conditions and log where it succeeds and where it fails. From those logs they write down what the agent can and cannot do, backed by what was actually observed rather than by benchmark numbers. The trial complements per-capability and whole-system tests. It does not replace them. Its value is in surfacing failures that synthetic test sets miss.

When to apply

Use this before you commit to a wider rollout of a non-trivial agent. It helps most when synthetic tests feel untrustworthy, or when people disagree about what the agent can actually do. Do not apply it to narrow agents whose behaviour an existing test set already covers fully. Skip it too for one-shot generation systems, where there is no run to watch.

What it involves

  • Curate realistic tasks
  • Instrument the agent for observation
  • Run trials with human observation
  • Extract grounded capability and constraint claims
  • Feed findings back into evals and rollout gating
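
The steps above can be sketched in code. What follows is a minimal illustration, not a prescribed implementation: the Observation record, the extract_claims and rollout_blockers helpers, the min_runs threshold, and the sample log entries are all hypothetical names and values invented for this example. It shows one way observer notes could be grouped into capability and constraint claims that stay tied to what was seen, and then used as a rollout gate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One observer note from one trial run (all field names are hypothetical)."""
    task_id: str     # which curated task was attempted
    category: str    # e.g. "office work", "strategic game", "multimodal"
    succeeded: bool  # the observer's judgement of the complete task
    note: str        # what was actually seen


def extract_claims(observations, min_runs=3):
    """Group observations by category and emit evidence-backed claims.

    min_runs is an assumed threshold: categories with fewer observed runs
    are reported as "insufficient evidence" instead of as a claim.
    """
    by_category = defaultdict(list)
    for obs in observations:
        by_category[obs.category].append(obs)

    claims = {}
    for category, runs in by_category.items():
        successes = sum(1 for r in runs if r.succeeded)
        if len(runs) < min_runs:
            verdict = "insufficient evidence"
        elif successes == len(runs):
            verdict = "capability"
        elif successes == 0:
            verdict = "constraint"
        else:
            verdict = "mixed"
        claims[category] = {
            "verdict": verdict,
            "runs": len(runs),
            "successes": successes,
            "evidence": [r.note for r in runs],  # keep claims tied to observations
        }
    return claims


def rollout_blockers(claims):
    """Categories that hold back a wider rollout (an assumed gating policy)."""
    return sorted(c for c, v in claims.items() if v["verdict"] != "capability")


# Made-up log entries, for illustration only.
log = [
    Observation("t1", "office work", True, "filed the report end to end"),
    Observation("t2", "office work", True, "scheduled the meeting correctly"),
    Observation("t3", "office work", True, "merged two spreadsheets"),
    Observation("t4", "strategic game", True, "won against the scripted opponent"),
    Observation("t5", "strategic game", False, "repeated an illegal move"),
    Observation("t6", "strategic game", False, "lost track of the board state"),
    Observation("t7", "multimodal", False, "misread the chart axis"),
    Observation("t8", "multimodal", False, "ignored the attached image"),
    Observation("t9", "multimodal", False, "confused two screenshots"),
]

claims = extract_claims(log)
print(rollout_blockers(claims))  # prints ['multimodal', 'strategic game']
```

Keeping the raw notes in the evidence field is deliberate: each claim can be traced back to specific observed runs, which is what distinguishes it from a benchmark score. The thresholds and the gating rule are placeholders that a team would set for itself.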
