Methodology · Evaluation · emerging · verified

Real-World Agent Trial

also known as agent field trial, live capability probe

Applies to: agent, multi-agent-system, coding-agent, browser-agent

Tags: field-trial, observation, capability-probe

Try a candidate agent on real, open-ended tasks before you give it to production users. The tasks are realistic, such as office work, strategic games, and problems that mix text, images, and other inputs. The trial is about watching. People observe how the agent handles full tasks under real conditions. They log where it works and where it fails. From that, they write down what the agent can and cannot do, backed by what was actually seen, not by benchmark numbers. This adds to per-ability and whole-system tests. It does not replace them. It surfaces failures that made-up test sets miss.

Intent. Find out what an agent really can and cannot do by watching it work through real, open-ended tasks under field conditions.

When to apply. Use this before you commit to a wider rollout of a non-trivial agent. It helps most when the made-up tests feel untrustworthy, or when people disagree about what the agent can actually do. Don't apply it to narrow agents whose behaviour an existing test set already covers fully. Skip it for one-shot generation systems too, where there is no run to watch.

Inputs

  • Candidate agent: The agent build to be tried, set up so its actions can be observed.
  • Realistic task catalogue: Open-ended tasks from the intended use area: office workflows, strategic games, and problems that mix text, images, and other inputs.
  • Human observers: Reviewers who know the field, watch live runs, and write notes on what the agent does.

Outputs

  • Capability claims: Statements about what the agent reliably does, each backed by a run that was watched.
  • Constraint claims: Statements about what the agent cannot do, or does unreliably, each with a reference to a run.
  • Trial report: A record of what was observed. It informs the go or no-go call and the shape of the rollout.

Steps (5)

  1. Curate realistic tasks

    Pick a small, varied task set that looks like real use: drafting documents, scheduling, summarising meetings, playing through a strategic game, and answering a multi-step research question. Avoid tasks the test set already covers.
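
    The curation rule in this step can be sketched as a small filter. This is a minimal illustration, not a prescribed tool: the `Task` type, the `curate` function, the `per_category` cap, and the sample task IDs are all hypothetical names introduced here. It drops tasks the existing test set already covers and caps how many tasks come from one category, so the set stays small and varied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    category: str      # hypothetical labels, e.g. "drafting", "scheduling", "game"
    description: str


def curate(candidates, covered_ids, per_category=1):
    """Keep at most `per_category` tasks per category, skipping covered ones."""
    picked = []
    counts = {}
    for task in candidates:
        if task.task_id in covered_ids:
            continue  # the existing test set already exercises this task
        if counts.get(task.category, 0) >= per_category:
            continue  # keep the set varied rather than deep in one category
        counts[task.category] = counts.get(task.category, 0) + 1
        picked.append(task)
    return picked


# Illustrative catalogue; the entries are made up for this sketch.
candidates = [
    Task("t1", "drafting", "Draft a project status memo"),
    Task("t2", "drafting", "Draft a customer apology email"),
    Task("t3", "scheduling", "Find a meeting slot for five attendees"),
    Task("t4", "game", "Play through a strategy game scenario"),
]
selected = curate(candidates, covered_ids={"t1"})
print([t.task_id for t in selected])
```

    In practice the observers would pick the tasks by judgment. A filter like this only guards against overlap with the existing test set and against one category crowding out the rest.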

  2. Instrument the agent for observation

    Record the full run: tool calls, the reasoning in between, retries, and timings. Observers cannot grade what they cannot see.

    uses: Decision Log, Lineage Tracking
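
    One way to capture a run for observers is an append-only trace of timestamped events. The sketch below assumes nothing about any particular agent framework: `RunTrace`, `TraceEvent`, the event kinds, and the sample run are hypothetical names chosen for illustration. It records tool calls, intermediate reasoning, and retries, each with the time elapsed since the run started.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    kind: str          # assumed vocabulary: "tool_call", "reasoning", "retry"
    detail: str
    elapsed_s: float   # seconds since the run started


@dataclass
class RunTrace:
    run_id: str
    task_id: str
    events: list = field(default_factory=list)
    _start: float = field(default_factory=time.monotonic, repr=False)

    def record(self, kind, detail):
        """Append one event, stamped with time elapsed since the run began."""
        elapsed = time.monotonic() - self._start
        self.events.append(TraceEvent(kind, detail, elapsed))

    def to_json(self):
        """Serialise the trace so observers can review it after the run."""
        return json.dumps({
            "run_id": self.run_id,
            "task_id": self.task_id,
            "events": [asdict(e) for e in self.events],
        })


# Illustrative run; the event details are made up for this sketch.
trace = RunTrace(run_id="run-001", task_id="t3")
trace.record("reasoning", "Need each attendee's calendar before proposing a slot")
trace.record("tool_call", "calendar.lookup(attendee='alice')")
trace.record("retry", "calendar.lookup timed out, retrying once")
print(trace.to_json())
```

    A monotonic clock is used for timings so that system clock adjustments during a long run do not distort the recorded intervals.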

  3. Run trials with human observation

    Have people who know the field watch the agent work through each task end-to-end. Note where it stalls, loops, makes things up, succeeds, or surprises them.

  4. Extract grounded capability and constraint claims

    Turn what you saw into specific claims, each with a run reference. For example, 'the agent reliably drafts X when given Y', or 'the agent fails when the input contains Z'. Reject any claim that no logged run backs up.
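
    The rejection rule above can be checked mechanically. The following is a minimal sketch with hypothetical names (`Claim`, `split_grounded`, and the sample run IDs are not part of the methodology). A claim is kept only if at least one of the runs it cites appears in the set of logged runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    kind: str          # "capability" or "constraint"
    statement: str
    run_ids: tuple     # runs cited as evidence for this claim


def split_grounded(claims, logged_run_ids):
    """Separate claims backed by at least one logged run from those that are not."""
    accepted, rejected = [], []
    for claim in claims:
        backed = any(run_id in logged_run_ids for run_id in claim.run_ids)
        (accepted if backed else rejected).append(claim)
    return accepted, rejected


# Illustrative claims; statements and run IDs are made up for this sketch.
logged = {"run-001", "run-002"}
claims = [
    Claim("capability", "Reliably drafts X when given Y", ("run-001",)),
    Claim("constraint", "Fails when the input contains Z", ("run-002",)),
    Claim("capability", "Handles long documents well", ()),         # no evidence cited
    Claim("constraint", "Loops on ambiguous dates", ("run-999",)),  # run was never logged
]
accepted, rejected = split_grounded(claims, logged)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

    A stricter variant would require every cited run to be present in the logs. That is a design choice for the trial team, not something the step itself mandates.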

  5. Feed findings back into evals and rollout gating

    Add the newly found failures to the regression test set. Use the confirmed capabilities and limits to set how much freedom the agent gets in the next rollout step.
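
    As a rough sketch of this feedback step, the code below turns constraint findings into regression cases and derives a per-area autonomy setting. Everything here is an assumption made for illustration: the `Finding` type, the regression-case fields, the three autonomy levels, and the mapping rule. A real rollout gate would use whatever levels and thresholds the team has agreed on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    kind: str        # "capability" or "constraint"
    area: str        # hypothetical task area, e.g. "drafting"
    statement: str
    run_id: str      # the logged run that backs this finding


def regression_cases(findings):
    """Turn each observed failure into a regression-case stub tied to its run."""
    return [
        {"source_run": f.run_id, "area": f.area, "failure": f.statement}
        for f in findings
        if f.kind == "constraint"
    ]


def autonomy_by_area(findings):
    """Illustrative policy: more confirmed limits means less freedom in that area."""
    kinds_seen = {}
    for f in findings:
        kinds_seen.setdefault(f.area, set()).add(f.kind)
    policy = {}
    for area, kinds in kinds_seen.items():
        if kinds == {"capability"}:
            policy[area] = "autonomous"
        elif kinds == {"capability", "constraint"}:
            policy[area] = "human_approval"
        else:
            policy[area] = "blocked"
    return policy


# Illustrative findings; the content is made up for this sketch.
findings = [
    Finding("capability", "drafting", "Reliably drafts X when given Y", "run-001"),
    Finding("capability", "scheduling", "Finds a slot for small groups", "run-002"),
    Finding("constraint", "scheduling", "Fails when the input contains Z", "run-003"),
    Finding("constraint", "research", "Makes up sources on multi-step questions", "run-004"),
]
print(regression_cases(findings))
print(autonomy_by_area(findings))
```

    Keeping the run ID on every regression case preserves the link from a test back to the observation that motivated it.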

Principles

  • A trial produces claims about behaviour, not benchmark deltas. The unit of evidence is the watched run.
  • Real, open-ended tasks reveal failures that made-up tests miss by design.
  • Every claim must be tied to a specific logged run, or it does not ship.
  • Findings feed both the test set and the rollout gate. A trial that updates neither was wasted.

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified