Full-Code · Enterprise Platforms · active

Arize Phoenix / Arize AX

Type: full-code · Vendor: Arize AI · Language: Python · License: Elastic-2.0 · Status: active · Status in practice: mature · First released: 2023-02-07

Arize Phoenix and Arize AX trace agent runs and evaluate them with LLM judges, scoring both individual steps and the full trajectory of tool calls.

Description. Arize provides open-source Phoenix and the hosted Arize AX platform for tracing and evaluating LLM and agent applications. It captures spans and traces of agent runs and evaluates them with LLM-based judges. Trajectory evaluations send the ordered list of an agent's tool calls to an LLM judge that classifies the run as correct or incorrect, catching mistakes that single-step evaluations miss. Phoenix Evals runs these LLM classifications against captured traces.

Agent loop shape. Arize observes the agent rather than running its loop. It ingests spans and traces from an instrumented agent. Evaluation jobs then send a captured artifact, either an individual span or the ordered list of tool calls that forms a trajectory, to an LLM judge. The judge classifies the artifact as correct or incorrect against stated criteria, so failures surface at the step or trajectory level for the developer to act on.
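
The observe-then-judge flow described above can be sketched in plain Python. This is a minimal illustration, not the Phoenix or Arize AX API: the span dictionaries, the helper names (tool_trajectory, build_prompt, judge_trajectory), the "unparseable" fallback label, and the stub judge are all assumptions made for the example. A real setup would replace stub_judge with a call to an LLM.

```python
from typing import Callable, Dict, List

# Hypothetical span shape: plain dicts, not the Phoenix or OpenInference schema.
Span = Dict[str, object]

# The only labels the judge is allowed to return.
RAILS = ("correct", "incorrect")


def tool_trajectory(spans: List[Span]) -> List[str]:
    """Return tool names in call order, ignoring non-tool spans."""
    tool_spans = [s for s in spans if s["kind"] == "tool"]
    tool_spans.sort(key=lambda s: s["start"])
    return [str(s["name"]) for s in tool_spans]


def build_prompt(task: str, trajectory: List[str]) -> str:
    """Render the task and the ordered tool calls as a judge prompt."""
    steps = "\n".join(f"{i}. {name}" for i, name in enumerate(trajectory, 1))
    return (
        f"Task: {task}\n"
        f"Tool calls in order:\n{steps}\n"
        "Answer with exactly one word: correct or incorrect."
    )


def judge_trajectory(task: str, spans: List[Span], judge: Callable[[str], str]) -> str:
    """Ask the judge for a label and constrain it to the allowed set."""
    raw = judge(build_prompt(task, tool_trajectory(spans)))
    label = raw.strip().lower()
    return label if label in RAILS else "unparseable"


def stub_judge(prompt: str) -> str:
    # Stand-in for an LLM call: approves runs that search before doing anything else.
    return "Correct " if "1. search_flights" in prompt else "incorrect"


# Spans arrive out of order; the trajectory is rebuilt from start times.
spans = [
    {"kind": "tool", "name": "book_flight", "start": 2.0},
    {"kind": "llm", "name": "plan", "start": 0.5},
    {"kind": "tool", "name": "search_flights", "start": 1.0},
]

verdict = judge_trajectory("Reserve a seat to Oslo", spans, stub_judge)
```

Passing the judge in as a callable keeps the trajectory extraction and label handling testable without a model, and restricting the output to a fixed label set means a free-form judge reply cannot leak into the scores.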

Primary use cases

  • tracing LLM and agent runs
  • LLM-as-judge evaluation of agent trajectories
  • scoring individual tool-calling spans
  • detecting mistakes an agent makes between steps

Key concepts

  • OpenInference: Arize's set of OpenTelemetry semantic conventions for LLM and agent spans, giving every model call, tool invocation, and retrieval a standardized trace shape that Phoenix ingests.
  • Phoenix Evals (llm-as-judge): The evaluation library that runs LLM classifications (via llm_classify) against captured traces and spans, scoring them correct or incorrect against a rubric.
  • Datasets & Experiments (eval-harness): Versioned example sets and the experiment runs over them, with dataset evaluators that score outputs automatically. This is Phoenix's offline, repeatable testing workflow.
  • Trajectory Evaluation (artifact-evaluation): A trace-level evaluation that hands the ordered list of an agent's tool calls to an LLM judge to score whether the whole run progressed logically, used the right tools, and was efficient.
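
The Datasets & Experiments idea above (a fixed example set, a task run over it, and evaluators that score each output) can be sketched as a small harness. This is an illustrative sketch only: run_offline_experiment, exact_match, and the example dictionary shape are hypothetical names chosen here, not Phoenix's experiments API, and the dataset is assumed to be non-empty.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical example shape: {"input": ..., "expected": ...}
Example = Dict[str, str]
Evaluator = Callable[[str, Example], float]


def exact_match(output: str, example: Example) -> float:
    """Score 1.0 when the output equals the expected answer, else 0.0."""
    return 1.0 if output.strip() == example["expected"] else 0.0


def run_offline_experiment(
    dataset: List[Example],
    task: Callable[[str], str],
    evaluators: Dict[str, Evaluator],
) -> Dict[str, float]:
    """Run the task on every example and return the mean score per evaluator."""
    scores: Dict[str, List[float]] = {name: [] for name in evaluators}
    for example in dataset:
        output = task(example["input"])
        for name, evaluator in evaluators.items():
            scores[name].append(evaluator(output, example))
    # Assumes a non-empty dataset; mean() raises on an empty list.
    return {name: mean(values) for name, values in scores.items()}


# A tiny fixed dataset; str.upper stands in for the application under test.
dataset = [
    {"input": "oslo", "expected": "OSLO"},
    {"input": "rome", "expected": "Rome"},
]

results = run_offline_experiment(dataset, str.upper, {"exact_match": exact_match})
```

Because the dataset is fixed, rerunning the same harness after a prompt or model change gives scores that are directly comparable across runs.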
