Arize Phoenix / Arize AX

Type: full-code · Vendor: Arize AI · Language: Python · License: Elastic-2.0 · Status: active · Status in practice: mature · First released: 2023-02-07

Links: homepage docs repo

Arize Phoenix and Arize AX trace agent runs and evaluate them with LLM judges, scoring both individual steps and the full trajectory of tool calls.

Description. Arize provides open-source Phoenix and the hosted Arize AX platform for tracing and evaluating LLM and agent applications. Both capture spans and traces of agent runs and evaluate them with LLM-based judges. A trajectory evaluation sends the ordered list of an agent's tool calls to an LLM judge, which classifies the run as correct or incorrect and so catches mistakes that single-step evaluations miss. Phoenix Evals is the library that runs these LLM classifications against captured traces.

Agent loop shape. Arize observes the agent rather than running its loop. It ingests spans and traces from an instrumented agent. Evaluation jobs then send the captured artifacts (an individual span, or the ordered list of tool calls forming a trajectory) to an LLM judge, which classifies them as correct or incorrect against stated criteria. Failures surface at the step or trajectory level for the developer to act on.
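
The flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Phoenix's implementation: the span records are hypothetical dictionaries, the attribute keys are assumed to follow OpenInference naming, and `stub_judge` is a stand-in for a real LLM call. The helper names (`tool_trajectory`, `build_prompt`, `snap_to_rail`, `evaluate_trajectory`) are invented for this example.

```python
from typing import Callable

# Hypothetical span records for one trace. The attribute keys are assumed
# to follow OpenInference naming; real exported spans carry more fields.
spans = [
    {"start_time": 3, "attributes": {"openinference.span.kind": "TOOL",
                                     "tool.name": "book_flight",
                                     "input.value": '{"flight": "TP123"}'}},
    {"start_time": 1, "attributes": {"openinference.span.kind": "LLM",
                                     "input.value": "plan the trip"}},
    {"start_time": 2, "attributes": {"openinference.span.kind": "TOOL",
                                     "tool.name": "search_flights",
                                     "input.value": '{"to": "LIS"}'}},
]


def tool_trajectory(trace_spans: list[dict]) -> list[dict]:
    """Return the tool calls of one trace, ordered by start time."""
    tools = [s for s in trace_spans
             if s["attributes"].get("openinference.span.kind") == "TOOL"]
    tools.sort(key=lambda s: s["start_time"])
    return [{"tool": s["attributes"].get("tool.name"),
             "input": s["attributes"].get("input.value")} for s in tools]


def build_prompt(task: str, trajectory: list[dict]) -> str:
    """Render the ordered tool calls into a judge prompt."""
    steps = "\n".join(f"{i}. {c['tool']}({c['input']})"
                      for i, c in enumerate(trajectory, 1))
    return (f"Task: {task}\nTool calls in order:\n{steps}\n"
            "Answer with exactly one word: correct or incorrect.")


def snap_to_rail(raw: str) -> str:
    """Map free-form judge output onto the allowed labels."""
    text = raw.strip().lower()
    # "incorrect" is checked first so it is not mistaken for "correct".
    for rail in ("incorrect", "correct"):
        if text.startswith(rail):
            return rail
    return "unparseable"


def evaluate_trajectory(task: str, trace_spans: list[dict],
                        judge: Callable[[str], str]) -> str:
    """Classify a whole run by sending its trajectory to a judge."""
    return snap_to_rail(judge(build_prompt(task, tool_trajectory(trace_spans))))


def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM: approves runs that search before booking."""
    ok = prompt.find("search_flights") < prompt.find("book_flight")
    return "correct" if ok else "incorrect"


label = evaluate_trajectory("Book the cheapest flight to Lisbon", spans, stub_judge)
```

Passing the judge in as a callable keeps the trajectory extraction testable without network access; in practice the callable would wrap whichever model performs the classification.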

Primary use cases

tracing LLM and agent runs
LLM-as-judge evaluation of agent trajectories
scoring individual tool-calling spans
detecting mistakes an agent makes between steps

flowchart TD
    fw["Arize Phoenix / Arize AX"]
    fw --> p1["LLM-as-Judge<br/>(core)"]
    fw --> p2["Intermediate Artifact Evaluation<br/>(first-class)"]
    fw --> p3["Eval Harness<br/>(first-class)"]
    fw --> p4["Dual Evaluation (Offline + Online)<br/>(supported)"]

Key concepts

OpenInference (docs) — Arize's set of OpenTelemetry semantic conventions for LLM and agent spans, giving every model call, tool invocation, and retrieval a standardized trace shape that Phoenix ingests.
Phoenix Evals → llm-as-judge (docs) — The evaluation library that runs LLM classifications (via llm_classify) against captured traces and spans, scoring them correct or incorrect against a rubric.
Datasets & Experiments → eval-harness (docs) — Versioned example sets and the experiment runs over them, with dataset evaluators that score outputs automatically — Phoenix's offline, repeatable testing workflow.
Trajectory Evaluation → artifact-evaluation (docs) — A trace-level evaluation that hands the ordered list of an agent's tool calls to an LLM judge to score whether the whole run progressed logically, used the right tools, and was efficient.
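
The offline workflow named under Datasets & Experiments can be pictured as a loop over a fixed example set. The sketch below is a generic illustration in plain Python and does not use the Phoenix client API; `Example`, `run_experiment`, and `exact_match` are hypothetical names chosen for this example, and the task is a trivial stand-in for an agent.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Example:
    """One dataset row: an input and the output we expect for it."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Evaluator: 1.0 when the output equals the expected value, else 0.0."""
    return 1.0 if output == expected else 0.0


def run_experiment(
    dataset: list[Example],
    task: Callable[[str], str],
    evaluators: dict[str, Callable[[str, str], float]],
) -> tuple[list[dict], dict[str, float]]:
    """Run the task on every example, score each output, and average the scores."""
    rows = []
    for example in dataset:
        output = task(example.input)
        scores = {name: fn(output, example.expected)
                  for name, fn in evaluators.items()}
        rows.append({"input": example.input, "output": output, **scores})
    summary = {name: mean(row[name] for row in rows) for name in evaluators}
    return rows, summary


# A fixed dataset makes the run repeatable: the same examples are scored
# every time the task (here a stand-in for an agent) changes.
dataset = [Example("lisbon", "LISBON"), Example("porto", "Porto")]
rows, summary = run_experiment(dataset, str.upper, {"exact_match": exact_match})
```

Because the dataset is fixed, two runs with different task implementations produce directly comparable summaries, which is the point of a repeatable testing workflow.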
