Arize Phoenix / Arize AX
Type: full-code · Vendor: Arize AI · Language: Python · License: Elastic-2.0 · Status: active · Status in practice: mature · First released: 2023-02-07
Arize Phoenix and Arize AX trace agent runs and evaluate them with LLM judges, scoring both individual steps and the full trajectory of tool calls.
Description. Arize provides two products for tracing and evaluating LLM and agent applications: the open-source Phoenix and the hosted Arize AX platform. Both capture spans and traces of agent runs and evaluate them with LLM-based judges. Trajectory evaluations send the ordered list of an agent's tool calls to an LLM judge that classifies the run as correct or incorrect, catching mistakes that single-step evaluations miss. Phoenix Evals is the library that runs these LLM classifications against captured traces.
Agent loop shape. Arize observes the agent rather than running its loop. It ingests spans and traces from an instrumented agent. Evaluation jobs then send a captured artifact, either an individual span or the ordered list of tool calls that forms a trajectory, to an LLM judge. The judge classifies the artifact as correct or incorrect against stated criteria, surfacing failures at the step or trajectory level for the developer to act on.
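The observe-then-judge flow described above can be sketched in plain Python. This is a minimal illustration, not Arize's implementation: the `Span` fields, the prompt wording, and every function name here are hypothetical, and `stub_judge` is a deterministic stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Span:
    """Hypothetical, simplified span record (real spans carry far more data)."""
    name: str
    kind: str          # assumed values: "TOOL", "LLM", ...
    start_time: float  # seconds since the run started


def extract_trajectory(spans: List[Span]) -> List[str]:
    """Return tool names in execution order, ignoring non-tool spans."""
    tool_spans = [s for s in spans if s.kind == "TOOL"]
    tool_spans.sort(key=lambda s: s.start_time)
    return [s.name for s in tool_spans]


def build_judge_prompt(question: str, trajectory: List[str]) -> str:
    """Render the ordered tool calls into a prompt for the judge."""
    steps = "\n".join(f"{i}. {name}" for i, name in enumerate(trajectory, start=1))
    return (
        "You are judging an agent run.\n"
        f"User request: {question}\n"
        f"Tool calls in order:\n{steps}\n"
        "Answer with exactly one word."
    )


def evaluate_trajectory(
    question: str, spans: List[Span], judge: Callable[[str], str]
) -> str:
    """Send the whole trajectory to the judge and normalize its label."""
    prompt = build_judge_prompt(question, extract_trajectory(spans))
    return judge(prompt).strip().lower()


def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM: a flight must be searched before it is booked."""
    search_at = prompt.find("search_flights")
    book_at = prompt.find("book_flight")
    return "correct" if 0 <= search_at < book_at else "incorrect"
```

Because the judge sees the full ordered list, it can flag a run in which each call looks fine on its own but the order is wrong, which is the kind of between-step mistake a single-span check would miss.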
Primary use cases
- tracing LLM and agent runs
- LLM-as-judge evaluation of agent trajectories
- scoring individual tool-calling spans
- detecting mistakes an agent makes between steps
Key concepts
- OpenInference (docs) — Arize's set of OpenTelemetry semantic conventions for LLM and agent spans, giving every model call, tool invocation, and retrieval a standardized trace shape that Phoenix ingests.
- Phoenix Evals → llm-as-judge (docs) — The evaluation library that runs LLM classifications (via llm_classify) against captured traces and spans, scoring them correct or incorrect against a rubric.
- Datasets & Experiments → eval-harness (docs) — Versioned example sets and the experiment runs over them, with dataset evaluators that score outputs automatically — Phoenix's offline, repeatable testing workflow.
- Trajectory Evaluation → artifact-evaluation (docs) — A trace-level evaluation that hands the ordered list of an agent's tool calls to an LLM judge to score whether the whole run progressed logically, used the right tools, and was efficient.
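Classifying a captured span against a rubric comes down to two steps: fill a prompt template from the record, then force the judge's free-form reply onto a fixed set of labels. The sketch below illustrates that idea only. It does not call `llm_classify`; the template text, the `NOT_PARSABLE` sentinel, and the `classify` and `snap_to_rails` helpers are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List

NOT_PARSABLE = "NOT_PARSABLE"  # hypothetical sentinel for unusable replies

# Hypothetical rubric; a real one would spell out the criteria in more detail.
RUBRIC_TEMPLATE = (
    "You are evaluating one tool-calling step of an agent.\n"
    "User request: {request}\n"
    "Tool call: {tool_call}\n"
    "Reply with a single word: correct or incorrect."
)


def snap_to_rails(raw_output: str, rails: List[str]) -> str:
    """Map free-form judge text onto exactly one allowed label.

    Whole-word matching matters here: "correct" is a substring of
    "incorrect", so a plain substring test would match both labels.
    """
    matched = [
        rail
        for rail in rails
        if re.search(rf"\b{re.escape(rail)}\b", raw_output, flags=re.IGNORECASE)
    ]
    return matched[0] if len(matched) == 1 else NOT_PARSABLE


def classify(
    records: List[Dict[str, str]],
    template: str,
    rails: List[str],
    judge: Callable[[str], str],
) -> List[str]:
    """Label each record by prompting the judge and snapping its reply."""
    labels = []
    for record in records:
        prompt = template.format(**record)
        labels.append(snap_to_rails(judge(prompt), rails))
    return labels
```

Returning a sentinel for ambiguous replies, rather than guessing, keeps unparseable judge output visible in the results instead of silently counting it as a pass or a fail.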
Patterns this full-code implements
- ★★ LLM-as-Judge
Arize evaluations use an LLM judge to classify outputs against criteria — for example a trajectory that progresses logically, uses the right tools, and is reasonably efficient — when no exact-match m…
- ★ Intermediate Artifact Evaluation
Arize evaluates intermediate agent artifacts — sending the ordered list of tool calls (the trajectory) to an LLM judge and scoring individual tool-calling spans — rather than only the final output, i…
- ★★ Eval Harness
Phoenix datasets and experiments form an offline eval harness: a versioned dataset of examples is run through a task and its associated evaluators score the outputs automatically, giving repeatable r…
- ★ Dual Evaluation (Offline + Online)
Arize runs evaluation on two tracks: offline experiments against curated datasets before deploy, and online evals that run continuously over production trace data on a rolling schedule, so quality is…
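The offline and online tracks named in the last two patterns can be sketched side by side. This is an illustrative sketch under stated assumptions, not Phoenix's experiment API: `run_offline_experiment`, `score_rolling_window`, the `Trace` record, and both evaluators are hypothetical names. The offline evaluator has a reference answer to compare against, while the online one is reference-free because production traces carry no expected output.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

Example = Dict[str, str]                  # assumed shape: {"input": ..., "expected": ...}
Evaluator = Callable[[str, str], float]   # (output, expected) -> score in [0, 1]


def exact_match(output: str, expected: str) -> float:
    """Offline evaluator: compare against the dataset's reference answer."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_offline_experiment(
    dataset: List[Example],
    task: Callable[[str], str],
    evaluators: Dict[str, Evaluator],
) -> Dict[str, float]:
    """Run the task over every example and average each evaluator's scores."""
    scores: Dict[str, List[float]] = {name: [] for name in evaluators}
    for example in dataset:
        output = task(example["input"])
        for name, evaluator in evaluators.items():
            scores[name].append(evaluator(output, example["expected"]))
    return {name: (mean(vals) if vals else 0.0) for name, vals in scores.items()}


@dataclass
class Trace:
    """Hypothetical production trace: only a timestamp and the final output."""
    timestamp: float
    output: str


def non_empty(output: str) -> float:
    """Online evaluator: reference-free check that the agent answered at all."""
    return 1.0 if output.strip() else 0.0


def score_rolling_window(
    traces: List[Trace],
    now: float,
    window_seconds: float,
    check: Callable[[str], float],
) -> float:
    """Average a reference-free check over traces inside the recent window."""
    recent = [t for t in traces if now - window_seconds <= t.timestamp <= now]
    return mean(check(t.output) for t in recent) if recent else 0.0
```

In practice both checks would usually be LLM judges rather than string tests; the simple functions here just keep the sketch deterministic.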