Braintrust

Type: full-code · Vendor: Braintrust Data, Inc. · Language: TypeScript · License: proprietary · Status: active · Status in practice: mature · First released: 2023-01-01


Braintrust is a platform for evaluating LLM applications offline, by running them against datasets and scoring the outputs, and for scoring production traces online without adding latency.

Description. Braintrust is a platform for building, evaluating, and monitoring LLM applications. Offline, each evaluation runs a task against a dataset of test cases and measures the output with scoring functions, producing an experiment that records inputs, outputs, scores, and metadata for comparing prompt and model versions. Online, configured scorers run against production traces automatically as they are logged, evaluating asynchronously in the background. This gives both pre-deploy regression checks and continuous production quality monitoring.

Agent loop shape. Braintrust evaluates the application rather than running it. Offline, an Eval pairs a dataset, a task function under test, and scoring functions; running it produces an experiment that records inputs, outputs, and scores for comparison across versions. Online, scorers run asynchronously against production traces as they arrive and log quality metrics for the output already recorded, without regenerating it or adding latency to the application.
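The offline shape described above can be sketched in plain TypeScript. This is an illustrative local model, not the Braintrust SDK: `runEval`, `TestCase`, `Scorer`, and `Experiment` are hypothetical names, the task is a synchronous stub standing in for an LLM call, and the real SDK's signatures may differ.

```typescript
// Illustrative local model of an offline eval. NOT the Braintrust SDK:
// every name below is a hypothetical stand-in.
type TestCase = { input: string; expected: string };
type Scorer = { name: string; score: (output: string, expected: string) => number };
type ExperimentRow = TestCase & { output: string; scores: Record<string, number> };
type Experiment = { name: string; metadata: Record<string, string>; rows: ExperimentRow[] };

// Run the task on every test case and score each output with every scorer.
function runEval(
  name: string,
  dataset: TestCase[],
  task: (input: string) => string,
  scorers: Scorer[],
  metadata: Record<string, string> = {},
): Experiment {
  const rows = dataset.map(({ input, expected }): ExperimentRow => {
    const output = task(input);
    const scores: Record<string, number> = {};
    for (const scorer of scorers) {
      scores[scorer.name] = scorer.score(output, expected);
    }
    return { input, expected, output, scores };
  });
  return { name, metadata, rows };
}

// A code-heuristic scorer: 1 when the output equals the expected value, else 0.
const exactMatch: Scorer = {
  name: "exact_match",
  score: (output, expected) => (output === expected ? 1 : 0),
};

const dataset: TestCase[] = [
  { input: "Alice", expected: "Hi Alice" },
  { input: "Bob", expected: "Hi Bob" },
];

// Stub task under test; it is deliberately wrong for "Bob" so the scores differ.
const greet = (input: string): string =>
  input === "Bob" ? "Hello Bob" : `Hi ${input}`;

const experiment = runEval("greeting-v1", dataset, greet, [exactMatch], { prompt: "v1" });
const exactMatchScores = experiment.rows.map((row) => row.scores["exact_match"]);
```

The returned `experiment` keeps inputs, outputs, scores, and metadata together, which is what makes a later run with a different prompt or model comparable row by row. A real task would normally be asynchronous, since it calls a model.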

Primary use cases

offline evaluation of LLM applications against datasets
comparing prompt and model versions across experiments
catching regressions in CI
online scoring of production traces
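Comparing experiments and catching regressions in CI can be sketched as a comparison of mean scores between a baseline run and a candidate run. The helper names (`meanScore`, `findRegressions`), the row shape, and the 0.02 tolerance are assumptions made for illustration, not Braintrust's comparison logic.

```typescript
// Hypothetical regression check between two experiments' scored rows.
type ScoredRow = { scores: Record<string, number> };
type Regression = { scorer: string; baseline: number; candidate: number; delta: number };

// Mean of one scorer across rows; rows missing that scorer are skipped.
function meanScore(rows: ScoredRow[], scorer: string): number {
  const values = rows
    .map((row) => row.scores[scorer])
    .filter((value): value is number => typeof value === "number");
  if (values.length === 0) return 0;
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Report every scorer whose mean dropped by more than `tolerance`.
function findRegressions(
  baseline: ScoredRow[],
  candidate: ScoredRow[],
  scorers: string[],
  tolerance = 0.02,
): Regression[] {
  const regressions: Regression[] = [];
  for (const scorer of scorers) {
    const before = meanScore(baseline, scorer);
    const after = meanScore(candidate, scorer);
    const delta = after - before;
    if (delta < -tolerance) {
      regressions.push({ scorer, baseline: before, candidate: after, delta });
    }
  }
  return regressions;
}

const baselineRows: ScoredRow[] = [
  { scores: { exact_match: 1, conciseness: 0.5 } },
  { scores: { exact_match: 1, conciseness: 0.5 } },
];
const candidateRows: ScoredRow[] = [
  { scores: { exact_match: 1, conciseness: 0.5 } },
  { scores: { exact_match: 0, conciseness: 1 } },
];

const regressions = findRegressions(baselineRows, candidateRows, ["exact_match", "conciseness"]);

// A CI job could pass this value to its exit call to fail the build on a regression.
const ciExitCode = regressions.length > 0 ? 1 : 0;
```

Here `exact_match` falls from 1 to 0.5 and is reported, while `conciseness` improves and is not. The tolerance exists so that small score noise does not fail a build.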

flowchart TD
    fw["Braintrust"]
    fw --> p1["Eval Harness<br/>(core)"]
    fw --> p2["Scorer Live Monitoring<br/>(first-class)"]
    fw --> p3["LLM-as-Judge<br/>(first-class)"]
    fw --> p4["Prompt Versioning<br/>(supported)"]

Key concepts

Eval → eval-harness (docs) — The core offline unit that pairs a dataset, a task function under test, and scoring functions; running it produces an experiment for comparing versions.
Experiment (docs) — The permanent record produced by running an Eval, capturing inputs, outputs, scores, and metadata so prompt and model versions can be compared and regressions caught.
Scorers / autoevals → llm-as-judge (docs) — Scoring functions — code heuristics or LLM-as-a-judge prompts — that assign a quality score to outputs, with pre-built scorers shipped in the open-source autoevals library.
Online scoring → scorer-live-monitoring (docs) — Scorers configured to run asynchronously against production traces as they are logged, providing continuous quality monitoring without adding latency.
Playground → prompt-versioning (docs) — An interactive surface for testing prompts against sample inputs and saving them to a versioned prompt registry before rolling out to environments.
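The online scoring concept above, where logging returns immediately and scorers run in the background, can be sketched as follows. `OnlineScoringLog`, `logTrace`, and `flush` are hypothetical names for an in-memory model; this is not how Braintrust is implemented, only an illustration of decoupling scoring from the request path.

```typescript
// Hypothetical in-memory model of online scoring. NOT the Braintrust SDK.
type Trace = { id: string; input: string; output: string };
type OnlineScorer = { name: string; score: (trace: Trace) => Promise<number> };
type ScoreRecord = { traceId: string; scorer: string; value: number };

class OnlineScoringLog {
  readonly traces: Trace[] = [];
  readonly scores: ScoreRecord[] = [];
  private scorers: OnlineScorer[];
  private pending: Promise<void>[] = [];

  constructor(scorers: OnlineScorer[]) {
    this.scorers = scorers;
  }

  // Records the trace and returns synchronously; scoring is scheduled, not awaited,
  // so the caller's request path does not wait on any scorer.
  logTrace(trace: Trace): void {
    this.traces.push(trace);
    for (const scorer of this.scorers) {
      const job = Promise.resolve()
        .then(() => scorer.score(trace))
        .then((value) => {
          this.scores.push({ traceId: trace.id, scorer: scorer.name, value });
        })
        .catch(() => {
          // A failing scorer must not affect the application; drop the score.
        });
      this.pending.push(job);
    }
  }

  // Waits for the scoring jobs scheduled so far (useful in tests or at shutdown).
  async flush(): Promise<void> {
    await Promise.all(this.pending);
    this.pending = [];
  }
}

// Code-heuristic scorer: 1 if the logged output is non-empty, else 0.
const nonEmptyOutput: OnlineScorer = {
  name: "non_empty_output",
  score: async (trace) => (trace.output.trim().length > 0 ? 1 : 0),
};

const monitor = new OnlineScoringLog([nonEmptyOutput]);
monitor.logTrace({ id: "t1", input: "ping", output: "pong" });

// Scoring has only been scheduled at this point, so no score is recorded yet.
const scoresRightAfterLog = monitor.scores.length;
```

The scorer reads the output already stored on the trace, so nothing is regenerated. An LLM-as-a-judge scorer would fit the same `OnlineScorer` shape, with `score` calling a model instead of applying a heuristic.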

