Braintrust
Type: full-code · Vendor: Braintrust Data, Inc. · Language: TypeScript · License: proprietary · Status: active · Status in practice: mature · First released: 2023-01-01
Braintrust is a platform for evaluating LLM applications offline against datasets and scorers and for scoring production traces online without adding latency.
Description. Braintrust is a platform for building, evaluating, and monitoring LLM applications. Offline, each evaluation runs a task against a dataset of test cases and measures the output with scoring functions, producing an experiment that records inputs, outputs, scores, and metadata for comparing prompt and model versions. Online, configured scorers run against production traces automatically as they are logged, evaluating asynchronously in the background. Together, the two modes provide pre-deploy regression checks and continuous production quality monitoring.
Agent loop shape. Braintrust evaluates rather than runs the application. Offline, an Eval pairs a dataset, a task function under test, and scoring functions; running it produces an experiment that records inputs, outputs, and scores for comparison across versions. Online, scorers run asynchronously against production traces as they arrive, logging quality metrics without regenerating output or adding latency to the application.
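The offline loop described above (dataset, task under test, scoring functions, resulting experiment) can be illustrated with a small library-free sketch. It is not the Braintrust SDK: `runEval`, `TestCase`, `Scorer`, `Experiment`, and `meanScore` are hypothetical names chosen for this example, and the task is synchronous only to keep the sketch short, whereas a real task would normally await a model call.

```typescript
// Hypothetical types for illustration only; this is not the Braintrust SDK.
interface TestCase<I, O> {
  input: I;
  expected: O;
}

// A scorer maps an output (and the expected value) to a score in [0, 1].
interface Scorer<O> {
  name: string;
  score: (output: O, expected: O) => number;
}

interface ExperimentRow<I, O> {
  input: I;
  output: O;
  expected: O;
  scores: Record<string, number>;
}

interface Experiment<I, O> {
  name: string;
  metadata: Record<string, string>;
  rows: ExperimentRow<I, O>[];
  summary: Record<string, number>; // mean of each scorer across all rows
}

// Run the task on every test case, score each output, and record everything.
function runEval<I, O>(
  name: string,
  dataset: TestCase<I, O>[],
  task: (input: I) => O,
  scorers: Scorer<O>[],
  metadata: Record<string, string> = {},
): Experiment<I, O> {
  const rows: ExperimentRow<I, O>[] = dataset.map(({ input, expected }) => {
    const output = task(input);
    const scores: Record<string, number> = {};
    for (const s of scorers) scores[s.name] = s.score(output, expected);
    return { input, output, expected, scores };
  });

  const summary: Record<string, number> = {};
  for (const s of scorers) {
    const total = rows.reduce((sum, r) => sum + (r.scores[s.name] ?? 0), 0);
    summary[s.name] = rows.length > 0 ? total / rows.length : 0;
  }
  return { name, metadata, rows, summary };
}

function meanScore<I, O>(experiment: Experiment<I, O>, scorer: string): number {
  return experiment.summary[scorer] ?? 0;
}

const exactMatch: Scorer<string> = {
  name: "exact_match",
  score: (output, expected) => (output === expected ? 1 : 0),
};

const greetings: TestCase<string, string>[] = [
  { input: "Ada", expected: "Hi Ada" },
  { input: "Grace", expected: "Hi Grace" },
];

// Two task versions stand in for two prompt or model versions.
const baseline = runEval("baseline", greetings, (n) => `Hi ${n}`, [exactMatch]);
const candidate = runEval(
  "candidate",
  greetings,
  (n) => (n === "Ada" ? "Hi Ada" : "Hello Grace"),
  [exactMatch],
);

// A CI gate could fail the build when the candidate scores below the baseline.
const regressed =
  meanScore(candidate, "exact_match") < meanScore(baseline, "exact_match");
```

Because both experiments keep their per-row inputs, outputs, and scores, a regression can be traced to the specific test case that changed rather than only to the aggregate.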
Primary use cases
- offline evaluation of LLM applications against datasets
- comparing prompt and model versions across experiments
- catching regressions in CI
- online scoring of production traces
Key concepts
- Eval → eval-harness (docs) — The core offline unit that pairs a dataset, a task function under test, and scoring functions; running it produces an experiment for comparing versions.
- Experiment (docs) — The permanent record produced by running an Eval, capturing inputs, outputs, scores, and metadata so prompt and model versions can be compared and regressions caught.
- Scorers / autoevals → llm-as-judge (docs) — Scoring functions — code heuristics or LLM-as-a-judge prompts — that assign a quality score to outputs, with pre-built scorers shipped in the open-source autoevals library.
- Online scoring → scorer-live-monitoring (docs) — Scorers configured to run asynchronously against production traces as they are logged, providing continuous quality monitoring without adding latency.
- Playground → prompt-versioning (docs) — An interactive surface for testing prompts against sample inputs and saving them to a versioned prompt registry before rolling out to environments.
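The online scoring concept above, where scorers run off the request path against already-logged traces, can be sketched with a simple queue. This is an illustrative model rather than Braintrust's implementation: `OnlineScoringQueue`, `Trace`, `TraceScorer`, and the `drain` step are hypothetical names, and a real system would run the background step on its own schedule rather than by an explicit call.

```typescript
// Hypothetical model of online scoring; names are not from the Braintrust SDK.
interface Trace {
  id: string;
  input: string;
  output: string;
}

interface TraceScorer {
  name: string;
  score: (trace: Trace) => number;
}

interface ScoreRecord {
  traceId: string;
  scorer: string;
  score: number;
}

class OnlineScoringQueue {
  private pending: Trace[] = [];
  private scorers: TraceScorer[];
  results: ScoreRecord[] = [];

  constructor(scorers: TraceScorer[]) {
    this.scorers = scorers;
  }

  // Request path: record the trace and return immediately. No scorer runs here,
  // so logging adds no scoring latency to the application.
  log(trace: Trace): void {
    this.pending.push(trace);
  }

  // Background path: score the logged output as-is. The application's task is
  // never re-run, so nothing is regenerated.
  drain(): number {
    const batch = this.pending;
    this.pending = [];
    for (const trace of batch) {
      for (const s of this.scorers) {
        this.results.push({ traceId: trace.id, scorer: s.name, score: s.score(trace) });
      }
    }
    return batch.length;
  }
}

// A code-heuristic scorer: 1 if the answer fits a 20-character budget, else 0.
const conciseness: TraceScorer = {
  name: "conciseness",
  score: (trace) => (trace.output.length <= 20 ? 1 : 0),
};

const queue = new OnlineScoringQueue([conciseness]);
queue.log({ id: "t1", input: "q1", output: "short answer" });
queue.log({ id: "t2", input: "q2", output: "a much longer answer than the limit allows" });

const scoredBeforeDrain = queue.results.length; // nothing is scored at log time
const drained = queue.drain(); // scoring happens later, in the background step
```

The same `TraceScorer` shape could wrap an LLM-as-a-judge call instead of a code heuristic; the queue does not care how the score is produced.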
Patterns this full-code implements
- ★★Eval Harness
An offline Braintrust evaluation runs a task function against a dataset of test cases and scores the outputs, producing a permanent experiment record that captures inputs, outputs, and scores for com…
- ★Scorer Live Monitoring
Braintrust online scoring runs configured scorers against production logs automatically as they arrive, asynchronously in the background, observing and logging quality without adding latency or regen…
- ★★LLM-as-Judge
Braintrust scorers can be expressed as LLM-as-a-judge functions that use a language model and a natural-language rubric to score subjective output quality where deterministic exact-match metrics do n…
- ★★Prompt Versioning
Braintrust treats prompts as versioned artifacts in a registry: every change creates a new version automatically, prompts deploy independently of application code, and a previously validated version…
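The prompt versioning pattern (every change creates a new version, and prompts deploy independently of application code) can be sketched as a small in-memory registry. This is a hedged illustration, not Braintrust's API: `PromptRegistry`, `save`, `deploy`, and `resolve` are hypothetical names, and pinning an environment to an earlier saved version is an assumption about how such a registry might be used.

```typescript
// Hypothetical in-memory registry; names are not from the Braintrust SDK.
interface PromptVersion {
  version: number;
  template: string;
}

class PromptRegistry {
  private versions = new Map<string, PromptVersion[]>();
  private pins = new Map<string, number>(); // key: "<env>:<slug>"

  // Every save appends a new version; earlier versions are never overwritten.
  save(slug: string, template: string): PromptVersion {
    const history = this.versions.get(slug) ?? [];
    const next: PromptVersion = { version: history.length + 1, template };
    this.versions.set(slug, [...history, next]);
    return next;
  }

  // Point an environment at a specific saved version, without a code deploy.
  deploy(slug: string, env: string, version: number): void {
    if (this.find(slug, version) === undefined) {
      throw new Error(`unknown version ${version} of prompt "${slug}"`);
    }
    this.pins.set(`${env}:${slug}`, version);
  }

  // Application code asks for "the prompt for this environment" at run time.
  resolve(slug: string, env: string): PromptVersion | undefined {
    const pinned = this.pins.get(`${env}:${slug}`);
    return pinned === undefined ? undefined : this.find(slug, pinned);
  }

  private find(slug: string, version: number): PromptVersion | undefined {
    return (this.versions.get(slug) ?? []).find((v) => v.version === version);
  }
}

const registry = new PromptRegistry();
const v1 = registry.save("greeting", "Hi {{name}}");
const v2 = registry.save("greeting", "Hello there, {{name}}!");

registry.deploy("greeting", "production", v2.version);
// Assumption for illustration: production is re-pinned to the earlier version.
registry.deploy("greeting", "production", v1.version);

const live = registry.resolve("greeting", "production");
```

Keeping the environment pointer separate from the version history is what lets the application pick up a different prompt without being rebuilt.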