Full-Code · Enterprise Platforms · active

Braintrust

Type: full-code · Vendor: Braintrust Data, Inc. · Language: TypeScript · License: proprietary · Status: active · Status in practice: mature · First released: 2023-01-01


Braintrust is a platform for evaluating LLM applications offline against datasets and scorers and for scoring production traces online without adding latency.

Description. Braintrust is a platform for building, evaluating, and monitoring LLM applications. Offline, each evaluation runs a task against a dataset of test cases and measures the output with scoring functions, producing an experiment that records inputs, outputs, scores, and metadata for comparing prompt and model versions. Online, configured scorers run against production traces automatically as they are logged, evaluating asynchronously in the background. This gives both pre-deploy regression checks and continuous production quality monitoring.

Agent loop shape. Braintrust evaluates rather than runs the application. Offline, an Eval pairs a dataset, a task function under test, and scoring functions; running it produces an experiment that records inputs, outputs, and scores for comparison across versions. Online, scorers run asynchronously against production traces as they arrive, logging quality metrics without regenerating output or adding latency to the application.
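The offline shape described above (dataset, task under test, scorers, resulting experiment) can be sketched in plain TypeScript. This is a minimal local illustration, not the Braintrust SDK: `runEval`, `exactMatch`, `greet`, and all type names are hypothetical and chosen only to mirror the terms in the text.

```typescript
// Hypothetical local sketch of the Eval shape; it does not call Braintrust.

interface TestCase<I, O> {
  input: I;
  expected: O;
}

// A scorer names itself and returns a score between 0 and 1.
type Scorer<I, O> = (args: { input: I; output: O; expected: O }) => {
  name: string;
  score: number;
};

interface ExperimentRow<I, O> {
  input: I;
  output: O;
  expected: O;
  scores: Record<string, number>;
}

interface Experiment<I, O> {
  name: string;
  rows: ExperimentRow<I, O>[];
  summary: Record<string, number>; // mean score per scorer
}

async function runEval<I, O>(
  name: string,
  data: TestCase<I, O>[],
  task: (input: I) => Promise<O>,
  scorers: Scorer<I, O>[],
): Promise<Experiment<I, O>> {
  const rows: ExperimentRow<I, O>[] = [];
  const totals: Record<string, number> = {};

  for (const { input, expected } of data) {
    // Run the task under test, then score its output.
    const output = await task(input);
    const scores: Record<string, number> = {};
    for (const scorer of scorers) {
      const result = scorer({ input, output, expected });
      scores[result.name] = result.score;
      totals[result.name] = (totals[result.name] ?? 0) + result.score;
    }
    rows.push({ input, output, expected, scores });
  }

  // Average each scorer over all test cases.
  const summary: Record<string, number> = {};
  for (const key of Object.keys(totals)) {
    summary[key] = (totals[key] ?? 0) / rows.length;
  }
  return { name, rows, summary };
}

// Code-heuristic scorer: 1 when the output equals the expected value, else 0.
const exactMatch: Scorer<string, string> = ({ output, expected }) => ({
  name: "exact_match",
  score: output === expected ? 1 : 0,
});

// Stand-in for the application under test (a real task would call a model).
const greet = async (input: string): Promise<string> => `Hi ${input}`;
```

Calling `runEval("greeting-v1", cases, greet, [exactMatch])` resolves to one experiment record; running it again with a changed task or prompt yields a second record whose `summary` can be compared against the first.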

Primary use cases

  • offline evaluation of LLM applications against datasets
  • comparing prompt and model versions across experiments
  • catching regressions in CI
  • online scoring of production traces
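
Comparing versions and catching regressions in CI comes down to comparing per-scorer results between a baseline experiment and a candidate. The sketch below assumes each experiment has already been reduced to a mean score per scorer; `ScoreSummary`, `findRegressions`, and the `tolerance` default are hypothetical and not part of any Braintrust API.

```typescript
// Hypothetical summary shape: mean score per scorer name for one experiment.
type ScoreSummary = Record<string, number>;

interface Regression {
  scorer: string;
  baseline: number;
  candidate: number;
  delta: number; // candidate minus baseline; negative means worse
}

// Return every scorer whose mean dropped by more than `tolerance`.
function findRegressions(
  baseline: ScoreSummary,
  candidate: ScoreSummary,
  tolerance = 0.02,
): Regression[] {
  const regressions: Regression[] = [];
  for (const scorer of Object.keys(baseline)) {
    const before = baseline[scorer];
    const after = candidate[scorer];
    // Scorers missing on either side are skipped, not counted as drops.
    if (before === undefined || after === undefined) continue;
    const delta = after - before;
    if (delta < -tolerance) {
      regressions.push({ scorer, baseline: before, candidate: after, delta });
    }
  }
  return regressions;
}
```

A CI step could fail the build whenever `findRegressions(...)` returns a non-empty array. The tolerance exists because LLM-judged scores tend to vary between runs, so a strict "any drop fails" rule would likely be noisy.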

Key concepts

  • Eval (pattern: eval-harness): The core offline unit that pairs a dataset, a task function under test, and scoring functions; running it produces an experiment for comparing versions.
  • Experiment: The permanent record produced by running an Eval, capturing inputs, outputs, scores, and metadata so prompt and model versions can be compared and regressions caught.
  • Scorers / autoevals (pattern: llm-as-judge): Scoring functions (code heuristics or LLM-as-a-judge prompts) that assign a quality score to outputs, with pre-built scorers shipped in the open-source autoevals library.
  • Online scoring (pattern: scorer-live-monitoring): Scorers configured to run asynchronously against production traces as they are logged, providing continuous quality monitoring without adding latency.
  • Playground (pattern: prompt-versioning): An interactive surface for testing prompts against sample inputs and saving them to a versioned prompt registry before rolling out to environments.
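
The "without adding latency" property of online scoring follows from keeping scorers off the request path. The sketch below illustrates that idea only: `OnlineScoringQueue`, `Trace`, and `OnlineScorer` are hypothetical names, and in Braintrust the scoring runs on the platform side rather than inside the application process as it does here.

```typescript
// Hypothetical sketch: log a trace now, score it in the background.

interface Trace {
  id: string;
  input: string;
  output: string;
}

type OnlineScorer = (trace: Trace) => Promise<number>;

class OnlineScoringQueue {
  readonly results = new Map<string, Record<string, number>>();
  private pending: Promise<void>[] = [];
  private scorers: Record<string, OnlineScorer>;

  constructor(scorers: Record<string, OnlineScorer>) {
    this.scorers = scorers;
  }

  // Called on the request path: schedules scoring and returns immediately.
  log(trace: Trace): void {
    const job = Promise.resolve()
      .then(() => this.score(trace))
      .catch(() => {
        // A failing scorer must never surface in the application.
      });
    this.pending.push(job);
  }

  // Runs every configured scorer against an already-logged trace.
  private async score(trace: Trace): Promise<void> {
    const scores: Record<string, number> = {};
    for (const name of Object.keys(this.scorers)) {
      const scorer = this.scorers[name];
      if (scorer) scores[name] = await scorer(trace);
    }
    this.results.set(trace.id, scores);
  }

  // For shutdown or tests: wait for all background scoring to finish.
  async flush(): Promise<void> {
    await Promise.all(this.pending);
    this.pending = [];
  }
}
```

The scorers only read the logged trace, so the application's output is never regenerated; the caller of `log` sees no scoring cost beyond scheduling the job.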

