LangSmith

Type: app · Vendor: LangChain Inc. · Language: TypeScript · License: proprietary · Status: active · Status in practice: mature · First released: 2023


LangSmith is a hosted platform for tracing, evaluating, and monitoring LLM applications across development and production.

Description. LangSmith captures traces of LLM and agent runs and evaluates them with code rules, human review, and LLM-as-judge evaluators. Offline experiments act as a gate during development, and online evaluators run automatically on production traces for monitoring and alerting. Failing production traces can be added back to datasets to drive targeted offline experiments.
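To make the evaluator kinds concrete, here is a minimal sketch of a code rule and an LLM-as-judge evaluator that return the same score shape. It does not use the LangSmith SDK: `Score`, `max_length_rule`, `make_llm_judge`, and `fake_model` are hypothetical names invented for illustration, and the judge model is a stub so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Score:
    """Hypothetical result shape shared by both evaluator kinds."""
    key: str
    value: float
    comment: str = ""


def max_length_rule(output: str, limit: int = 200) -> Score:
    """Code rule: a deterministic check that needs no model call."""
    ok = len(output) <= limit
    return Score("max_length", 1.0 if ok else 0.0, f"{len(output)} chars")


def make_llm_judge(call_model: Callable[[str], str]) -> Callable[[str, str], Score]:
    """LLM-as-judge: wrap any text-in/text-out model as an evaluator."""
    def judge(question: str, answer: str) -> Score:
        prompt = f"Question: {question}\nAnswer: {answer}\nReply PASS or FAIL."
        verdict = call_model(prompt).strip().upper()
        return Score("judge_correctness", 1.0 if verdict.startswith("PASS") else 0.0, verdict)
    return judge


def fake_model(prompt: str) -> str:
    """Stub standing in for a real judge model (assumption for this sketch)."""
    return "PASS" if "Paris" in prompt else "FAIL"


judge = make_llm_judge(fake_model)
rule_score = max_length_rule("Paris")
judge_score = judge("Capital of France?", "Paris")
```

Because both evaluators return a `Score`, the same aggregation code can consume either one; human review would add scores of the same shape from a person instead of a function.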

Agent loop shape. LangSmith does not run the application's agent loop; it observes and evaluates it. Application runs send traces to LangSmith, where evaluators score them. In the offline track, evaluators run over curated datasets as experiments. In the online track, evaluators run automatically on incoming production runs or threads to provide monitoring and alerting, and flagged runs feed back into datasets.
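The two tracks and the feedback path between them can be sketched in a few lines. This is an illustration of the loop shape only, not LangSmith's API: `run_experiment`, `online_evaluate`, `app_v1`, the 0.8 threshold, and the dictionary layout of runs and examples are all assumptions made for the example.

```python
import random


def run_experiment(app, dataset, threshold=0.8):
    """Offline track: score one app version over a curated dataset and gate on the mean."""
    scores = [1.0 if app(ex["input"]) == ex["expected"] else 0.0 for ex in dataset]
    mean = sum(scores) / len(scores) if scores else 0.0
    return mean, mean >= threshold


def online_evaluate(runs, check, sample_rate=1.0, seed=0):
    """Online track: apply a reference-free check to a sample of production runs."""
    rng = random.Random(seed)
    flagged = []
    for run in runs:
        if rng.random() >= sample_rate:
            continue  # run not sampled for evaluation
        if not check(run):
            flagged.append(run)
    return flagged


def app_v1(text: str) -> str:
    """Hypothetical application under test."""
    return text.upper()


dataset = [{"input": "a", "expected": "A"}, {"input": "b", "expected": "B"}]
mean_before, passed_before = run_experiment(app_v1, dataset)

# Production traffic includes an input the curated dataset never covered.
production_runs = [{"input": i, "output": app_v1(i)} for i in ["c", ""]]
flagged = online_evaluate(production_runs, check=lambda r: len(r["output"]) > 0)

# Feedback path: a reviewer supplies the expected output for each flagged run.
for run in flagged:
    dataset.append({"input": run["input"], "expected": "(EMPTY)"})

mean_after, passed_after = run_experiment(app_v1, dataset)
```

In this sketch the first experiment passes the gate, the online check flags the empty-output run, and the rerun fails the gate once that case is in the dataset, which is the point of feeding production failures back into offline experiments.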

Primary use cases

tracing LLM and agent runs
offline evaluation experiments before deploy
online evaluation and monitoring of production traffic

flowchart TD
    fw["LangSmith"]
    fw --> p1["Dual Evaluation (Offline + Online)<br/>(first-class)"]
    fw --> p2["LLM-as-Judge<br/>(first-class)"]
    fw --> p3["Decision Log<br/>(first-class)"]
    fw --> p4["Prompt Versioning<br/>(supported)"]

Key concepts

Trace and run → decision-log (docs) — A run is a single span of work (an LLM call, a retrieval, a formatting step) and a trace is the collection of runs for one end-to-end operation; together they form the observability record LangSmith stores per request.
Experiment → dual-evaluation-offline-online (docs) — An offline evaluation run that applies a set of evaluators over a curated dataset to score an application version before deploy, the gate side of LangSmith's two evaluation tracks.
Online evaluator → dual-evaluation-offline-online (docs) — An evaluator configured to run automatically on incoming production runs or threads (optionally sampled) to provide live monitoring, anomaly detection, and alerting on production traffic.
Annotation queue (docs) — A human-review surface where runs are queued for people to label or score, complementing the automated code-rule and LLM-as-judge evaluators.
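The run and trace relationship described above can be modelled as a small tree-shaped record. The following is a sketch under assumptions, not LangSmith's actual schema: the `Run` fields, the `run_type` values, and the helpers `group_into_traces` and `children_of` are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Run:
    """One span of work, e.g. an LLM call, a retrieval, or a formatting step."""
    name: str
    run_type: str                     # assumed labels such as "chain", "retriever", "llm"
    trace_id: str                     # shared by every run in one end-to-end operation
    parent_id: Optional[str] = None   # None marks the root run of the trace
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def group_into_traces(runs: List[Run]) -> Dict[str, List[Run]]:
    """A trace is the collection of runs that share a trace_id."""
    traces: Dict[str, List[Run]] = {}
    for run in runs:
        traces.setdefault(run.trace_id, []).append(run)
    return traces


def children_of(parent: Run, runs: List[Run]) -> List[Run]:
    """Direct child runs of a given run, in recorded order."""
    return [r for r in runs if r.parent_id == parent.id]


trace_id = uuid.uuid4().hex
root = Run("answer_question", "chain", trace_id)
retrieve = Run("fetch_docs", "retriever", trace_id, parent_id=root.id)
generate = Run("draft_answer", "llm", trace_id, parent_id=root.id)

traces = group_into_traces([root, retrieve, generate])
child_names = [r.name for r in children_of(root, traces[trace_id])]
```

Keeping `parent_id` on each run lets the flat list be stored as-is and rebuilt into a tree only when a trace is displayed or evaluated.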
