Langfuse

Type: low-code · Vendor: Langfuse (Finto Technologies GmbH) · Language: TypeScript · License: MIT · Status: active · Status in practice: mature · First released: 2023


Langfuse records production traces of LLM and agent applications and layers prompt management, datasets, and evaluations on top of them, so teams can debug and measure their applications.

Description. Langfuse is an open-source, self-hostable LLM engineering platform built by Langfuse GmbH. Applications send traces of their LLM calls, tool calls, and agent steps to Langfuse through SDKs and integrations, and the platform stores them for inspection. On top of those traces it provides prompt versioning, dataset-based experiments, and model-based evaluation. It therefore sits beside an application as an observability and evaluation backend rather than running the agent loop itself.

Agent loop shape. Langfuse does not run an agent loop; it is an out-of-band observability sink. The application's own loop emits spans for each LLM call, tool call, and step, and Langfuse ingests them as nested traces. Evaluators then run asynchronously over a sampled share of incoming traces, attaching scores back to them.
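The shape described above can be sketched in a few lines. This is a minimal in-memory model, not the Langfuse SDK: the names `Observation`, `Trace`, `is_sampled`, and `evaluate_sampled` are hypothetical, the hash-based sampling rule is an assumption (the text only says a sampled share of traces is evaluated), and the evaluator is a placeholder that counts steps where a real one might call a judge model. For simplicity the evaluation runs synchronously here, whereas Langfuse runs it asynchronously.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One step inside a run: an LLM call, a tool call, or a generic span."""
    name: str
    kind: str  # "generation", "tool", or "span" (illustrative labels)
    children: list["Observation"] = field(default_factory=list)


@dataclass
class Trace:
    """One application run: a tree of observations plus any attached scores."""
    trace_id: str
    root: Observation
    scores: dict[str, float] = field(default_factory=dict)


def is_sampled(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < rate


def count_observations(obs: Observation) -> int:
    """Count an observation and everything nested beneath it."""
    return 1 + sum(count_observations(child) for child in obs.children)


def evaluate_sampled(traces: list[Trace], rate: float) -> int:
    """Attach a toy score to each sampled trace; return how many were scored."""
    scored = 0
    for trace in traces:
        if is_sampled(trace.trace_id, rate):
            # Placeholder evaluator (assumption): a real one might call a judge model.
            trace.scores["step_count"] = float(count_observations(trace.root))
            scored += 1
    return scored


def build_example_trace(trace_id: str) -> Trace:
    """Stand-in for what the application's own loop would emit while running."""
    root = Observation("agent-run", "span", [
        Observation("plan", "generation"),
        Observation("search", "tool", [Observation("summarize", "generation")]),
    ])
    return Trace(trace_id, root)
```

Hashing the trace id rather than drawing a random number keeps the sampling decision reproducible: the same trace is always either in or out of the evaluated share for a given rate.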

Primary use cases

tracing and observability for LLM and agent applications
prompt management and versioning
LLM-as-a-judge evaluation of production traces
dataset-based experiments and offline evaluation
cost and latency monitoring

flowchart TD
    fw["Langfuse"]
    fw --> p1["Prompt Versioning (first-class)"]
    fw --> p2["Sampled Prompt Trace Eval (first-class)"]
    fw --> p3["LLM-as-Judge (first-class)"]
    fw --> p4["Cost Observability (first-class)"]
    fw --> p5["Eval Harness (first-class)"]

Key concepts

Trace / Observation → cost-observability (docs) — The nested record Langfuse ingests for one application run; a trace groups observations (LLM generations, tool calls, retrieval and other spans), each carrying timing, inputs, outputs, and cost.
Prompt label → prompt-versioning (docs) — A named pointer (such as production or staging) that resolves to a specific prompt version in the registry, so deploying or rolling back a prompt is done by moving the label between versions.
Dataset / Experiment → eval-harness (docs) — A dataset is a collection of inputs and expected outputs; an experiment runs an application version against that dataset to score and compare it, which is how Langfuse supports offline regression testing.
Score → llm-as-judge (docs) — A numeric, boolean, or categorical value attached to a trace by an evaluator (LLM-as-a-judge, code evaluator, user feedback, or manual label), which is the unit Langfuse uses to quantify quality.
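The prompt label idea from the list above, a named pointer that is moved between versions, can be illustrated with a small registry. This is a hedged sketch and not Langfuse's implementation or API: `PromptRegistry` and its methods are hypothetical names, and version numbers are assumed to be sequential integers starting at 1.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Hypothetical in-memory registry: versions are append-only, labels move."""
    versions: dict[int, str] = field(default_factory=dict)
    labels: dict[str, int] = field(default_factory=dict)

    def add_version(self, text: str) -> int:
        """Store a new prompt version and return its number (1, 2, 3, ...)."""
        number = len(self.versions) + 1
        self.versions[number] = text
        return number

    def set_label(self, label: str, version: int) -> None:
        """Point a label at an existing version; this is the deploy or rollback step."""
        if version not in self.versions:
            raise KeyError(f"unknown prompt version: {version}")
        self.labels[label] = version

    def resolve(self, label: str) -> str:
        """Return the prompt text the label currently points at."""
        return self.versions[self.labels[label]]
```

In this model, deploying means calling `set_label("production", new_version)` and rolling back means calling it again with the previous version number. The application always calls `resolve("production")`, so it needs no code change in either case.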
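The dataset and experiment concepts can likewise be sketched as a loop that runs an application over inputs with expected outputs and averages a score. The names `DatasetItem`, `exact_match`, and `run_experiment` are illustrative assumptions rather than Langfuse's API, and exact string matching stands in for whatever evaluator a team would actually use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class DatasetItem:
    """One dataset row: an input and the output we expect for it."""
    input: str
    expected_output: str


def exact_match(output: str, expected: str) -> float:
    """Toy scorer: 1.0 if the trimmed strings are equal, otherwise 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(
    dataset: Sequence[DatasetItem],
    app: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run `app` on every item and return the mean score across the dataset."""
    if not dataset:
        raise ValueError("dataset must contain at least one item")
    scores = [scorer(app(item.input), item.expected_output) for item in dataset]
    return sum(scores) / len(scores)
```

Running two application versions through `run_experiment` against the same dataset and comparing the two means is the offline regression check in its simplest form.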