Langfuse
Type: low-code · Vendor: Langfuse (Finto Technologies GmbH) · Language: TypeScript · License: MIT · Status: active · Status in practice: mature · First released: 2023
Langfuse records production traces of LLM and agent applications and provides prompt management, datasets, and evaluations on top of them, so teams can debug and measure their applications.
Description. Langfuse is an open-source, self-hostable LLM engineering platform built by Langfuse GmbH. Applications send traces of their LLM calls, tool calls, and agent steps to Langfuse through SDKs and integrations, and the platform stores them for inspection. On top of those traces it provides prompt versioning, dataset-based experiments, and model-based evaluation, so it sits beside an application as an observability and evaluation backend rather than running the agent loop itself.
Agent loop shape. Langfuse does not run an agent loop; it is an out-of-band observability sink. The application's own loop emits spans for each LLM call, tool call, and step, and Langfuse ingests them as nested traces. Evaluators then run asynchronously over a sampled share of incoming traces, attaching scores back to them.
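The out-of-band shape described above can be illustrated with a minimal sketch. It does not use the Langfuse SDK: `InMemoryTracer`, `Observation`, and `run_agent` are hypothetical names, and the tracer only records spans in memory where a real client would send them to the backend. The point is that the application owns the loop, while the tracer derives nesting from whichever span is currently open.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """One span in a trace: an LLM generation, a tool call, or a generic step."""
    name: str
    kind: str                      # "generation", "tool", or "span" (illustrative labels)
    parent_id: Optional[str]
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None
    output: object = None


class InMemoryTracer:
    """Hypothetical stand-in for an observability client; it stores spans locally."""

    def __init__(self):
        self.observations = []
        self._stack = []           # currently open spans, innermost last

    @contextmanager
    def observe(self, name, kind="span"):
        # The parent is whichever span is open when this one starts.
        parent_id = self._stack[-1].id if self._stack else None
        obs = Observation(name=name, kind=kind, parent_id=parent_id)
        self.observations.append(obs)
        self._stack.append(obs)
        try:
            yield obs
        finally:
            obs.end = time.monotonic()
            self._stack.pop()


def run_agent(tracer, question):
    """The application's own loop; the tracer only watches it."""
    with tracer.observe("agent-run") as root:
        with tracer.observe("plan", kind="generation") as gen:
            gen.output = "call search tool"       # placeholder for a model call
        with tracer.observe("search", kind="tool") as tool:
            tool.output = ["doc-1", "doc-2"]      # placeholder for a tool result
        with tracer.observe("answer", kind="generation") as gen:
            gen.output = f"Answer to: {question}"
        root.output = gen.output
    return root.output


tracer = InMemoryTracer()
result = run_agent(tracer, "What is tracing?")
root = tracer.observations[0]
children = [o.name for o in tracer.observations if o.parent_id == root.id]
print(children)  # ['plan', 'search', 'answer']
```

Because the tracer never calls back into the agent, removing it would not change the agent's behaviour, which is what makes it a sink rather than part of the loop.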
Primary use cases
- tracing and observability for LLM and agent applications
- prompt management and versioning
- LLM-as-a-judge evaluation of production traces
- dataset-based experiments and offline evaluation
- cost and latency monitoring
Key concepts
- Trace / Observation → cost-observability (docs) — The nested record Langfuse ingests for one application run; a trace groups observations (LLM generations, tool calls, retrieval and other spans), each carrying timing, inputs, outputs, and cost.
- Prompt label → prompt-versioning (docs) — A named pointer (such as production or staging) that resolves to a specific prompt version in the registry, so deploying or rolling back a prompt is done by moving the label between versions.
- Dataset / Experiment → eval-harness (docs) — A dataset is a collection of inputs and expected outputs; an experiment runs an application version against that dataset to score and compare it, which is how Langfuse supports offline regression testing.
- Score → llm-as-judge (docs) — A numeric, boolean, or categorical value attached to a trace by an evaluator (LLM-as-a-judge, code evaluator, user feedback, or manual label), which is the unit Langfuse uses to quantify quality.
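The prompt label concept in the list above can be made concrete with a small sketch. This is an in-memory model, not the Langfuse API: `PromptRegistry`, `save`, `set_label`, and `get` are hypothetical names, and the prompt name and text are invented for illustration. It assumes only what the entry states: versions are immutable and numbered, and a label is a movable pointer to one of them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    """An immutable prompt version; frozen so saved text cannot be edited."""
    version: int
    text: str


@dataclass
class PromptRegistry:
    """Hypothetical in-memory registry used only to illustrate labels."""
    versions: dict = field(default_factory=dict)   # name -> list of PromptVersion
    labels: dict = field(default_factory=dict)     # (name, label) -> version number

    def save(self, name, text):
        # Every save appends a new version; earlier versions are never changed.
        history = self.versions.setdefault(name, [])
        version = PromptVersion(version=len(history) + 1, text=text)
        history.append(version)
        return version.version

    def set_label(self, name, label, version):
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self.labels[(name, label)] = version

    def get(self, name, label="production"):
        # Resolve the label to a version number, then to the stored version.
        return self.versions[name][self.labels[(name, label)] - 1]


registry = PromptRegistry()
v1 = registry.save("support-reply", "Answer briefly: {{question}}")
v2 = registry.save("support-reply", "Answer briefly and cite sources: {{question}}")

registry.set_label("support-reply", "production", v1)
registry.set_label("support-reply", "staging", v2)

registry.set_label("support-reply", "production", v2)   # deploy: move the label forward
deployed = registry.get("support-reply").version
registry.set_label("support-reply", "production", v1)   # roll back: move it back
rolled_back = registry.get("support-reply").version
print(deployed, rolled_back)  # 2 1
```

Deploying and rolling back touch only the label mapping, so application code that asks for the production label needs no change when the prompt does.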
Patterns this low-code tool implements
- ★★Prompt Versioning
Langfuse Prompt Management stores each prompt in a registry where every save creates a new immutable version number, and labels (production/staging) point at a chosen version so deploys and rollbacks…
- ★Sampled Prompt Trace Eval
Langfuse ingests full production traces and lets you attach LLM-as-a-judge evaluators that run on a configurable sampling percentage of traces so judge cost stays bounded as traffic grows.
- ★★LLM-as-Judge
Langfuse provides model-based evaluation where one LLM scores the outputs of the traced application; evaluator prompts produce numeric or categorical scores that are attached back to the trace.
- ★★Cost Observability
Langfuse tracks token usage and cost on each generation observation and surfaces cost, latency, and quality metrics in the dashboard so operators can monitor per-call and per-application spend.
- ★★Eval Harness
Langfuse Datasets hold collections of inputs and expected outputs, and Experiments run an application version against the dataset to test it systematically, so regressions can be measured across vers…
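As a rough illustration of how the sampled evaluation and LLM-as-judge patterns above fit together, the sketch below scores only a fixed share of traces and attaches the result to each judged trace. Everything in it is an assumption made for illustration: `Trace`, `in_sample`, `stub_judge`, and `evaluate_sampled` are hypothetical names, the hash-based sampling is one possible way to pick a stable subset and is not described in the entry, and the judge is a stub where a real evaluator would prompt a model.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Trace:
    id: str
    input: str
    output: str
    scores: dict = field(default_factory=dict)   # score name -> value


def in_sample(trace_id, rate):
    """Deterministically select roughly `rate` of traces by hashing the trace id."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Treat the first 8 bytes as an integer in [0, 2**64) and compare to the cut-off.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def stub_judge(trace):
    """Placeholder for an LLM judge call; it only checks for a non-empty output."""
    return 1.0 if trace.output.strip() else 0.0


def evaluate_sampled(traces, rate, judge=stub_judge, score_name="non_empty"):
    """Run the judge on the sampled traces only and attach a score to each."""
    judged = 0
    for trace in traces:
        if in_sample(trace.id, rate):
            trace.scores[score_name] = judge(trace)
            judged += 1
    return judged


traces = [
    Trace(id=f"trace-{i}", input="q", output="a" if i % 5 else "")
    for i in range(1000)
]
judged = evaluate_sampled(traces, rate=0.1)
print(f"judged {judged} of {len(traces)} traces")
```

With a 10% rate the number of judge calls grows at a tenth of the traffic rate, which is the cost-bounding effect the pattern describes. Hashing the id rather than drawing a random number means a given trace is always either in or out of the sample.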