Langfuse

Type: low-code · Vendor: Langfuse (Finto Technologies GmbH) · Language: TypeScript · License: MIT · Status: active · Status in practice: mature · First released: 2023

Langfuse records production traces of LLM and agent applications and layers prompt management, datasets, and evaluations on top of them, so teams can debug and measure their applications.

Description. Langfuse is an open-source, self-hostable LLM engineering platform built by Langfuse GmbH. Applications send traces of their LLM calls, tool calls, and agent steps to Langfuse through SDKs and integrations, and the platform stores them for inspection. On top of those traces it provides prompt versioning, dataset-based experiments, and model-based evaluation. It therefore sits beside an application as an observability and evaluation backend rather than running the agent loop itself.

Agent loop shape. Langfuse does not run an agent loop; it is an out-of-band observability sink. The application's own loop emits spans for each LLM call, tool call, and step, and Langfuse ingests them as nested traces. Evaluators then run asynchronously over a sampled share of incoming traces, attaching scores back to them.
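
Sketch. The flow described above can be illustrated with a small self-contained model. This is not the Langfuse SDK: the names `Observation`, `Trace`, `run_agent`, and `evaluate_sampled` are hypothetical, and the three-step loop and the "quality" score name are assumptions made for the example. It only shows the shape of the idea: the application's loop produces a nested trace, and a separate pass scores a sampled share of traces afterwards.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Observation:
    """One span in a trace: an LLM call, a tool call, or a generic step."""
    name: str
    kind: str  # "generation", "tool", or "span" (labels chosen for this sketch)
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    children: List["Observation"] = field(default_factory=list)


@dataclass
class Trace:
    """Nested record for one application run, plus scores attached later."""
    name: str
    root: Observation
    scores: Dict[str, float] = field(default_factory=dict)


def run_agent(question: str) -> Trace:
    """Hypothetical application loop: the app, not the backend, drives the steps."""
    root = Observation("agent-run", "span")
    steps = [("plan", "generation"), ("search", "tool"), ("answer", "generation")]
    for step_name, kind in steps:
        obs = Observation(step_name, kind)
        obs.end = time.time()  # a real step would do work before closing the span
        root.children.append(obs)
    root.end = time.time()
    return Trace(name=question, root=root)


def evaluate_sampled(
    traces: List[Trace],
    sample_rate: float,
    evaluator: Callable[[Trace], float],
    rng: random.Random,
) -> List[Trace]:
    """Score only a sampled share of traces, outside the request path."""
    scored = []
    for trace in traces:
        if rng.random() < sample_rate:
            trace.scores["quality"] = evaluator(trace)
            scored.append(trace)
    return scored
```

Keeping evaluation in a separate pass means a slow or failing evaluator cannot delay the application's response, and the sample rate bounds evaluation cost.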

Primary use cases

  • tracing and observability for LLM and agent applications
  • prompt management and versioning
  • LLM-as-a-judge evaluation of production traces
  • dataset-based experiments and offline evaluation
  • cost and latency monitoring

Key concepts

  • Trace / Observation (cost-observability). The nested record Langfuse ingests for one application run; a trace groups observations (LLM generations, tool calls, retrieval and other spans), each carrying timing, inputs, outputs, and cost.
  • Prompt label (prompt-versioning). A named pointer (such as production or staging) that resolves to a specific prompt version in the registry, so deploying or rolling back a prompt is done by moving the label between versions.
  • Dataset / Experiment (eval-harness). A dataset is a collection of inputs and expected outputs; an experiment runs an application version against that dataset to score and compare it, which is how Langfuse supports offline regression testing.
  • Score (llm-as-judge). A numeric, boolean, or categorical value attached to a trace by an evaluator (LLM-as-a-judge, code evaluator, user feedback, or manual label), which is the unit Langfuse uses to quantify quality.
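
Sketch. Two of the concepts above, prompt labels and dataset experiments, can be modelled in a few lines. This is a minimal in-memory illustration, not the Langfuse API: `PromptRegistry`, `run_experiment`, and the dataset keys `input` and `expected` are hypothetical names chosen for the example, and version numbering from 1 is an assumption.

```python
from typing import Callable, Dict, List, Tuple


class PromptRegistry:
    """Hypothetical registry: versions are append-only, labels are movable pointers."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}
        self._labels: Dict[Tuple[str, str], int] = {}

    def create_version(self, name: str, text: str) -> int:
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        return len(versions)  # version numbers start at 1 in this sketch

    def set_label(self, name: str, label: str, version: int) -> None:
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self._labels[(name, label)] = version  # deploy or roll back by moving the pointer

    def get(self, name: str, label: str = "production") -> str:
        return self._versions[name][self._labels[(name, label)] - 1]


def run_experiment(
    app: Callable[[str], str],
    dataset: List[Dict[str, str]],
    scorer: Callable[[str, str], float],
) -> float:
    """Run one application version over every dataset item; return the mean score."""
    scores = [scorer(app(item["input"]), item["expected"]) for item in dataset]
    return sum(scores) / len(scores)


def exact_match(output: str, expected: str) -> float:
    """Simplest possible code evaluator: 1.0 on an exact match, else 0.0."""
    return 1.0 if output == expected else 0.0
```

In this model a rollback is `registry.set_label("greet", "production", 1)` after version 2 misbehaves: callers that fetch by label pick up the older text without a code change. Comparing the mean score of two application versions over the same dataset is the regression check.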

Patterns this low-code implements: none listed.
