Full-Code · Orchestration Frameworks · active

Ragas (synthetic testset generation)

Type: full-code · Vendor: Exploding Gradients · Language: Python · License: Apache-2.0 · Status: active · Status in practice: mature · First released: 2023-07-01

Ragas generates synthetic test sets across named question dimensions and scores RAG and agent outputs with LLM-based evaluation metrics.

Description. Ragas is a Python library for evaluating retrieval-augmented generation (RAG) and agentic LLM applications. Its synthetic test data generator enumerates named question types (such as reasoning, conditional, and multi-context) under a configurable distribution and seeds generation with named personas, producing a more diverse evaluation set than free-form LLM prompting, which tends to mode-collapse onto a few similar questions. Its metrics include LLM-based metrics, which use a configured LLM to score outputs against stated criteria. Ragas is released under the Apache 2.0 license.
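To make the "configurable distribution plus personas" idea concrete, here is a minimal planning sketch in plain Python. It is not the Ragas API: the `Persona` class, `allocate`, `plan_testset`, and the example persona descriptions are hypothetical names introduced for illustration, and the step that would actually ask an LLM to write each question is left out.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona record: a named role plus a short description."""
    name: str
    description: str


def allocate(distribution: dict[str, float], test_size: int) -> dict[str, int]:
    """Split test_size across question types using largest-remainder rounding."""
    total = sum(distribution.values())
    if total <= 0:
        raise ValueError("distribution weights must sum to a positive value")
    exact = {k: test_size * w / total for k, w in distribution.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = test_size - sum(counts.values())
    # Hand the remaining slots to the types with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


def plan_testset(distribution, personas, test_size, seed=0):
    """Return (question_type, persona) slots; an LLM would fill each slot later."""
    if not personas:
        raise ValueError("at least one persona is required")
    counts = allocate(distribution, test_size)
    slots = [qtype for qtype, n in counts.items() for _ in range(n)]
    random.Random(seed).shuffle(slots)  # seeded so the plan is reproducible
    persona_cycle = itertools.cycle(personas)
    return [(qtype, next(persona_cycle)) for qtype in slots]


# Assumed example values, chosen only to illustrate the shape of a plan.
personas = [
    Persona("New hire", "Asks basic how-do-I questions"),
    Persona("Auditor", "Asks precise, compliance-oriented questions"),
]
distribution = {
    "simple": 0.5,
    "reasoning": 0.25,
    "multi_context": 0.125,
    "conditional": 0.125,
}
plan = plan_testset(distribution, personas, test_size=8)
for qtype, persona in plan:
    print(f"{qtype:14s} {persona.name}")
```

Fixing the counts per question type up front, rather than sampling each question's type independently, guarantees that rare types still appear in small test sets.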

Agent loop shape. From a set of source documents, Ragas synthesises test questions by enumerating question evolution types according to a configurable distribution and seeding generation with named personas, producing a diverse held-out test set. At evaluation time, each sample is scored by one or more metrics; LLM-based metrics issue one or more LLM calls to grade the output against their criteria, yielding scores intended to approximate human judgement.
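The evaluation half of this loop can be sketched as a small harness that applies each metric to each sample and averages the results. This is an illustrative stand-in, not Ragas's own `evaluate()` implementation: `Sample`, `run_metrics`, and the toy `reference_overlap` metric are hypothetical, and a real LLM-based metric would call a model where the toy metric compares tokens.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Sample:
    """One evaluation record: the query, retrieved contexts, output, and reference."""
    query: str
    contexts: list[str]
    response: str
    reference: str


Metric = Callable[[Sample], float]


def reference_overlap(sample: Sample) -> float:
    """Toy metric: fraction of reference tokens that also appear in the response."""
    ref = set(sample.reference.lower().split())
    got = set(sample.response.lower().split())
    return len(ref & got) / len(ref) if ref else 0.0


def run_metrics(samples: list[Sample], metrics: dict[str, Metric]) -> dict[str, float]:
    """Score every sample with every metric and return the mean score per metric."""
    if not samples:
        raise ValueError("need at least one sample to score")
    per_sample = [{name: m(s) for name, m in metrics.items()} for s in samples]
    return {name: mean(row[name] for row in per_sample) for name in metrics}


samples = [
    Sample("capital of France?", ["Paris is the capital of France"],
           "paris is the capital", "paris"),
    Sample("capital of France?", [], "berlin", "paris"),
]
scores = run_metrics(samples, {"reference_overlap": reference_overlap})
print(scores)
```

Keeping metrics as plain callables means a deterministic metric and an LLM-backed one can share the same loop.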

Primary use cases

synthetic test set generation for RAG
LLM-based evaluation of RAG and agent outputs
measuring retrieval and answer quality
regression testing of LLM pipelines

flowchart TD
    fw["Ragas (synthetic testset generation)"]
    fw --> p1["Dimensional Synthetic Eval Set<br/>(core)"]
    fw --> p2["LLM-as-Judge<br/>(core)"]
    fw --> p3["Eval Harness<br/>(supported)"]

Key concepts

Question evolution types → dimensional-synthetic-eval-set (docs) — Named transformations (simple, reasoning, multi_context, conditional) applied with a configurable distribution to craft diverse questions from source documents.
Persona → dimensional-synthetic-eval-set (docs) — A named role with a description used to seed test-set generation so questions reflect different user viewpoints rather than one generic asker.
Faithfulness metric → llm-as-judge (docs) — An LLM-based metric that breaks the response into claims and scores the fraction supported by the retrieved context, measuring groundedness against hallucination.
EvaluationDataset → eval-harness (docs) — The collection of samples (queries, contexts, responses, references) that the evaluate() function runs metrics against to produce scores.
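The arithmetic behind the faithfulness score described above (supported claims divided by total claims) can be shown with a small sketch. Both helper steps here are deliberately crude stand-ins and are assumptions of this example: in Ragas an LLM extracts the claims and judges support, whereas `split_claims` splits on sentence boundaries and `token_supported` checks word overlap.

```python
import re
from typing import Callable


def split_claims(response: str) -> list[str]:
    """Stand-in for LLM claim extraction: treat each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def token_supported(claim: str, contexts: list[str]) -> bool:
    """Stand-in for an LLM verdict: every word of the claim occurs in the contexts."""
    words = set(re.findall(r"\w+", claim.lower()))
    context_words = set(re.findall(r"\w+", " ".join(contexts).lower()))
    return bool(words) and words <= context_words


def faithfulness(
    response: str,
    contexts: list[str],
    is_supported: Callable[[str, list[str]], bool] = token_supported,
) -> float:
    """Fraction of claims in the response that the retrieved contexts support."""
    claims = split_claims(response)
    if not claims:
        return 0.0  # assumption of this sketch: no claims scores zero
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


contexts = ["Einstein was born in Germany in 1879."]
response = "Einstein was born in Germany. Einstein was born in 1900."
print(faithfulness(response, contexts))
```

The second sentence introduces a year that the context does not contain, so only one of the two claims counts as supported. Passing a different `is_supported` callable is where an LLM judge would plug in.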
