Ragas (synthetic testset generation)
Type: full-code · Vendor: Exploding Gradients · Language: Python · License: Apache-2.0 · Status: active · Status in practice: mature · First released: 2023-07-01
Ragas generates synthetic test sets across named question dimensions and scores RAG and agent outputs with LLM-based evaluation metrics.
Description. Ragas is a Python library for evaluating retrieval-augmented and agentic LLM applications. Its synthetic test data generator enumerates named question types (such as reasoning, conditional, and multi-context) under a configurable distribution and seeds generation with named personas, producing a diverse evaluation set rather than relying on free-form LLM prompting, which tends to mode-collapse. Its LLM-based metrics use a configured LLM to score outputs against stated criteria. Ragas is released under the Apache-2.0 license.
Agent loop shape. From a set of source documents, Ragas synthesises test questions by applying question evolution types according to a configurable distribution and seeding generation with named personas, producing a diverse held-out set. At evaluation time, each sample is scored by the selected metrics; LLM-based metrics issue one or more LLM calls per sample to grade the output against criteria, with the aim of producing scores that track human judgement.
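The generation half of this loop can be pictured as a planning step that turns a distribution into a concrete list of (question type, persona) slots. The sketch below is illustrative only and does not use the Ragas API: plan_testset, its parameters, and the persona names are hypothetical, and the largest-remainder rounding is an assumption about how fractional counts might be resolved.

```python
import random
from itertools import cycle


def plan_testset(test_size, distribution, personas, seed=0):
    """Allocate test_size slots across question types, then pair each with a persona.

    Hypothetical helper for illustration; not part of Ragas.
    """
    if abs(sum(distribution.values()) - 1.0) > 1e-9:
        raise ValueError("distribution weights must sum to 1")

    # Largest-remainder allocation so the counts sum exactly to test_size.
    raw = {name: weight * test_size for name, weight in distribution.items()}
    counts = {name: int(value) for name, value in raw.items()}
    leftover = test_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1

    # Expand counts into slots, shuffle deterministically, and rotate personas.
    slots = [name for name, count in counts.items() for _ in range(count)]
    random.Random(seed).shuffle(slots)
    persona_cycle = cycle(personas)
    return [(question_type, next(persona_cycle)) for question_type in slots]


plan = plan_testset(
    test_size=10,
    distribution={"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25},
    personas=["new user", "domain expert"],
)
```

Each slot in plan would then drive one generation call, so the mix of question types and viewpoints is fixed up front instead of being left to whatever the model tends to produce.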
Primary use cases
- synthetic test set generation for RAG
- LLM-based evaluation of RAG and agent outputs
- measuring retrieval and answer quality
- regression testing of LLM pipelines
Key concepts
- Question evolution types → dimensional-synthetic-eval-set (docs) — Named transformations (simple, reasoning, multi_context, conditional) applied with a configurable distribution to craft diverse questions from source documents.
- Persona → dimensional-synthetic-eval-set (docs) — A named role with a description used to seed test-set generation so questions reflect different user viewpoints rather than one generic asker.
- Faithfulness metric → llm-as-judge (docs) — An LLM-based metric that breaks the response into claims and scores the fraction supported by the retrieved context, measuring groundedness against hallucination.
- EvaluationDataset → eval-harness (docs) — The collection of samples (queries, contexts, responses, references) that the evaluate() function runs metrics against to produce scores.
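The Faithfulness entry above reduces to simple arithmetic once each claim has a verdict. The sketch below shows only that final step, assuming the claims have already been extracted and judged; in the real metric those verdicts come from LLM calls, whereas here they are hard-coded. The function name faithfulness_score and the choice to return None for an empty claim set are hypothetical.

```python
def faithfulness_score(claim_verdicts):
    """Return the fraction of claims judged as supported by the retrieved context.

    claim_verdicts maps each claim to True (supported) or False (unsupported).
    Hypothetical helper for illustration; not the Ragas implementation.
    """
    if not claim_verdicts:
        return None  # assumption: no claims means the score is undefined
    supported = sum(1 for verdict in claim_verdicts.values() if verdict)
    return supported / len(claim_verdicts)


# Verdicts as a judge might return them for a context stating that
# Einstein was born in Germany on 14 March 1879.
verdicts = {
    "Einstein was born in Germany.": True,
    "Einstein was born on 20 March 1879.": False,
}
score = faithfulness_score(verdicts)  # 1 of 2 claims supported
```

A response whose claims are all backed by the context scores 1.0, and each unsupported claim lowers the score proportionally.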
Patterns this full-code implements
- ★Dimensional Synthetic Eval Set
Ragas rejects naive LLM prompting (which mode-collapses) and instead enumerates named question evolution types (simple, reasoning, multi_context, conditional) with a configurable distribution plus na…
- ★★LLM-as-Judge
Ragas scores outputs with LLM-based metrics that use an LLM underneath to perform the evaluation, issuing one or more LLM calls per sample to grade the output against criteria, an approach it states…
- ★★Eval Harness
The evaluate() function runs a held-out EvaluationDataset of samples through a chosen list of metrics and returns a per-metric score for the pipeline, the offline batch run used to measure and compar…
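The eval-harness pattern in the list above can be sketched as a loop that applies every metric to every sample and aggregates per metric. This is a minimal stand-in under stated assumptions, not the Ragas evaluate() function: run_harness, the two toy metrics, and the dictionary sample layout are hypothetical, and the metrics are deterministic string checks so that the example runs without an LLM.

```python
from statistics import mean


def run_harness(samples, metrics):
    """Score every sample with every metric and return the mean score per metric."""
    return {
        name: mean(metric(sample) for sample in samples)
        for name, metric in metrics.items()
    }


def exact_match(sample):
    """1.0 if the response equals the reference, ignoring case and outer whitespace."""
    return float(sample["response"].strip().lower() == sample["reference"].strip().lower())


def reference_in_context(sample):
    """1.0 if any retrieved context contains the reference answer."""
    reference = sample["reference"].lower()
    return float(any(reference in context.lower() for context in sample["contexts"]))


samples = [
    {
        "query": "Capital of France?",
        "contexts": ["Paris is the capital of France."],
        "response": "Paris",
        "reference": "Paris",
    },
    {
        "query": "Capital of Italy?",
        "contexts": ["Milan is a city in Italy."],
        "response": "Milan",
        "reference": "Rome",
    },
]

scores = run_harness(
    samples,
    {"exact_match": exact_match, "reference_in_context": reference_in_context},
)
```

Keeping the sample set fixed and rerunning the harness after each pipeline change is what makes the per-metric scores comparable across runs.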