Ragas (synthetic testset generation)

Type: full-code · Vendor: Exploding Gradients · Language: Python · License: Apache-2.0 · Status: active · Status in practice: mature · First released: 2023-07-01

Links: homepage docs repo

Ragas generates synthetic test sets across named question dimensions and scores RAG and agent outputs with LLM-based evaluation metrics.

Description. Ragas is a Python library for evaluating retrieval-augmented and agentic LLM applications. Its synthetic test data generator enumerates named question types (such as reasoning, conditional, and multi-context) with a configurable distribution and named personas. This produces a diverse evaluation set, whereas free-form LLM prompting tends to collapse onto a narrow range of similar questions. Its LLM-based metrics use a configured LLM to score outputs against criteria. Ragas is released under the Apache 2.0 license.

Agent loop shape. From a set of source documents, Ragas synthesises test questions by enumerating question evolution types across a configurable distribution and seeding generation with named personas, producing a diverse held-out set. At evaluation time, each sample is scored by the selected metrics. LLM-based metrics issue one or more LLM calls to grade the output against criteria, yielding scores intended to track human judgement.
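The generation half of this loop can be sketched without the library. The code below is a minimal, library-independent illustration and not Ragas's API: `Persona`, `allocate_counts`, `plan_testset`, and the example weights are hypothetical names and values chosen for illustration. It shows one way a configurable distribution over question types and a list of personas could be turned into a generation plan. A real run would hand each plan entry, together with source documents, to an LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona record: a named role plus a short description."""
    name: str
    description: str


def allocate_counts(distribution, size):
    """Split `size` questions across question types in proportion to their weights.

    Uses the largest-remainder method so the counts always sum to `size`.
    """
    total = sum(distribution.values())
    if total <= 0:
        raise ValueError("distribution weights must sum to a positive number")
    exact = {qtype: size * weight / total for qtype, weight in distribution.items()}
    counts = {qtype: int(value) for qtype, value in exact.items()}
    leftover = size - sum(counts.values())
    # Give the remaining slots to the types with the largest fractional parts.
    by_remainder = sorted(
        distribution, key=lambda qtype: exact[qtype] - counts[qtype], reverse=True
    )
    for qtype in by_remainder[:leftover]:
        counts[qtype] += 1
    return counts


def plan_testset(distribution, personas, size):
    """Return a list of (question_type, persona) pairs, cycling through personas."""
    if not personas:
        raise ValueError("at least one persona is required")
    plan = []
    for qtype, count in allocate_counts(distribution, size).items():
        for _ in range(count):
            plan.append((qtype, personas[len(plan) % len(personas)]))
    return plan


if __name__ == "__main__":
    # Illustrative weights only; the type names follow the ones listed in this entry.
    distribution = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}
    personas = [
        Persona("new_user", "Asks basic how-to questions."),
        Persona("auditor", "Asks precise questions about edge cases."),
    ]
    for qtype, persona in plan_testset(distribution, personas, size=8):
        print(qtype, persona.name)
```

Allocating exact counts, as opposed to sampling each question's type at random, keeps small test sets from drifting away from the requested distribution. That is a design choice of this sketch, not a statement about how Ragas allocates types.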

Primary use cases

  • synthetic test set generation for RAG
  • LLM-based evaluation of RAG and agent outputs
  • measuring retrieval and answer quality
  • regression testing of LLM pipelines

Key concepts

  • Question evolution types [dimensional-synthetic-eval-set] (docs): Named transformations (simple, reasoning, multi_context, conditional) applied with a configurable distribution to craft diverse questions from source documents.
  • Persona [dimensional-synthetic-eval-set] (docs): A named role with a description used to seed test-set generation so questions reflect different user viewpoints rather than one generic asker.
  • Faithfulness metric [llm-as-judge] (docs): An LLM-based metric that breaks the response into claims and scores the fraction supported by the retrieved context, measuring groundedness against hallucination.
  • EvaluationDataset [eval-harness] (docs): The collection of samples (queries, contexts, responses, references) that the evaluate() function runs metrics against to produce scores.
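The faithfulness score described above is a ratio: supported claims divided by total claims. The sketch below illustrates that arithmetic and a tiny evaluation loop over samples. It is not Ragas's implementation. `Sample`, `split_claims`, `substring_judge`, `faithfulness_score`, and `evaluate_samples` are hypothetical names, and the two LLM steps (claim extraction and the support verdict) are replaced by a sentence splitter and a verbatim substring check so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """Hypothetical stand-in for one evaluation sample."""
    query: str
    contexts: List[str]
    response: str


def split_claims(response: str) -> List[str]:
    # Stand-in for LLM claim extraction: treat each sentence as one claim.
    return [part.strip() for part in response.split(".") if part.strip()]


def substring_judge(claim: str, contexts: List[str]) -> bool:
    # Stand-in for an LLM verdict: a claim counts as supported only if it
    # appears verbatim (case-insensitive) in at least one context.
    return any(claim.lower() in context.lower() for context in contexts)


def faithfulness_score(
    sample: Sample,
    judge: Callable[[str, List[str]], bool] = substring_judge,
) -> float:
    """Fraction of the response's claims that the judge marks as supported."""
    claims = split_claims(sample.response)
    if not claims:
        # Assumption of this sketch: a response with no claims scores 0.0.
        return 0.0
    supported = sum(1 for claim in claims if judge(claim, sample.contexts))
    return supported / len(claims)


def evaluate_samples(
    samples: List[Sample],
    metrics: Dict[str, Callable[[Sample], float]],
) -> List[Dict[str, float]]:
    """Run every metric on every sample and return one score dict per sample."""
    return [{name: metric(s) for name, metric in metrics.items()} for s in samples]


if __name__ == "__main__":
    sample = Sample(
        query="What is the capital of France?",
        contexts=["Paris is the capital of France. It lies on the Seine."],
        response="Paris is the capital of France. Paris has ten million residents.",
    )
    print(evaluate_samples([sample], {"faithfulness": faithfulness_score}))
```

Passing the judge as a parameter means the substring check can be swapped for a function that calls an LLM without changing the scoring arithmetic.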

Patterns this full-code implements —