Self-RAG
also known as Self-Reflective RAG
Fine-tune the model to emit reflection tokens that decide when to retrieve, evaluate retrieved relevance, and assess generated support.
This pattern helps complete certain larger patterns:
- specialises Agentic RAG ★★: Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
Context
A team is building a retrieval-augmented system where retrieval is not always the right thing to do. Some queries are easy and can be answered from the model's parametric knowledge; others genuinely require fresh evidence from the corpus. Even when retrieval happens, the chunks returned may not be relevant, and even when they are relevant, the final generation may not actually be supported by them. The team needs the model itself to reason about each of these decisions per request, instead of forcing every query through the same fixed pipeline.
Problem
Static retrieve-then-generate pipelines retrieve regardless of whether retrieval is needed, and they generate regardless of whether the retrieved evidence is actually relevant or whether the generation is grounded in it. Cheap queries that did not need retrieval still pay for it. Bad retrievals still feed the generator. Ungrounded generations still ship to the user. Without explicit reflective steps where the model decides whether to retrieve, judges the relevance of what it retrieved, and checks whether its own draft is supported by the evidence, the system both wastes calls and quietly admits hallucinations into production.
Forces
- Token vocabulary expansion adds training complexity.
- Reflection tokens must be enforced at inference, not just trained.
- Self-evaluation shares the model's blind spots: an error the generator cannot see is one its own reflection tokens are likely to miss as well.
Example
A document-QA agent always retrieves three chunks per query, even for trivial questions, and always generates an answer regardless of whether the retrieved chunks support one. The team fine-tunes a Self-RAG variant that emits inline reflection tokens: `[Retrieve]` decides per-query whether to retrieve, `[IsRel]` filters retrieved chunks, `[IsSup]` checks whether the generated claim is supported. Useless retrievals drop and unsupported answers are flagged before they reach the user.
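The control flow in this example can be sketched as a small host loop. This is a minimal illustration under stated assumptions, not the team's implementation: `generate` and `retrieve` are hypothetical stand-ins for the fine-tuned model and the retriever, the inline `[Name=Value]` token format and the value strings are assumed, and each decision is split into a separate model call for readability even though a Self-RAG model emits these tokens inline.

```python
import re
from dataclasses import dataclass, field

# Hypothetical inline format: [Retrieve=Yes], [IsRel=Relevant], [IsSup=No support].
TOKEN_RE = re.compile(r"\[(Retrieve|IsRel|IsSup)=([^\]]+)\]")


def read_token(text, name):
    """Return the value of the first reflection token called `name`, or None."""
    for found, value in TOKEN_RE.findall(text):
        if found == name:
            return value.strip()
    return None


def strip_tokens(text):
    """Remove reflection tokens so only the user-facing text remains."""
    return re.sub(r"\s+", " ", TOKEN_RE.sub("", text)).strip()


@dataclass
class Answer:
    text: str
    retrieved: bool
    kept_chunks: list = field(default_factory=list)
    flagged: bool = False


def answer(query, generate, retrieve):
    """Host loop for one query.

    `generate(prompt) -> str` and `retrieve(query) -> list[str]` are stand-ins
    for the fine-tuned model and the retriever; both are assumptions.
    """
    # Step 1: the model decides whether retrieval is needed at all.
    first = generate(query)
    if read_token(first, "Retrieve") != "Yes":
        return Answer(text=strip_tokens(first), retrieved=False)

    # Step 2: keep only the chunks the model marks as relevant.
    kept = []
    for chunk in retrieve(query):
        verdict = generate(f"{query}\n<chunk>{chunk}</chunk>")
        if read_token(verdict, "IsRel") == "Relevant":
            kept.append(chunk)

    # Step 3: draft from the kept chunks and read the support token.
    draft = generate(f"{query}\n<evidence>{' '.join(kept)}</evidence>")
    supported = read_token(draft, "IsSup") == "Fully supported"
    return Answer(strip_tokens(draft), True, kept, flagged=not supported)
```

With this shape, a trivial question returns after a single model call, and a draft whose support token is anything other than fully supported comes back with `flagged=True`, so the caller can withhold or review it.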
Solution
Therefore:
A critic model is first trained to label data with reflection tokens. The generator is then fine-tuned on the labeled data so that, at inference, it emits four types of reflection token inline: `[Retrieve]` (should the model retrieve?), `[IsRel]` (is the retrieved evidence relevant?), `[IsSup]` (is the generation supported by that evidence?), and `[IsUse]` (is the generation useful?). The host enforces the reflection grammar and uses the emitted tokens to control flow.
What this pattern forbids. Generation steps are gated by the reflection grammar; the model cannot generate freely without emitting the appropriate reflection tokens.
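One way a host might check the reflection grammar is to validate each generated segment's tokens before accepting it. The sketch below is an assumption-laden illustration: the token names come from the pattern text, but the allowed value sets, the required ordering, and the function names are hypothetical choices made for this example.

```python
# Hypothetical grammar for one generated segment. Token names follow the
# pattern text; the value sets and ordering rules are assumptions.
ALLOWED = {
    "Retrieve": {"Yes", "No"},
    "IsRel": {"Relevant", "Irrelevant"},
    "IsSup": {"Fully supported", "Partially supported", "No support"},
    "IsUse": {"1", "2", "3", "4", "5"},
}


def expected_order(retrieve_value):
    """Token names a segment must carry, in order, given the retrieval decision."""
    if retrieve_value == "Yes":
        return ["Retrieve", "IsRel", "IsSup", "IsUse"]
    return ["Retrieve", "IsUse"]


def check_segment(tokens):
    """Validate a segment's reflection tokens, given as (name, value) pairs.

    Returns (True, "ok") or (False, reason).
    """
    if not tokens or tokens[0][0] != "Retrieve":
        return False, "segment must open with a Retrieve token"
    for name, value in tokens:
        if value not in ALLOWED.get(name, set()):
            return False, f"illegal token {name}={value}"
    names = [name for name, _ in tokens]
    wanted = expected_order(tokens[0][1])
    if names != wanted:
        return False, f"expected {wanted}, got {names}"
    return True, "ok"
```

This version rejects a segment after the fact, for example one that retrieved but skipped its `[IsSup]` check. A stricter host could apply the same rules during decoding by restricting which tokens the model is allowed to emit next, so that an ungated generation can never be produced in the first place.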
The smaller patterns that complete this one:
- uses Reflection ★★: Have the model review its own output and produce a revised version in one or more passes.