RAG Microservice Inference Pipeline
also known as business plus LLM microservice, two-tier RAG serving
Serve a fine-tuned LLM behind two microservices with a clear split. The business microservice runs the advanced retrieval work: query rewriting, retrieval, reranking, and prompt assembly. It may also call a strong reference model, such as GPT-4, for steps the small fine-tune cannot do. The LLM microservice loads the fine-tuned model from the registry and answers a clean, ready-to-generate prompt. The split lets each microservice scale and change on its own. The business layer can move faster than the model. The LLM microservice can be reused by other callers.
Methodology process overview
Intent. Split LLM serving into a business microservice and an LLM microservice. The business side handles retrieval orchestration, prompt assembly, and an optional strong reference model. The LLM side loads a fine-tune from the registry and answers a clean prompt. Each side then scales and changes on its own.
When to apply. Use this when you deploy a fine-tuned LLM behind production retrieval, especially when the retrieval work is involved (multi-step retrieval, reranking, query rewriting, hybrid search) and the serving path must be both fast and cost-controlled. Do not use it for prototypes where one process is enough, since the split adds operational overhead, or when only one client calls the LLM and the business layer is trivial. A single service is fine in both cases.
Inputs
- Fine-tuned LLM in registry — A versioned model artefact you can load by tag from a model registry such as Comet, MLflow, or SageMaker.
- Retrieval substrate — A vector store, an optional keyword index, an optional reranker, and the documents you have indexed.
- Optional strong reference model — API access to a frontier model such as GPT-4 or Claude. Use it for orchestration steps the small fine-tune does poorly, such as query rewriting and planning.
- Contract between tiers — A spec for the prompt schema the business microservice sends and the response schema the LLM microservice returns.
Outputs
- Business microservice — The service that runs the advanced retrieval work: retrieval, reranking, query rewrites, prompt assembly, and optional reference-model calls. It emits a ready-to-generate prompt.
- LLM microservice — The service that loads the fine-tuned model from the registry and answers the prompt it receives. It is exposed as a REST API.
- Inter-service contract — A versioned schema that keeps the two tiers separate.
Steps (6)
Define the inter-service contract
Write down the prompt schema the business tier sends and the response schema the LLM tier returns. This contract is the only coupling. Version it on purpose.
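One way to pin the contract down is as a pair of typed messages with an explicit version field. The sketch below assumes Python dataclasses; `CONTRACT_VERSION`, the field names (`prompt`, `max_new_tokens`, `text`, `model_tag`), and the major-version check are illustrative assumptions, not part of the source.

```python
from dataclasses import dataclass, asdict

CONTRACT_VERSION = "1.0"  # hypothetical version string; bump the major part on breaking changes


@dataclass(frozen=True)
class GenerateRequest:
    """What the business tier sends (illustrative fields)."""
    prompt: str
    max_new_tokens: int = 256
    contract_version: str = CONTRACT_VERSION


@dataclass(frozen=True)
class GenerateResponse:
    """What the LLM tier returns (illustrative fields)."""
    text: str
    model_tag: str
    contract_version: str = CONTRACT_VERSION


def parse_request(payload: dict) -> GenerateRequest:
    """Validate an incoming payload and reject a mismatched major version."""
    version = str(payload.get("contract_version", ""))
    if version.split(".")[0] != CONTRACT_VERSION.split(".")[0]:
        raise ValueError(f"unsupported contract version: {version!r}")
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("prompt must be a non-empty string")
    return GenerateRequest(
        prompt=prompt,
        max_new_tokens=int(payload.get("max_new_tokens", 256)),
        contract_version=version,
    )


def to_payload(message) -> dict:
    """Serialise either message type to a plain dict for the wire."""
    return asdict(message)
```

Rejecting only on a major-version mismatch is one possible policy: it lets either tier add optional fields in a minor version without forcing a lockstep deploy.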
Build the LLM microservice
Build a service that loads the fine-tuned model by registry tag, accepts a prompt, and returns a structured response. The service owns the model lifecycle, GPU operations, batching, and latency budgets, and it exposes a REST or gRPC endpoint.
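The core of such a service can be sketched without any web framework. In the example below, `REGISTRY`, `register`, the tag `"twin:v1"`, and the echo "model" are all hypothetical stand-ins: a real service would pull the artefact from its actual registry and wrap `handle` in a REST or gRPC endpoint.

```python
from typing import Callable, Dict

GenerateFn = Callable[[str, int], str]

# Hypothetical stand-in for a model registry: tag -> loader function.
REGISTRY: Dict[str, Callable[[], GenerateFn]] = {}


def register(tag: str):
    """Register a loader under a registry tag (illustrative helper)."""
    def decorator(loader):
        REGISTRY[tag] = loader
        return loader
    return decorator


@register("twin:v1")
def load_echo_model() -> GenerateFn:
    """Placeholder 'model' that echoes the first tokens of the prompt."""
    def generate(prompt: str, max_new_tokens: int) -> str:
        return " ".join(prompt.split()[:max_new_tokens])
    return generate


class LLMService:
    """Owns the model lifecycle; knows nothing about retrieval."""

    def __init__(self, model_tag: str):
        if model_tag not in REGISTRY:
            raise KeyError(f"unknown model tag: {model_tag}")
        self.model_tag = model_tag
        self._generate = REGISTRY[model_tag]()  # load once at startup

    def handle(self, request: dict) -> dict:
        """Handler body that a REST or gRPC endpoint would wrap."""
        max_new_tokens = int(request.get("max_new_tokens", 256))
        text = self._generate(request["prompt"], max_new_tokens)
        return {"text": text, "model_tag": self.model_tag}
```

Returning the `model_tag` in every response is a design choice made here so callers can tell which fine-tune answered after a promotion.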
Build retrieval and reranking in the business microservice
Implement vector retrieval, optional keyword retrieval, optional cross-encoder reranking, and document hydration. Build the steps so that any one of them can be swapped out on its own.
Uses: Agentic RAG, Modular RAG, Cross-Encoder Reranking, Contextual Retrieval
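The swappable-step idea can be sketched by treating each retriever and the reranker as plain callables. Everything here is a toy assumption: `CORPUS`, `keyword_retriever`, and `length_reranker` stand in for a real vector store, keyword index, and cross-encoder.

```python
from typing import Callable, List, Optional

Doc = dict  # illustrative document shape: {"id": str, "text": str}
Retriever = Callable[[str, int], List[Doc]]
Reranker = Callable[[str, List[Doc]], List[Doc]]

CORPUS: List[Doc] = [
    {"id": "d1", "text": "fine-tuned models are loaded from the registry"},
    {"id": "d2", "text": "rerankers rescore retrieved candidates"},
    {"id": "d3", "text": "the registry stores model artefacts"},
]


def keyword_retriever(query: str, k: int) -> List[Doc]:
    """Toy retriever: rank documents by the number of shared terms."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d["text"].split())), d) for d in CORPUS]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [d for _, d in scored[:k]]


def length_reranker(query: str, docs: List[Doc]) -> List[Doc]:
    """Toy reranker: prefer shorter documents (stands in for a cross-encoder)."""
    return sorted(docs, key=lambda d: len(d["text"]))


def retrieve(query: str, retrievers: List[Retriever],
             reranker: Optional[Reranker] = None, k: int = 5) -> List[Doc]:
    """Merge candidates from every retriever, deduplicate, then optionally rerank."""
    seen, candidates = set(), []
    for retriever in retrievers:
        for doc in retriever(query, k):
            if doc["id"] not in seen:
                seen.add(doc["id"])
                candidates.append(doc)
    if reranker is not None:
        candidates = reranker(query, candidates)
    return candidates[:k]
```

Because `retrieve` only depends on the two callable signatures, adding a second retriever or replacing the reranker does not touch the rest of the business tier.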
Add optional strong reference model for orchestration
For orchestration steps the fine-tuned model does poorly, such as query rewriting, multi-step planning, and complex tool routing, call a frontier model. Track cost per call type so the reference-model spend stays visible.
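Per-call-type cost tracking can be as small as a ledger keyed by the orchestration step. The sketch below is an assumption-laden illustration: `ReferenceModelLedger` is a hypothetical name, and the per-1k-token prices are caller-supplied placeholders, not real provider rates.

```python
from collections import defaultdict
from typing import Dict


class ReferenceModelLedger:
    """Accumulates reference-model spend per orchestration call type."""

    def __init__(self, usd_per_1k_input: float, usd_per_1k_output: float):
        # Prices are placeholders supplied by the caller, not real rates.
        self.usd_per_1k_input = usd_per_1k_input
        self.usd_per_1k_output = usd_per_1k_output
        self._spend: Dict[str, float] = defaultdict(float)
        self._calls: Dict[str, int] = defaultdict(int)

    def record(self, call_type: str, input_tokens: int, output_tokens: int) -> float:
        """Record one reference-model call and return its cost in USD."""
        cost = ((input_tokens / 1000) * self.usd_per_1k_input
                + (output_tokens / 1000) * self.usd_per_1k_output)
        self._spend[call_type] += cost
        self._calls[call_type] += 1
        return cost

    def report(self) -> Dict[str, dict]:
        """Spend and call count per call type, sorted by name."""
        return {
            call_type: {"calls": self._calls[call_type],
                        "usd": round(self._spend[call_type], 6)}
            for call_type in sorted(self._spend)
        }
```

A call site would invoke `ledger.record("query_rewrite", input_tokens, output_tokens)` after each reference-model response, using the token counts the provider reports.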
Assemble the prompt and call the LLM microservice
The business tier builds the final prompt from the retrieved documents, the user query, persona instructions, and any orchestration outputs. It sends that to the LLM microservice. Then it returns the response to the user, optionally after post-processing.
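Prompt assembly in the business tier might look like the following sketch. The section labels (`Context:`, `Question:`, `Answer:`), the numbered-document format, and the `max_context_chars` budget are illustrative assumptions; the real layout should match whatever template the fine-tune was trained on.

```python
from typing import List


def assemble_prompt(query: str, documents: List[str], persona: str = "",
                    max_context_chars: int = 2000) -> str:
    """Build a ready-to-generate prompt from persona, documents, and query."""
    context_parts, used = [], 0
    for i, doc in enumerate(documents, start=1):
        entry = f"[{i}] {doc.strip()}"
        # Stop adding documents once the context budget would be exceeded.
        if used + len(entry) > max_context_chars:
            break
        context_parts.append(entry)
        used += len(entry)

    sections = []
    if persona:
        sections.append(persona.strip())
    if context_parts:
        sections.append("Context:\n" + "\n".join(context_parts))
    sections.append(f"Question: {query.strip()}")
    sections.append("Answer:")
    return "\n\n".join(sections)
```

The returned string is what the business tier would place in the `prompt` field of its request to the LLM microservice, so the LLM tier never sees raw documents or retrieval metadata.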
Operate, monitor, and evolve
Each tier has its own dashboards, service-level objectives, and on-call rotation. Promote a new fine-tune through the registry without redeploying the business tier. Change the retrieval logic without redeploying the model.
Principles
- Two tiers, one contract. The contract is the only coupling.
- The model lifecycle and the orchestration lifecycle move at different speeds. The split lets them.
- Strong reference models earn their cost line by line. Track spend per orchestration step.
- The LLM microservice is portable. Other clients can call it without inheriting your retrieval logic.
Known failure modes (3)
- ✕ Orchestrator as Bottleneck
The business microservice holds the request thread while the LLM microservice generates, so concurrency drops to one in-flight request per worker.
- ✕ Hidden State Coupling
The business tier reaches into the LLM tier's tokenizer state or model internals, so the split breaks on the next model upgrade.
- ✕ Vendor Lock-In
Orchestration is tightly coupled to a single strong-reference-model provider with no fallback, so that provider's outages cascade.
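The first failure mode above can be avoided by awaiting the LLM call instead of blocking on it. The sketch below uses `asyncio` from the standard library; `FakeLLMClient` is a hypothetical stand-in for an async HTTP client calling the LLM microservice, and the sleep simulates generation time.

```python
import asyncio
from typing import List, Tuple


class FakeLLMClient:
    """Hypothetical async client; counts how many calls overlap."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.max_in_flight = 0

    async def generate(self, prompt: str) -> str:
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # simulated generation; yields to the event loop
        self.in_flight -= 1
        return f"answer to: {prompt}"


async def handle_request(client: FakeLLMClient, query: str) -> str:
    """One business-tier request: build a prompt, then await the LLM tier."""
    prompt = f"Question: {query}"
    return await client.generate(prompt)


async def serve_batch(queries: List[str]) -> Tuple[int, List[str]]:
    """Run several requests on one worker and report the peak overlap."""
    client = FakeLLMClient()
    answers = await asyncio.gather(*(handle_request(client, q) for q in queries))
    return client.max_in_flight, list(answers)
```

Running `asyncio.run(serve_batch(["q0", "q1", "q2"]))` on a single worker keeps all three requests in flight at once, because each handler yields while it waits for generation rather than holding a thread.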
Related patterns (10)
- ★★Business + LLM Microservice Split
Split an LLM application into a CPU-bound business microservice (retrieval, prompt assembly, orchestration) and a GPU-bound LLM microservice (only model.generate behind REST), so each tier scales on its own hardware budget.
- ★★FTI LLM Pipeline Split
Decompose an LLM/RAG system into three independently-deployable pipelines — feature, training, inference — communicating only via a feature store and a model registry.
- ★★Agentic RAG
Replace static retrieve-then-generate with autonomous agents that plan, choose sources, retrieve iteratively, reflect, and re-query.
- ★Modular RAG
Decompose RAG into a typed three-layer hierarchy of Module Types, Modules, and Operators so the pipeline (routing, scheduling, fusion, retrieval, post-retrieval, generation) can be rearranged per query rather than running a fixed linear retrieve-then-generate.
- ★★Cross-Encoder Reranking
After cheap bi-encoder or BM25 retrieval, rescore top-N candidates with a cross-encoder that jointly attends over (query, candidate).
- ★Contextual Retrieval
Prepend a short LLM-generated description to each chunk before embedding so the chunk carries its situating context.
- ★★Multi-Model Routing
Send each request to the cheapest model that can handle it well.
- ★★Structured Output
Constrain the model's output to conform to a JSON Schema (or similar typed shape).
- ★Scorer Live Monitoring
Score agent outputs asynchronously in production with non-blocking scorers that observe, alert, and log but do not regenerate the output.
- ★★Cost Observability
Surface per-request, per-user, and per-feature cost and token consumption to operators in near-real-time.
Related compositions (2)
- recipe · abstract shape · Production RAG
Retrieval-grounded generation built to be defensible: hybrid retrieval, reranking, contextualised chunks, citations rendered to the user, and verification before the answer ships.
- recipe · abstract shape · Production LLM Platform
Stand up a production LLM/RAG system whose data pipeline, model pipeline, and inference path scale and deploy independently.
Related methodologies (2)
- FTI Pipeline Architecture★★
Split a machine-learning or LLM system into three separate pipelines, joined only by a feature store and a model registry, so each one can scale, be swapped out, and be owned on its own.
- LLM Twin End-to-End Construction★
Produce a production-grade personalised LLM twin through a repeatable pipeline. The pipeline covers data collection, instruction-dataset generation, supervised fine-tuning, preference alignment, evaluation, deployment, and monitoring.
Sources (2)
LLM Engineer's Handbook
Ch 9 'RAG Inference Pipeline'; Ch 10 'Inference Pipeline Deployment' “Business microservice: contains the advanced RAG logic ... LLM microservice: It loads the fine-tuned LLM twin model from Comet's model registry”
LLM Engineer's Handbook — Summary & Notes (Christian B. B. Houmann)
“Business microservice in FastAPI to handle requests, retrieval, prompt assembly, calling the LLM service, and streaming the result ... LLM microservice using SageMaker and Hugging Face TGI, with a dedicated GPU instance to host the model”
Provenance
- Added to catalog:
- Last updated:
- Verification status: verified