Methodology · LLM-App Engineering · proven · verified

RAG Microservice Inference Pipeline

Also known as: business plus LLM microservice, two-tier RAG serving

Applies to: llm-app, rag-system, agent

Tags: microservice, rag, inference, deployment

Serve a fine-tuned LLM behind two microservices with a clear split. The business microservice runs the advanced retrieval work: query rewriting, retrieval, reranking, and prompt assembly. It may also call a strong reference model, such as GPT-4, for steps the small fine-tune cannot do. The LLM microservice loads the fine-tuned model from the registry and answers a clean, ready-to-generate prompt. The split lets each microservice scale and change on its own. The business layer can move faster than the model. The LLM microservice can be reused by other callers.

Methodology process overview

graph LR
    user[User query] --> biz[Business microservice]
    biz --> rewrite[Query rewriting]
    rewrite --> retrieve[Vector retrieval]
    retrieve --> vdb[(Vector DB)]
    retrieve --> rerank[Cross-encoder rerank]
    rerank --> assemble[Prompt assembly]
    biz -.optional.-> ref[Strong reference model]
    ref -.-> assemble
    assemble --> llm[LLM microservice]
    reg[(Model registry)] --> llm
    llm --> resp[Generated response]
    resp --> biz
    biz --> user
    contract[Inter-service contract] -.-> biz
    contract -.-> llm

Intent. Split LLM serving into a business microservice and an LLM microservice. The business side handles retrieval orchestration, prompt assembly, and an optional strong reference model. The LLM side loads a fine-tune from the registry and answers a clean prompt. Each side then scales and changes on its own.

When to apply. Use this when you deploy a fine-tuned LLM behind production retrieval, especially when the retrieval work is involved (multi-step retrieval, reranking, query rewriting, hybrid search) and the serving path must be both fast and cost-controlled. Do not use it for prototypes where one process is enough, since the split adds operational overhead. Do not apply it when only one client calls the LLM and the business layer is trivial; a single service is fine there.

Example scenario

A SaaS company was building an internal-knowledge copilot for its customer-success team. It deployed its fine-tuned Mistral 7B twin behind exactly this split. The business microservice was written in Python on FastAPI. It owned query rewriting (calling GPT-4 for the harder rewrites), vector retrieval against a Qdrant cluster of 2 million chunked help-centre articles and Slack threads, cross-encoder reranking with a small Cohere reranker, and final prompt assembly. The LLM microservice was written in Go around a vLLM server. It loaded the fine-tuned Mistral from Comet's registry by tag and served structured JSON responses behind a gRPC endpoint.

The split paid off across three separate changes in the first six months. First, the platform team upgraded the LLM microservice from vLLM 0.4 to 0.6 and switched GPUs from A10G to L4, with no change to the business tier. Second, the business team replaced the Cohere reranker with a self-hosted bge-reranker-v2 and adjusted the retrieval window, without redeploying the LLM tier. Third, a new Slack-bot client began calling the LLM microservice directly for a different use case that did not need the customer-success retrieval context. The microservice was portable because the contract was clean.

Cost observability tracked GPT-4 query-rewrite spend apart from inference spend. The team found GPT-4 rewrites were 35% of total cost. After evaluation showed equal quality, they replaced two-thirds of them with a smaller model. The lesson the team kept: the contract was the architecture. Everything else was implementation that could change.

Inputs

Fine-tuned LLM in registry — A versioned model artefact you can load by tag from a model registry such as Comet, MLflow, or SageMaker.
Retrieval substrate — A vector store, an optional keyword index, an optional reranker, and the documents you have indexed.
Optional strong reference model — API access to a frontier model such as GPT-4 or Claude. Use it for orchestration steps the small fine-tune does poorly, such as query rewriting and planning.
Contract between tiers — A spec for the prompt schema the business microservice sends and the response schema the LLM microservice returns.

Outputs

Business microservice — The service that runs the advanced retrieval work: retrieval, reranking, query rewrites, prompt assembly, and optional reference-model calls. It emits a ready-to-generate prompt.
LLM microservice — The service that loads the fine-tuned model from the registry and answers the prompt it receives. It is exposed as a REST or gRPC API.
Inter-service contract — A versioned schema that keeps the two tiers separate.

Steps (6)

Define the inter-service contract
Write down the prompt schema the business tier sends and the response schema the LLM tier returns. This contract is the only coupling. Version it on purpose.
Uses: Business + LLM Microservice Split
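A minimal sketch of what such a contract could look like, using only the Python standard library. The field names (`prompt`, `max_tokens`, `text`, `model_tag`), the `contract_version` field, and the rule of rejecting a different major version are all assumptions made for illustration; the methodology itself only requires that both schemas are written down and versioned.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical version string; bump the major part on breaking changes.
CONTRACT_VERSION = "1.0"


@dataclass(frozen=True)
class GenerateRequest:
    """What the business tier sends: a ready-to-generate prompt."""
    prompt: str
    max_tokens: int = 512
    contract_version: str = CONTRACT_VERSION


@dataclass(frozen=True)
class GenerateResponse:
    """What the LLM tier returns."""
    text: str
    model_tag: str
    contract_version: str = CONTRACT_VERSION


def encode(message) -> str:
    """Serialize either message type to JSON for the wire."""
    return json.dumps(asdict(message))


def decode_request(payload: str) -> GenerateRequest:
    """Parse a request and reject an incompatible major version."""
    data = json.loads(payload)
    sent_major = str(data.get("contract_version", "")).split(".")[0]
    if sent_major != CONTRACT_VERSION.split(".")[0]:
        raise ValueError(
            f"unsupported contract version: {data.get('contract_version')!r}"
        )
    return GenerateRequest(**data)
```

Keeping the version inside the message lets the LLM tier refuse a request it cannot interpret instead of guessing, which is one way to make "version it on purpose" enforceable.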
Build the LLM microservice
Build a service that loads the fine-tuned model by registry tag, takes a prompt, and returns a structured response. It owns the model lifecycle, GPU operations, batching, and latency budgets. It exposes a REST or gRPC endpoint.
Uses: FTI LLM Pipeline Split, Structured Output
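The core of the LLM tier can be sketched without any serving framework. In the sketch below, the registry is a plain dict from tag to loader function and the model is any callable from prompt to text; both are stand-ins, not the API of Comet, MLflow, SageMaker, or any inference engine. The tag `twin:v3` is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A "model" here is any callable from prompt to text. In a real service this
# would wrap the inference engine; that wiring is outside this sketch.
Model = Callable[[str], str]


@dataclass
class LLMService:
    registry: Dict[str, Callable[[], Model]]  # hypothetical: tag -> loader
    model_tag: str

    def __post_init__(self) -> None:
        # Load once at startup so request handling never touches the registry.
        if self.model_tag not in self.registry:
            raise KeyError(f"no model registered under tag {self.model_tag!r}")
        self._model = self.registry[self.model_tag]()

    def handle(self, request: dict) -> dict:
        """Answer one request; the prompt is used exactly as received."""
        prompt = request["prompt"]
        if not isinstance(prompt, str) or not prompt:
            raise ValueError("prompt must be a non-empty string")
        return {"text": self._model(prompt), "model_tag": self.model_tag}


# Usage with a fake model that reports the prompt length.
fake_registry = {"twin:v3": lambda: (lambda p: f"answer to {len(p)} chars")}
service = LLMService(registry=fake_registry, model_tag="twin:v3")
print(service.handle({"prompt": "Summarise the refund policy."}))
```

Note that `handle` does no retrieval and no prompt editing. That restriction is what lets other callers reuse the service.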
Build retrieval and reranking in the business microservice
Implement vector retrieval, optional keyword retrieval, optional cross-encoder reranking, and document hydration. You can swap out any step on its own.
Uses: Agentic RAG, Modular RAG, Cross-Encoder Reranking, Contextual Retrieval
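One way to keep each step swappable is to treat the retriever and the reranker as plain callables with fixed signatures. The sketch below assumes that design. The two toy implementations (word-overlap retrieval, shortest-passage-first reranking) are placeholders for a real vector store and cross-encoder, and every name in it is hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Retriever = Callable[[str, int], List[str]]           # (query, k) -> candidates
Reranker = Callable[[str, Sequence[str]], List[str]]  # (query, docs) -> reordered


def make_keyword_retriever(corpus: Sequence[str]) -> Retriever:
    """Toy stand-in for a vector store: rank by count of shared words."""
    def retrieve(query: str, k: int) -> List[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in ranked if score > 0][:k]
    return retrieve


def shortest_first(query: str, docs: Sequence[str]) -> List[str]:
    """Toy stand-in for a cross-encoder: prefer shorter passages."""
    return sorted(docs, key=len)


def retrieve_context(
    query: str,
    retriever: Retriever,
    reranker: Optional[Reranker] = None,
    k: int = 20,
    top_n: int = 3,
) -> List[str]:
    """Fetch k candidates, optionally rerank them, and keep the top_n."""
    candidates = retriever(query, k)
    if reranker is not None:
        candidates = reranker(query, candidates)
    return candidates[:top_n]
```

Replacing the reranker then means passing a different function to `retrieve_context`; the retriever and the caller stay unchanged.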
Add optional strong reference model for orchestration
For orchestration steps the fine-tuned twin does poorly, such as query rewriting, multi-step planning, and complex tool routing, call a frontier model. Track cost per call type so the reference-model spend stays visible.
Uses: Multi-Model Routing, Cost Observability
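Tracking cost per call type can be as simple as a ledger keyed by a label. This is a sketch under assumed names: the `CostLedger` class, the call-type labels, and the per-token prices in the usage lines are illustrative, not real provider pricing.

```python
from collections import defaultdict
from typing import Dict


class CostLedger:
    """Accumulates spend in USD per call type so each line stays visible."""

    def __init__(self) -> None:
        self._spend: Dict[str, float] = defaultdict(float)

    def record(self, call_type: str, tokens: int, usd_per_1k_tokens: float) -> None:
        """Add the cost of one call to the running total for its type."""
        self._spend[call_type] += tokens / 1000 * usd_per_1k_tokens

    def share(self, call_type: str) -> float:
        """Fraction of total spend attributable to one call type."""
        total = sum(self._spend.values())
        return self._spend.get(call_type, 0.0) / total if total else 0.0


# Usage with made-up prices: one reference-model rewrite, one fine-tune call.
ledger = CostLedger()
ledger.record("query_rewrite", tokens=1000, usd_per_1k_tokens=0.03)
ledger.record("inference", tokens=3000, usd_per_1k_tokens=0.01)
print(f"rewrite share of spend: {ledger.share('query_rewrite'):.0%}")
```

A per-type share like this is the kind of figure that would surface a finding such as the example scenario's rewrite spend.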
Assemble the prompt and call the LLM microservice
The business tier builds the final prompt from the retrieved documents, the user query, persona instructions, and any orchestration outputs. It sends that to the LLM microservice. Then it returns the response to the user, optionally after post-processing.
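Prompt assembly can be sketched as a pure function over its four inputs. The section layout, the `[n]` document numbering, and the `max_context_chars` budget below are assumptions chosen for illustration; the right template depends on how the fine-tune was trained.

```python
from typing import List, Optional, Sequence


def assemble_prompt(
    query: str,
    documents: Sequence[str],
    persona: str,
    rewritten_query: Optional[str] = None,
    max_context_chars: int = 4000,
) -> str:
    """Build the final prompt from persona, retrieved context, and the query."""
    context_parts: List[str] = []
    used = 0
    for i, doc in enumerate(documents, start=1):
        entry = f"[{i}] {doc.strip()}"
        if used + len(entry) > max_context_chars:
            break  # documents arrive ranked, so drop the tail first
        context_parts.append(entry)
        used += len(entry)

    sections = [persona.strip(), "Context:\n" + "\n".join(context_parts)]
    if rewritten_query:
        # Optional orchestration output, e.g. from the reference model.
        sections.append(f"Rewritten query: {rewritten_query}")
    sections.append(f"Question: {query.strip()}")
    return "\n\n".join(sections)
```

Because the function has no side effects, the business tier can unit-test prompt changes without a running LLM microservice.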
Operate, monitor, and evolve
Each tier has its own dashboards, service-level objectives, and on-call rotation. Promote a new fine-tune through the registry without redeploying the business tier. Change the retrieval logic without redeploying the model.
Uses: Scorer Live Monitoring, Cost Observability

Principles

Two tiers, one contract. The contract is the only coupling.
The model lifecycle and the orchestration lifecycle move at different speeds. The split lets them.
Strong reference models earn their cost line by line. Track spend per orchestration step.
The LLM microservice is portable. Other clients can call it without inheriting your retrieval logic.
