RAG Microservice Inference Pipeline
Split LLM serving into a business microservice and an LLM microservice. The business side handles retrieval orchestration, prompt assembly, and an optional strong reference model. The LLM side loads a fine-tune from the registry and answers a clean prompt. Each side then scales and changes on its own.
Description
Serve a fine-tuned LLM behind two microservices with a clear split. The business microservice runs the advanced retrieval work: query rewriting, retrieval, reranking, and prompt assembly. It may also call a strong reference model, such as GPT-4, for steps the small fine-tune cannot do. The LLM microservice loads the fine-tuned model from the registry and answers a clean, ready-to-generate prompt. The split lets each microservice scale and change on its own. The business layer can move faster than the model. The LLM microservice can be reused by other callers.
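As a rough illustration of the split, the sketch below models the two sides as plain Python classes that share a small request/response contract. Everything here is an assumption made for illustration: the names `GenerateRequest`, `GenerateResponse`, `LLMService`, and `BusinessService` are hypothetical, the model is a stub callable rather than a fine-tune loaded from a registry, and the in-process method call stands in for what would be a network call between services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GenerateRequest:
    """Hypothetical contract: the LLM side only ever sees a finished prompt."""
    prompt: str
    max_tokens: int = 256


@dataclass(frozen=True)
class GenerateResponse:
    text: str
    model_version: str


class LLMService:
    """LLM side: owns the model and answers ready-to-generate prompts."""

    def __init__(self, model: Callable[[str, int], str], model_version: str):
        # In a real deployment the model would come from the registry;
        # here it is any callable taking (prompt, max_tokens).
        self._model = model
        self._model_version = model_version

    def generate(self, request: GenerateRequest) -> GenerateResponse:
        text = self._model(request.prompt, request.max_tokens)
        return GenerateResponse(text=text, model_version=self._model_version)


class BusinessService:
    """Business side: owns retrieval and prompt assembly, not the model."""

    def __init__(self, llm: LLMService, retriever: Callable[[str], List[str]]):
        self._llm = llm
        self._retriever = retriever

    def answer(self, question: str) -> GenerateResponse:
        passages = self._retriever(question)
        context = "\n".join(passages)
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        return self._llm.generate(GenerateRequest(prompt=prompt))
```

Because `BusinessService` depends only on the contract, the retrieval logic can change without touching `LLMService`, and another caller could reuse `LLMService` with its own prompts.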
When to apply
Use this when you deploy a fine-tuned LLM behind production retrieval, especially when the retrieval work is involved (multi-step retrieval, reranking, query rewriting, hybrid search) and the serving path must be both fast and cost-controlled. Do not use it for prototypes where one process is enough, since the split adds operational overhead. Also skip it when only one client calls the LLM and the business layer is trivial; a single service is fine there.
What it involves
- Define the inter-service contract
- Build the LLM microservice
- Build retrieval and reranking in the business microservice
- Add an optional strong reference model for orchestration
- Assemble the prompt and call the LLM microservice
- Operate, monitor, and evolve
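The business-side steps above can be sketched as a single request path. This is a toy under stated assumptions, not a production design: retrieval and reranking are plain keyword overlap rather than hybrid search or a learned reranker, `rewriter` is a hypothetical hook where a strong reference model could rewrite the query, `llm_client` is a hypothetical callable standing in for the call to the LLM microservice, and the only monitoring shown is a latency measurement.

```python
import re
import time
from typing import Callable, Dict, List, Optional

Rewriter = Callable[[str], str]   # hypothetical hook for a strong reference model
LLMClient = Callable[[str], str]  # hypothetical stand-in for the LLM microservice call


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Toy first stage: keep up to k documents sharing any term with the query."""
    terms = set(tokenize(query))
    hits = [doc for doc in corpus if terms & set(tokenize(doc))]
    return hits[:k]


def rerank(query: str, docs: List[str]) -> List[str]:
    """Toy reranker: order by number of distinct shared terms, most first."""
    terms = set(tokenize(query))
    return sorted(docs, key=lambda d: len(terms & set(tokenize(d))), reverse=True)


def assemble_prompt(question: str, docs: List[str]) -> str:
    """Build the clean, ready-to-generate prompt the LLM side expects."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def run_pipeline(
    question: str,
    corpus: List[str],
    llm_client: LLMClient,
    rewriter: Optional[Rewriter] = None,
) -> Dict[str, object]:
    start = time.perf_counter()
    # Optional query rewriting; the original question is kept for the prompt.
    query = rewriter(question) if rewriter else question
    docs = rerank(query, retrieve(query, corpus))
    prompt = assemble_prompt(question, docs)
    answer = llm_client(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    # Returning basic metrics alongside the answer gives monitoring a hook.
    return {
        "answer": answer,
        "prompt": prompt,
        "num_docs": len(docs),
        "latency_ms": latency_ms,
    }
```

Keeping each step a separate function means the retrieval or reranking stage can be swapped for a real implementation without changing the prompt format that the LLM microservice receives.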