Methodology · LLM-App Engineering

RAG Microservice Inference Pipeline

Split LLM serving into a business microservice and an LLM microservice. The business side handles retrieval orchestration, prompt assembly, and an optional strong reference model. The LLM side loads a fine-tune from the registry and answers a clean prompt. Each side then scales and changes on its own.

Description

Serve a fine-tuned LLM behind two microservices with a clear split. The business microservice runs the advanced retrieval work: query rewriting, retrieval, reranking, and prompt assembly. It may also call a strong reference model, such as GPT-4, for steps the small fine-tune cannot do. The LLM microservice loads the fine-tuned model from the registry and answers a clean, ready-to-generate prompt. The split lets each microservice scale and change on its own. The business layer can move faster than the model. The LLM microservice can be reused by other callers.
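The boundary between the two services can be sketched as a small request/response contract plus an LLM-side class that loads a model by version. This is a minimal in-process illustration, not a definitive implementation: `GenerateRequest`, `GenerateResponse`, `MODEL_REGISTRY`, `LLMService`, and the toy `echo_model` are all hypothetical names, and a real deployment would put this contract behind a network API and load actual weights from the registry.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class GenerateRequest:
    """What the business microservice sends: a ready-to-generate prompt."""
    prompt: str
    max_tokens: int = 256


@dataclass(frozen=True)
class GenerateResponse:
    """What the LLM microservice returns, tagged with the model version."""
    text: str
    model_version: str


def echo_model(prompt: str, max_tokens: int) -> str:
    # Stand-in for a fine-tuned model (assumption): returns the first
    # `max_tokens` whitespace-separated words of the prompt.
    return " ".join(prompt.split()[:max_tokens])


# Hypothetical registry: version tag -> callable model.
MODEL_REGISTRY: Dict[str, Callable[[str, int], str]] = {"ft-v1": echo_model}


class LLMService:
    """LLM side: loads one model version and answers clean prompts only."""

    def __init__(self, model_version: str) -> None:
        self.model_version = model_version
        self._model = MODEL_REGISTRY[model_version]

    def generate(self, request: GenerateRequest) -> GenerateResponse:
        text = self._model(request.prompt, request.max_tokens)
        return GenerateResponse(text=text, model_version=self.model_version)
```

Because the LLM side knows nothing about retrieval, any caller that can build a `GenerateRequest` can reuse it, and the business layer can change its retrieval logic without redeploying the model.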

When to apply

Use this when you deploy a fine-tuned LLM behind production retrieval. It fits best when the retrieval work is involved (multi-step retrieval, reranking, query rewriting, hybrid search) and the serving path must be both fast and cost-controlled. Do not use it for prototypes where one process is enough, since the split adds operational overhead. Also skip it when only one client calls the LLM and the business layer is trivial; a single service is fine there.

What it involves

Define the inter-service contract
Build the LLM microservice
Build retrieval and reranking in the business microservice
Add an optional strong reference model for orchestration
Assemble the prompt and call the LLM microservice
Operate, monitor, and evolve
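The business-side steps above (rewrite, retrieve, rerank, assemble, call) can be sketched as a short pipeline. This is a toy example under stated assumptions: the in-memory `DOCS` corpus, the keyword-overlap scoring, and every function name are hypothetical stand-ins for a real retriever, reranker, and HTTP client. The `generate` argument represents the call to the LLM microservice.

```python
from typing import Callable, List, Set

# Hypothetical in-memory corpus standing in for a vector or hybrid index.
DOCS: List[str] = [
    "The LLM microservice loads the fine-tuned model from the registry.",
    "The business microservice handles retrieval and prompt assembly.",
    "Reranking orders retrieved passages by relevance to the query.",
]


def _tokens(text: str) -> Set[str]:
    # Lowercase word set with simple punctuation stripped.
    return set(text.lower().replace(".", " ").replace(",", " ").split())


def rewrite_query(query: str) -> str:
    # Placeholder rewrite: normalize case and drop a trailing "?".
    # A strong reference model could perform this step instead.
    return query.lower().strip().rstrip("?")


def retrieve(query: str, docs: List[str]) -> List[str]:
    # Recall step: keep every document sharing at least one word.
    q = _tokens(query)
    return [d for d in docs if q & _tokens(d)]


def rerank(query: str, candidates: List[str], top_k: int = 2) -> List[str]:
    # Precision step: order candidates by word overlap, keep the best.
    q = _tokens(query)
    ranked = sorted(candidates, key=lambda d: len(q & _tokens(d)), reverse=True)
    return ranked[:top_k]


def assemble_prompt(query: str, passages: List[str]) -> str:
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def answer(query: str, generate: Callable[[str], str]) -> str:
    # `generate` is the only link to the LLM microservice.
    rewritten = rewrite_query(query)
    passages = rerank(rewritten, retrieve(rewritten, DOCS))
    return generate(assemble_prompt(rewritten, passages))
```

Passing `generate` in as a callable keeps the pipeline testable without a running model: in tests it can be an identity function that returns the prompt, and in production it would wrap the network call to the LLM microservice.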
