Methodology · LLM-App Engineering · proven · verified

RAG Microservice Inference Pipeline

also known as business plus LLM microservice, two-tier RAG serving

Applies to: llm-app, rag-system, agent

Tags: microservice, rag, inference, deployment

Serve a fine-tuned LLM behind two microservices with a clear split. The business microservice runs the advanced retrieval work: query rewriting, retrieval, reranking, and prompt assembly. It may also call a strong reference model, such as GPT-4, for steps the small fine-tune cannot do. The LLM microservice loads the fine-tuned model from the registry and answers a clean, ready-to-generate prompt. The split lets each microservice scale and change on its own. The business layer can move faster than the model. The LLM microservice can be reused by other callers.

Intent. Split LLM serving into a business microservice and an LLM microservice. The business side handles retrieval orchestration, prompt assembly, and an optional strong reference model. The LLM side loads a fine-tune from the registry and answers a clean prompt. Each side then scales and changes on its own.

When to apply. Use this when you deploy a fine-tuned LLM behind production retrieval. It fits best when the retrieval work is complex (multi-step retrieval, reranking, query rewriting, hybrid search) and the serving path must be both fast and cost-controlled. Do not use it for prototypes where one process is enough, since the split adds operational overhead. Do not use it when only one client calls the LLM and the business layer is trivial. A single service is fine there.

Inputs

  • Fine-tuned LLM in registry: A versioned model artefact you can load by tag from a model registry such as Comet, MLflow, or SageMaker.
  • Retrieval substrate: A vector store, an optional keyword index, an optional reranker, and the documents you have indexed.
  • Optional strong reference model: API access to a frontier model such as GPT-4 or Claude. Use it for orchestration steps the small fine-tune does poorly, such as query rewriting and planning.
  • Contract between tiers: A spec for the prompt schema the business microservice sends and the response schema the LLM microservice returns.

Outputs

  • Business microservice: The service that runs the advanced retrieval work: retrieval, reranking, query rewrites, prompt assembly, and optional reference-model calls. It emits a ready-to-generate prompt.
  • LLM microservice: The service that loads the fine-tuned model from the registry and answers the prompt it receives. It is exposed as a REST API.
  • Inter-service contract: A versioned schema that keeps the two tiers separate.

Steps (6)

  1. Define the inter-service contract

    Write down the prompt schema the business tier sends and the response schema the LLM tier returns. This contract is the only coupling. Version it on purpose.

    uses: Business + LLM Microservice Split
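
A minimal sketch of what such a contract could look like, using only the standard library. The field names (`max_new_tokens`, `temperature`, `model_tag`), the `CONTRACT_VERSION` value, and the major-version check are all illustrative assumptions, not part of the methodology itself.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical version string; bump the major part on breaking changes.
CONTRACT_VERSION = "1.0"


@dataclass(frozen=True)
class PromptRequest:
    """What the business tier sends (illustrative fields)."""
    prompt: str
    max_new_tokens: int = 256
    temperature: float = 0.0
    contract_version: str = CONTRACT_VERSION


@dataclass(frozen=True)
class GenerationResponse:
    """What the LLM tier returns (illustrative fields)."""
    text: str
    model_tag: str
    contract_version: str = CONTRACT_VERSION


def encode(message) -> str:
    """Serialise either contract message to JSON."""
    return json.dumps(asdict(message))


def decode_request(payload: str) -> PromptRequest:
    """Parse a request and reject an incompatible major version."""
    data = json.loads(payload)
    major = data.get("contract_version", "").split(".")[0]
    if major != CONTRACT_VERSION.split(".")[0]:
        raise ValueError(
            f"unsupported contract version: {data.get('contract_version')!r}"
        )
    return PromptRequest(**data)
```

Carrying the version inside every message lets either tier detect a mismatch at the boundary instead of failing somewhere inside generation.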

  2. Build the LLM microservice

    Build a service that loads the fine-tuned model by registry tag, accepts a prompt, and returns a structured response. The service owns the model lifecycle, GPU operations, batching, and latency budgets. It exposes a REST or gRPC endpoint.

    uses: FTI LLM Pipeline Split, Structured Output
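
The shape of the LLM tier can be sketched without any serving framework. Everything below is a stand-in: `FAKE_REGISTRY`, the tag `my-finetune:v3`, and the stub generator are hypothetical, and a real service would call its registry client and wrap `handle` in a REST or gRPC handler.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a model registry: tag -> loader function.
# A real service would fetch the artefact from its registry here.
FAKE_REGISTRY: Dict[str, Callable[[], Callable[[str], str]]] = {
    "my-finetune:v3": lambda: (lambda prompt: f"[v3 answer to {len(prompt)} chars]"),
}


class LLMService:
    """Loads one model by tag at startup and answers ready-made prompts."""

    def __init__(self, model_tag: str):
        if model_tag not in FAKE_REGISTRY:
            raise KeyError(f"unknown model tag: {model_tag}")
        self.model_tag = model_tag
        # Load once; the loaded model is reused across requests.
        self._generate = FAKE_REGISTRY[model_tag]()

    def handle(self, request: dict) -> dict:
        """Validate the request body and return a structured response."""
        prompt = request.get("prompt")
        if not isinstance(prompt, str) or not prompt:
            return {"status": 400, "error": "prompt must be a non-empty string"}
        return {
            "status": 200,
            "text": self._generate(prompt),
            "model_tag": self.model_tag,
        }
```

Note that the service does no retrieval and no prompt templating. It only validates, generates, and reports which model answered.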

  3. Build retrieval and reranking in the business microservice

    Implement vector retrieval, optional keyword retrieval, optional cross-encoder reranking, and document hydration. Keep each stage independent so you can swap out any one on its own.

    uses: Agentic RAG, Modular RAG, Cross-Encoder Reranking, Contextual Retrieval
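
One way to keep each stage swappable is to treat retrievers and rerankers as plain callables. The sketch below is a toy under stated assumptions: `CORPUS` is an in-memory list, `keyword_retriever` uses term overlap in place of a vector store or keyword index, and `overlap_reranker` stands in for a cross-encoder.

```python
import re
from typing import Callable, List, Optional, Sequence

Retriever = Callable[[str, int], List[str]]
Reranker = Callable[[str, Sequence[str]], List[str]]

# Toy in-memory corpus (hypothetical documents).
CORPUS = [
    "The LLM microservice loads the fine-tuned model from the registry.",
    "The business microservice assembles the prompt.",
    "Rerankers reorder retrieved documents.",
]


def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def keyword_retriever(query: str, k: int) -> List[str]:
    """Return up to k documents that share at least one term with the query."""
    q = tokens(query)
    ranked = sorted(CORPUS, key=lambda d: len(q & tokens(d)), reverse=True)
    return [d for d in ranked if q & tokens(d)][:k]


def overlap_reranker(query: str, docs: Sequence[str]) -> List[str]:
    """Toy reranker: score by the fraction of document terms found in the query."""
    q = tokens(query)
    return sorted(
        docs,
        key=lambda d: len(q & tokens(d)) / max(len(tokens(d)), 1),
        reverse=True,
    )


def retrieve(
    query: str,
    retrievers: Sequence[Retriever],
    reranker: Optional[Reranker] = None,
    k: int = 3,
) -> List[str]:
    """Merge candidates from every retriever, dedupe, then optionally rerank."""
    seen, candidates = set(), []
    for retriever in retrievers:
        for doc in retriever(query, k):
            if doc not in seen:
                seen.add(doc)
                candidates.append(doc)
    if reranker is not None:
        candidates = reranker(query, candidates)
    return candidates[:k]
```

Because `retrieve` only depends on the two callable signatures, replacing the reranker or adding a second retriever is a change at the call site, not inside the pipeline.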

  4. Add optional strong reference model for orchestration

    For orchestration steps the fine-tuned model does poorly, such as query rewriting, multi-step planning, and complex tool routing, call a frontier model. Track cost per call type so the reference-model spend stays visible.

    uses: Multi-Model Routing, Cost Observability
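
Tracking cost per call type can be as simple as a ledger keyed by the orchestration step. This is a sketch only: the prices in `PRICE_PER_1K` are made-up placeholders, and the call-type names are hypothetical labels you would choose yourself.

```python
from collections import defaultdict

# Placeholder prices in USD per 1,000 tokens (assumed values, not real rates).
PRICE_PER_1K = {"input": 0.01, "output": 0.03}


class CostLedger:
    """Accumulates reference-model spend per orchestration call type."""

    def __init__(self):
        self.spend = defaultdict(float)
        self.calls = defaultdict(int)

    def record(self, call_type: str, input_tokens: int, output_tokens: int) -> float:
        """Add one call to the ledger and return its cost in USD."""
        cost = (
            input_tokens / 1000 * PRICE_PER_1K["input"]
            + output_tokens / 1000 * PRICE_PER_1K["output"]
        )
        self.spend[call_type] += cost
        self.calls[call_type] += 1
        return cost

    def report(self) -> dict:
        """Summarise calls and spend for each call type."""
        return {
            call_type: {
                "calls": self.calls[call_type],
                "usd": round(self.spend[call_type], 6),
            }
            for call_type in self.spend
        }
```

A typical use would be to call `ledger.record("query_rewrite", ...)` after each reference-model call and export `ledger.report()` to whatever dashboard the business tier already has.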

  5. Assemble the prompt and call the LLM microservice

    The business tier builds the final prompt from the retrieved documents, the user query, persona instructions, and any orchestration outputs. It sends that to the LLM microservice. Then it returns the response to the user, optionally after post-processing.
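
The assembly step might look like the following sketch. The prompt layout, the character budget `max_context_chars`, and the `call_llm` callable (standing in for the HTTP or gRPC client to the LLM microservice) are assumptions made for illustration.

```python
from typing import Callable, Sequence


def assemble_prompt(
    query: str,
    documents: Sequence[str],
    persona: str,
    max_context_chars: int = 2000,
) -> str:
    """Build the final prompt, adding documents until the budget is used up."""
    context_parts, used = [], 0
    for i, doc in enumerate(documents, start=1):
        entry = f"[{i}] {doc}"
        if used + len(entry) > max_context_chars:
            break  # stop before exceeding the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return f"{persona}\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


def answer(
    query: str,
    documents: Sequence[str],
    persona: str,
    call_llm: Callable[[dict], dict],
) -> str:
    """Assemble the prompt, call the LLM tier, and post-process the reply."""
    prompt = assemble_prompt(query, documents, persona)
    response = call_llm({"prompt": prompt})
    return response["text"].strip()
```

Passing `call_llm` in as a parameter keeps the business tier testable with a stub and leaves the transport choice to the caller.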

  6. Operate, monitor, and evolve

    Each tier has its own dashboards, service-level objectives, and on-call rotation. Promote a new fine-tune through the registry without redeploying the business tier. Change the retrieval logic without redeploying the model.

    uses: Scorer Live Monitoring, Cost Observability
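
Promotion through the registry can be illustrated with an alias that the LLM tier resolves at load time. The `ModelRegistry` class, the `production` alias, and the `reload` hook below are hypothetical; real registries expose their own promotion mechanisms, which this sketch does not try to reproduce.

```python
from typing import Callable, Dict, Tuple

Model = Callable[[str], str]


class ModelRegistry:
    """Toy registry: versions are stored by name, aliases point at versions."""

    def __init__(self):
        self._versions: Dict[str, Model] = {}
        self._aliases: Dict[str, str] = {}

    def register(self, version: str, model: Model) -> None:
        self._versions[version] = model

    def promote(self, alias: str, version: str) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown version: {version}")
        self._aliases[alias] = version

    def resolve(self, alias: str) -> Tuple[str, Model]:
        version = self._aliases[alias]
        return version, self._versions[version]


class LLMTier:
    """Resolves the alias on load; the business tier never sees versions."""

    def __init__(self, registry: ModelRegistry, alias: str = "production"):
        self.registry, self.alias = registry, alias
        self.reload()

    def reload(self) -> None:
        self.version, self._model = self.registry.resolve(self.alias)

    def generate(self, prompt: str) -> dict:
        return {"text": self._model(prompt), "model_tag": self.version}


def business_tier(query: str, llm: LLMTier) -> str:
    """Unchanged across model promotions: it only knows the contract."""
    return llm.generate(f"Question: {query}")["text"]
```

Promoting `v2` and reloading the LLM tier changes the answers while `business_tier` stays byte-for-byte the same, which is the decoupling this step describes.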

Principles

  • Two tiers, one contract. The contract is the only coupling.
  • The model lifecycle and the orchestration lifecycle move at different speeds. The split lets them.
  • Strong reference models earn their cost line by line. Track spend per orchestration step.
  • The LLM microservice is portable. Other clients can call it without inheriting your retrieval logic.

Provenance

  • Added to catalog:
  • Last updated:
  • Verification status: verified