Business + LLM Microservice Split
also known as CPU/GPU Tier Split, Inference-Service Decoupling
Split an LLM application into a CPU-bound business microservice (retrieval, prompt assembly, orchestration) and a GPU-bound LLM microservice (only model.generate behind REST), so each tier scales on its own hardware budget.
Context
A production LLM application bundles retrieval, prompt assembly, post-processing, business logic, and the LLM inference call into a single service. The service autoscales as a unit. The LLM call needs GPU; the rest does not. The unified deployment pays GPU prices to autoscale the CPU-only parts.
Problem
Bundled deployments waste expensive hardware. As traffic grows, the autoscaler adds whole GPU pods to absorb CPU-bound spikes in prompt assembly and retrieval, while genuine GPU-bound spikes slow down the entire service, including the CPU-only work. Maintenance is coupled: bumping the model means redeploying the business logic; bumping the retrieval code means restarting GPU pods. The single service handles every concern in one deployable, but it loses on cost, scaling, and deploy velocity.
Forces
- LLM inference needs GPU; retrieval and prompt assembly do not.
- The scaling axes (requests per second, token throughput) are independent and have different load shapes.
- Coupled deploys slow both teams; decoupled deploys let model and business iterate independently.
- REST boundary adds one network hop per request — a measurable latency cost.
Example
A RAG support platform deploys a CPU FastAPI business service handling retrieval (Qdrant), prompt assembly, and tenant routing, plus a separate GPU LLM service hosting a fine-tuned model behind TGI. Traffic spike: CPU pods scale 5x for retrieval load, GPU pods scale 2x for inference load. A model swap (Llama-3-8B to Llama-3-70B) is a deploy in the LLM service only; the business service is unchanged.
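The independent scaling in this example can be illustrated with a small calculation. The sketch below applies a proportional scaling rule (the same shape as the rule documented for the Kubernetes Horizontal Pod Autoscaler) to each tier separately. The replica counts, metric values, and targets are hypothetical numbers chosen only to reproduce the 5x and 2x figures above; they are not measurements from a real system.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Proportional rule: scale the replica count by observed load over target load."""
    return math.ceil(current_replicas * current_metric / target_metric)


# CPU business tier, scaled on requests per second per pod (hypothetical values).
cpu_replicas = desired_replicas(current_replicas=2, current_metric=500.0, target_metric=100.0)

# GPU LLM tier, scaled on generated tokens per second per pod (hypothetical values).
gpu_replicas = desired_replicas(current_replicas=3, current_metric=4000.0, target_metric=2000.0)

print(f"CPU pods: 2 -> {cpu_replicas}, GPU pods: 3 -> {gpu_replicas}")
```

Because each tier reads its own metric, the retrieval spike adds only CPU pods and the inference spike adds only GPU pods.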
Solution
Therefore:
Define the LLM microservice's contract as a single REST endpoint: generate(prompt, params) → completion. Run it on GPU, autoscaling on token-throughput metrics. Run everything else (retrieval, prompt templating, business logic, orchestration, output post-processing) in the CPU business service, which calls the LLM service over REST. Bound the LLM service's tail latency with batching, queueing, and admission control. The business service can use multiple LLM service instances (different models, different providers) behind the same contract.
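A minimal sketch of the generate(prompt, params) → completion contract as seen from the business service follows. Everything named here is an assumption made for illustration: the `GenerateParams` fields, the JSON field names `prompt`, `params`, and `completion`, and the classes `RestLLMService` and `StubLLMService`. A real inference server such as TGI defines its own request schema, so the REST client would need to be adapted to it.

```python
import json
import urllib.request
from dataclasses import asdict, dataclass
from typing import Protocol


@dataclass
class GenerateParams:
    # Hypothetical sampling parameters; a real service defines its own set.
    max_new_tokens: int = 256
    temperature: float = 0.2


class LLMService(Protocol):
    """The whole contract: one call, prompt and params in, completion out."""

    def generate(self, prompt: str, params: GenerateParams) -> str: ...


class RestLLMService:
    """Calls the GPU tier over HTTP. URL and JSON field names are assumptions."""

    def __init__(self, url: str, timeout_s: float = 30.0) -> None:
        self.url = url
        self.timeout_s = timeout_s

    def generate(self, prompt: str, params: GenerateParams) -> str:
        body = json.dumps({"prompt": prompt, "params": asdict(params)}).encode("utf-8")
        request = urllib.request.Request(
            self.url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=self.timeout_s) as response:
            return json.loads(response.read())["completion"]


class StubLLMService:
    """In-process stand-in so the business logic can be exercised without a GPU."""

    def __init__(self) -> None:
        self.last_prompt = ""

    def generate(self, prompt: str, params: GenerateParams) -> str:
        self.last_prompt = prompt
        return " stub completion "


def answer(question: str, passages: list[str], llm: LLMService) -> str:
    """CPU-side work: prompt assembly before the call, post-processing after it."""
    context = "\n".join(f"- {passage}" for passage in passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    completion = llm.generate(prompt, GenerateParams())
    return completion.strip()


stub = StubLLMService()
result = answer("How do I reset my password?", ["Use the settings page."], stub)
print(result)
```

Since `answer` depends only on the `LLMService` protocol, swapping the model or the provider means passing a different implementation; the business code does not change.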
What this pattern forbids. An LLM application must not bundle GPU inference with CPU business logic in one service when scaling and deploy cadence diverge; the LLM call lives behind its own service contract.
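The admission control mentioned in the Solution can be sketched as a cap on in-flight requests inside the LLM service. This is one simple way to bound tail latency, not the only one, and it leaves out batching and queueing. The names `AdmissionController`, `Overloaded`, and `handle_generate` are hypothetical, and `run_model` stands in for the actual model call.

```python
import threading
from typing import Callable


class Overloaded(Exception):
    """Raised when the service is at capacity and sheds the request."""


class AdmissionController:
    """Admit at most max_in_flight concurrent requests; refuse the rest at once."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_admit(self) -> bool:
        # Non-blocking acquire: returns False immediately when no slot is free.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def handle_generate(
    prompt: str,
    controller: AdmissionController,
    run_model: Callable[[str], str],
) -> str:
    """Run the model only if a slot is free; always return the slot afterwards."""
    if not controller.try_admit():
        raise Overloaded("LLM service at capacity; retry later")
    try:
        return run_model(prompt)
    finally:
        controller.release()


controller = AdmissionController(max_in_flight=1)
completion = handle_generate("hello", controller, lambda prompt: prompt.upper())
print(completion)
```

Rejecting early keeps waiting time bounded for the requests that are admitted. An HTTP layer in front of this would typically translate `Overloaded` into a 429 or 503 response so the business service can retry or fall back to another instance.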
The smaller patterns that complete this one —
- uses Rate Limiting ★★: Cap the number of requests, tokens, or tool calls per user (or session) within a time window.
And the patterns that stand alongside it, or against it —
- composes-with FTI LLM Pipeline Split ★★: Decompose an LLM/RAG system into three independently deployable pipelines (feature, training, inference) communicating only via a feature store and a model registry.
- complements Agent Adapter ★★: An interface layer connecting an agent's tool-calling protocol to heterogeneous external tools, normalizing their schemas into one the agent expects.
- complements Augmented LLM ★★: Build the foundational agent block as an LLM augmented with retrieval, tools, and memory that the model actively chooses to use, rather than a bare-model call.
- complements Prompt Caching ★★: Order prompts so the unchanging prefix can be cached by the provider, cutting per-call cost and latency.