Rate Limiting
also known as Throttling, Quota Enforcement
Cap the number of requests, tokens, or tool calls per user (or session) within a time window.
This pattern helps complete certain larger patterns:
- used-by Agent Middleware Chain ★: Wrap every model call, tool call, and memory access in a composable pre/execute/post interceptor pipeline so cross-cutting concerns attach without touching agent or orchestrator code.
- used-by Business + LLM Microservice Split ★★: Split an LLM application into a CPU-bound business microservice (retrieval, prompt assembly, orchestration) and a GPU-bound LLM microservice (only model.generate behind REST), so each tier scales on its own hardware budget.
Context
A team runs a multi-tenant agent product where many users share the same backend resources — token budgets with model providers, tool API quotas, compute capacity. Any one of those users can, accidentally or maliciously, send much more traffic than the operator priced for: a runaway script, a compromised account, or simply a single power user opening hundreds of concurrent sessions.
Problem
Without per-identity limits, a single caller can drain the month's token budget in a few hours, hit downstream provider rate limits and starve every other user, or simply run up an unbounded bill the operator did not authorise. Imposing one global cap is too blunt — it punishes everyone for one bad actor — and trusting users to behave reasonably has never worked at scale. The team is forced to choose between generous limits that hurt cost and tight limits that hurt legitimate users.
Forces
- Generous limits hurt cost; tight limits hurt UX.
- Per-tier limits add complexity.
- Distributed counters need coordination.
Example
A coding assistant ships a free tier, and within a week one signed-up account opens 400 concurrent agent loops, draining the month's token budget in two hours. The team adds per-identity token-bucket counters at three horizons (per minute, per hour, per day), enforced both at the API gateway and inside the agent loop itself. Over-budget callers get a clear 429 that names which window tripped and when it resets. The month's bill is no longer one hostile user away from blowing up.
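The "clear 429" in this example could look roughly like the sketch below. The function name and the JSON field names (`window`, `resets_at`, `retry_after_seconds`) are hypothetical and chosen only for illustration; `Retry-After` is the standard HTTP header, and `resets_at` is assumed to be a Unix timestamp in seconds.

```python
import json
import math


def limit_exceeded_response(window: str, resets_at: float, now: float):
    """Build an HTTP 429 reply that names the tripped window and its reset time.

    `window` is a label such as "per_hour"; `resets_at` and `now` are
    Unix timestamps in seconds. All field names are illustrative.
    """
    # Round up so the client never retries a fraction of a second too early.
    retry_after = max(0, math.ceil(resets_at - now))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),  # standard header, value in seconds
    }
    body = json.dumps({
        "error": "rate_limited",
        "window": window,                  # which horizon tripped
        "resets_at": resets_at,            # when the caller may try again
        "retry_after_seconds": retry_after,
    })
    return 429, headers, body
```

Returning the window name alongside the reset time lets a client distinguish a short per-minute pause from a per-day cap that calls for an upgrade or a support request.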
Solution
Therefore:
Define limits per identity at multiple horizons (per minute, per hour, per day). Enforce them with token-bucket or sliding-window counters, both at the API gateway and inside the agent loop. When a limit trips, tell the user clearly which limit it was and when it resets.
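A minimal in-process sketch of per-identity token buckets at several horizons follows. The class names, the `limits` mapping, and the injectable `clock` are assumptions made for illustration, and a multi-node deployment would need a shared store rather than a local dictionary.

```python
import time


class TokenBucket:
    """One bucket: holds up to `capacity` tokens, refilled evenly over `window_seconds`."""

    def __init__(self, capacity, window_seconds, now):
        self.capacity = capacity
        self.rate = capacity / window_seconds  # tokens regained per second
        self.tokens = capacity                 # start full
        self.updated = now

    def refill(self, now):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def seconds_until(self, cost):
        """Seconds until `cost` tokens are available (0 if they already are)."""
        return max(0.0, (cost - self.tokens) / self.rate)


class MultiHorizonLimiter:
    """Per-identity buckets, one per horizon; a request must fit in all of them."""

    def __init__(self, limits, clock=time.monotonic):
        self.limits = limits    # e.g. {"per_minute": (60, 60), "per_hour": (1000, 3600)}
        self.clock = clock      # injectable so tests can control time
        self.buckets = {}       # identity -> {horizon name: TokenBucket}

    def check(self, identity, cost=1.0):
        """Return (allowed, tripped_horizon, retry_after_seconds)."""
        now = self.clock()
        if identity not in self.buckets:
            self.buckets[identity] = {
                name: TokenBucket(cap, secs, now)
                for name, (cap, secs) in self.limits.items()
            }
        buckets = self.buckets[identity]
        for bucket in buckets.values():
            bucket.refill(now)
        # Check every horizon before spending, so a rejected request costs nothing.
        for name, bucket in buckets.items():
            if bucket.tokens < cost:
                return False, name, bucket.seconds_until(cost)
        for bucket in buckets.values():
            bucket.tokens -= cost
        return True, None, 0.0
```

The `cost` argument lets the same limiter count requests, model tokens, or tool calls. The returned horizon name and wait time are what a gateway or agent loop would surface to the caller when it rejects the request.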
What this pattern forbids. Requests beyond the limit are rejected or queued; no code path may bypass the limiter.
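One way to make the limiter hard to bypass is to expose tools only through a wrapper that consults it first. The sketch below assumes a hypothetical `rate_limited` decorator and a simple sliding-window log; it rejects over-limit calls and does not implement queueing.

```python
import time
from collections import defaultdict, deque


class RateLimitExceeded(Exception):
    """Raised when an identity has used up its calls for the current window."""


class SlidingWindowLimiter:
    """Sliding-window log: keeps the timestamps of accepted calls per identity."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock                # injectable so tests can control time
        self.calls = defaultdict(deque)   # identity -> accepted-call timestamps

    def acquire(self, identity):
        now = self.clock()
        stamps = self.calls[identity]
        # Forget calls that have slid out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_calls:
            raise RateLimitExceeded(
                f"{identity}: more than {self.max_calls} calls "
                f"in {self.window_seconds}s"
            )
        stamps.append(now)


def rate_limited(limiter):
    """Decorator: the tool body runs only after the limiter admits the caller."""
    def decorate(tool):
        # functools.wraps is deliberately not used here: it would attach the
        # raw function as `__wrapped__`, leaving an unguarded handle to the tool.
        def guarded(identity, *args, **kwargs):
            limiter.acquire(identity)     # raises before the tool body runs
            return tool(identity, *args, **kwargs)
        return guarded
    return decorate
```

Registering only the decorated callables with the agent means the undecorated function is never reachable by name from the tool registry, which is the property the rule above asks for.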
And the patterns that stand alongside it, or against it:
- complements Circuit Breaker ★★: Stop calling a failing dependency for a cooldown period after error rates exceed a threshold.
- complements Cost Gating ★★: Block actions whose expected cost exceeds a threshold without explicit user (or operator) acknowledgement.
- complements Event-Driven Agent ★★: Trigger the agent on external events (webhooks, message queues, file changes) instead of user requests or schedules.
- complements Kill Switch ★: Provide an out-of-band control plane to halt running agent instances without redeploy.
- complements Infrastructure Burst Bottleneck (Agent Scale-Out) ✕ (anti-pattern): Deploy agents whose scale-out behavior triggers sudden data-and-compute bursts that on-prem or under-provisioned cloud infrastructure cannot absorb; agents work at small scale and freeze in production.
- complements Naive Retry Without Backoff ✕ (anti-pattern): Retry failed model or tool calls immediately, amplifying load on systems that are already failing.
- complements Crawler Dispatcher ★★: Route each incoming URL to a domain-specific crawler through a central dispatcher mapping URL patterns to registered crawler classes.