Realtime API When Batchable
also known as Synchronous API for Batch Workload, Premium API for Async Work
Anti-pattern: use the realtime/synchronous model API for workloads whose latency budget would permit batching, paying 2–10× the unit cost for no user-visible benefit.
Context
A backend job processes documents, generates embeddings, summarizes records, or runs nightly analyses. The user sees the result hours later — no human is waiting on each call. The team uses the realtime synchronous API because it was the first one their SDK exposed.
Problem
Realtime API pricing is 2–10× the batch tier on every major provider. For workloads that could tolerate a latency of 1h or 24h, this is pure overspend. The team is often unaware that the batch API exists, or rejected it early as 'complex'. The overspend is hard to spot on the bill: it appears as an unremarkable flat rate of '$N per million tokens', where the batch tier would have charged half of $N per million tokens.
Forces
- Realtime is the default API in most SDKs.
- Batch APIs require restructuring the job to submit-and-poll.
- Engineers default to the API they know rather than the one that matches the latency budget.
Example
A nightly job re-embeds 2M product descriptions through the realtime embeddings endpoint, at a cost of $4k/month. The same work via the batch endpoint with a 24h SLA costs $2k/month. The team only discovers this when a cost review asks why embedding cost grew with the catalog.
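The arithmetic behind this example can be sketched as follows. The function name `batch_cost` is hypothetical, and the 50% discount is an assumption taken from the figures in the example, not a quoted provider price.

```python
def batch_cost(realtime_cost: float, batch_discount: float) -> float:
    """Cost of the same workload on the batch tier.

    batch_discount is the fraction taken off the realtime price
    (0.5 means the batch tier is half price). Assumed, not quoted.
    """
    if not 0.0 <= batch_discount < 1.0:
        raise ValueError("batch_discount must be in [0, 1)")
    return realtime_cost * (1.0 - batch_discount)


realtime_monthly = 4000.0                          # $4k/month from the example
batch_monthly = batch_cost(realtime_monthly, 0.5)  # 2000.0
annual_overspend = (realtime_monthly - batch_monthly) * 12  # 24000.0
```

Under these assumptions the nightly job overspends by $2k each month, or $24k over a year, and the gap widens as the catalog grows.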
Solution
Therefore:
Identify model calls whose results are consumed asynchronously. Submit them via the provider's batch API (50% cheaper at OpenAI, similar at Anthropic). Poll for completion, or receive a webhook when the batch finishes. Reserve realtime for genuinely user-facing or sub-minute-latency workloads. Track 'realtime calls without realtime latency requirement' as a metric in cost-observability.
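A minimal sketch of the submit-and-poll restructuring is shown here. The `BatchClient` interface, its status strings, and `FakeBatchClient` are hypothetical stand-ins invented for illustration; real provider SDKs expose different names and typically require uploading a request file first.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Protocol


class BatchClient(Protocol):
    """Hypothetical minimal batch interface; not any real SDK's API."""

    def submit(self, requests: list[dict]) -> str: ...
    def status(self, batch_id: str) -> str: ...  # "in_progress", "completed", "failed"
    def results(self, batch_id: str) -> list[dict]: ...


def run_batch(
    client: BatchClient,
    requests: list[dict],
    poll_interval_s: float = 60.0,
    timeout_s: float = 24 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Submit all requests as one batch, then poll until it finishes."""
    batch_id = client.submit(requests)
    deadline = time.monotonic() + timeout_s
    while True:
        state = client.status(batch_id)
        if state == "completed":
            return client.results(batch_id)
        if state == "failed":
            raise RuntimeError(f"batch {batch_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch {batch_id} exceeded {timeout_s}s")
        sleep(poll_interval_s)


@dataclass
class FakeBatchClient:
    """In-memory stand-in that completes after a fixed number of polls."""

    polls_until_done: int = 2
    jobs: dict = field(default_factory=dict)
    polls: int = 0

    def submit(self, requests: list[dict]) -> str:
        batch_id = f"batch-{len(self.jobs) + 1}"
        self.jobs[batch_id] = list(requests)
        return batch_id

    def status(self, batch_id: str) -> str:
        self.polls += 1
        return "completed" if self.polls > self.polls_until_done else "in_progress"

    def results(self, batch_id: str) -> list[dict]:
        # Uppercasing stands in for the model's output.
        return [{"id": r["id"], "output": r["text"].upper()} for r in self.jobs[batch_id]]
```

The job hands over the whole workload at once and sleeps between status checks, so no worker is held open per request. Injecting `sleep` keeps the loop testable without waiting for real time to pass.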
What this pattern forbids. No useful constraint; the missing constraint is latency-budget-aware API selection.
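Latency-budget-aware API selection, together with the metric for realtime calls that had no realtime requirement, might look like the sketch below. The names `choose_api`, `record_call`, and the metric keys are hypothetical, and the 24h batch turnaround is an assumption; substitute the completion window the provider actually offers.

```python
from collections import Counter
from datetime import timedelta

# Assumed worst-case batch turnaround; not a quoted provider guarantee.
BATCH_TURNAROUND = timedelta(hours=24)

metrics: Counter = Counter()


def choose_api(latency_budget: timedelta,
               batch_turnaround: timedelta = BATCH_TURNAROUND) -> str:
    """Pick the cheapest tier whose turnaround fits the latency budget."""
    return "batch" if latency_budget >= batch_turnaround else "realtime"


def record_call(api_used: str, latency_budget: timedelta) -> None:
    """Count calls per tier and flag realtime calls that could have been batched."""
    metrics[f"calls.{api_used}"] += 1
    if api_used == "realtime" and choose_api(latency_budget) == "batch":
        metrics["calls.realtime_without_realtime_requirement"] += 1
```

A chat reply with a two-second budget resolves to `"realtime"`, while a nightly job with a 36-hour budget resolves to `"batch"`. If that nightly job is nonetheless recorded as a realtime call, the flagged counter increments and gives the cost review a number to watch.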
And the patterns that stand alongside it, or against it —
- complements Cost Observability ★★: Surface per-request, per-user, and per-feature cost and token consumption to operators in near-real-time.
- complements Cost Gating ★★: Block actions whose expected cost exceeds a threshold without explicit user (or operator) acknowledgement.
- complements Top-Tier Model For Everything (Cost) ✕: Anti-pattern: route every request through the highest-tier model regardless of difficulty, treating cost as a model-choice problem instead of a routing one.
- complements Prompt Caching ★★: Order prompts so the unchanging prefix can be cached by the provider, cutting per-call cost and latency.
- complements Tool Result Caching ★★: Cache the result of expensive deterministic tool calls keyed by their arguments so repeat calls within a session return immediately.