Realtime API When Batchable

also known as Synchronous API for Batch Workload, Premium API for Async Work

Anti-pattern: use the realtime/synchronous model API for workloads whose latency budget would permit batching, paying 2–10× the unit cost for no user-visible benefit.

Context

A backend job processes documents, generates embeddings, summarizes records, or runs nightly analyses. The user sees the result hours later — no human is waiting on each call. The team uses the realtime synchronous API because it was the first one their SDK exposed.

Problem

Realtime API pricing is 2–10× the batch tier on every major provider. For workloads that could tolerate 1 h or 24 h of latency, this is pure overspend. The team is often unaware that the batch API exists, or rejected it early as 'complex'. The overspend is hard to spot because it never appears as a spike: the bill simply shows '$N per million tokens' where it could show 'half of $N per million tokens'.

Forces

Realtime is the default API in most SDKs.
Batch APIs require restructuring the job to submit-and-poll.
Engineers default to the API they know rather than the one that matches the latency budget.
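
The submit-and-poll restructuring mentioned above can be sketched as follows. This is a minimal illustration, not any provider's actual SDK: `FakeBatchClient`, its `submit`/`status`/`results` methods, and `run_batch` are hypothetical names, and the client is an in-memory stand-in so the control flow can run without network access.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class FakeBatchClient:
    """Hypothetical in-memory stand-in for a provider batch API."""

    polls_until_done: int = 2  # pretend the job finishes on the 2nd status check
    _jobs: dict = field(default_factory=dict)

    def submit(self, requests: list[str]) -> str:
        """Accept the whole workload at once and return a job id."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {"requests": list(requests), "polls": 0}
        return job_id

    def status(self, job_id: str) -> str:
        job = self._jobs[job_id]
        job["polls"] += 1
        done = job["polls"] >= self.polls_until_done
        return "completed" if done else "in_progress"

    def results(self, job_id: str) -> list[str]:
        return [f"result for {r}" for r in self._jobs[job_id]["requests"]]


def run_batch(client, requests, poll_interval_s=0.0, timeout_s=60.0):
    """Submit once, then poll until the job completes or the timeout expires."""
    job_id = client.submit(requests)
    deadline = time.monotonic() + timeout_s
    while client.status(job_id) != "completed":
        if time.monotonic() > deadline:
            raise TimeoutError(f"batch job {job_id} did not finish in time")
        time.sleep(poll_interval_s)
    return client.results(job_id)


outputs = run_batch(FakeBatchClient(), ["doc-1", "doc-2"])
print(outputs)
```

In a real job the poll interval would be minutes rather than zero, and the job id would normally be persisted so that a restarted worker can resume polling instead of resubmitting.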

Example

A nightly job re-embeds 2M product descriptions through the realtime embeddings endpoint, at a cost of $4k/month. The same work via the batch endpoint with a 24h SLA costs $2k/month. The team only discovers the overspend when a cost review asks why embedding cost grew with the catalog.
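
The arithmetic behind this example can be written out as a small helper. The 50% discount is the assumption that matches the $4k versus $2k figures above; `batch_savings` is a hypothetical name used only for illustration.

```python
def batch_savings(realtime_monthly_usd: float, batch_discount: float) -> dict:
    """Compare monthly spend on the realtime tier with the batch tier.

    batch_discount is the fraction taken off the realtime price
    (0.5 is assumed here to match the example's figures).
    """
    batch_monthly = realtime_monthly_usd * (1 - batch_discount)
    monthly_savings = realtime_monthly_usd - batch_monthly
    return {
        "batch_monthly_usd": batch_monthly,
        "monthly_savings_usd": monthly_savings,
        "annual_savings_usd": monthly_savings * 12,
    }


print(batch_savings(4000, 0.5))
```

Under these assumptions the job overspends by $2k per month, or $24k per year, and the gap widens as the catalog grows.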

Diagram

flowchart TD
    Job[Nightly batch job] --> RT[Realtime API]
    RT --> Cost[2-10x batch cost]
    Cost --> Invis[Hidden in monthly bill]
    classDef bad fill:#fee,stroke:#c33;
    class RT,Cost,Invis bad;

Solution

Therefore:

Identify model calls whose results are consumed asynchronously. Submit them via the provider's batch API (50% cheaper at OpenAI, similar at Anthropic). Poll for completion, or receive a webhook where the provider offers one. Reserve realtime for genuinely user-facing or sub-minute-latency workloads. Track 'realtime calls without a realtime latency requirement' as a metric in cost observability.
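
One way to make the selection rule and the metric concrete is sketched below. It assumes each call site declares its latency budget up front; `ModelCall`, `recommended_api`, `misrouted_realtime_calls`, and the 24-hour `BATCH_TURNAROUND` threshold are illustrative assumptions, not part of any provider SDK.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumed batch turnaround (24h, as in the example); adjust to the provider's SLA.
BATCH_TURNAROUND = timedelta(hours=24)


@dataclass(frozen=True)
class ModelCall:
    name: str
    latency_budget: timedelta  # how long the consumer can wait for the result
    uses_realtime: bool        # which API the call site uses today


def recommended_api(call: ModelCall) -> str:
    """Pick batch whenever the latency budget covers the batch turnaround."""
    return "batch" if call.latency_budget >= BATCH_TURNAROUND else "realtime"


def misrouted_realtime_calls(calls: list[ModelCall]) -> list[ModelCall]:
    """Realtime calls that have no realtime latency requirement."""
    return [c for c in calls if c.uses_realtime and recommended_api(c) == "batch"]


calls = [
    ModelCall("chat_reply", timedelta(seconds=2), uses_realtime=True),
    ModelCall("nightly_reembed", timedelta(hours=24), uses_realtime=True),
    ModelCall("weekly_report", timedelta(days=7), uses_realtime=False),
]
flagged = misrouted_realtime_calls(calls)
print([c.name for c in flagged], len(flagged))
```

The count of flagged calls is the number a cost-observability dashboard could track over time; here only `nightly_reembed` is flagged, because it runs on the realtime API while its budget would allow batching.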

What this pattern forbids. Nothing useful: as an anti-pattern it imposes no constraint. The constraint it is missing is latency-budget-aware API selection.

The patterns that counter or replace it:

complements Cost Observability ★★: Surface per-request, per-user, and per-feature cost and token consumption to operators in near-real-time.
complements Cost Gating ★★: Block actions whose expected cost exceeds a threshold without explicit user (or operator) acknowledgement.
complements Top-Tier Model For Everything (Cost) ✕: Anti-pattern: route every request through the highest-tier model regardless of difficulty, treating cost as a model-choice problem instead of a routing one.
complements Prompt Caching ★★: Order prompts so the unchanging prefix can be cached by the provider, cutting per-call cost and latency.
complements Tool Result Caching ★★: Cache the result of expensive deterministic tool calls keyed by their arguments so repeat calls within a session return immediately.

References

LLM APIコスト削減の落とし穴 ("Pitfalls of cutting LLM API costs"), blog

Provenance

Source: patterns/realtime-when-batchable.md on GitHub · commit 0f962e5
Added to catalog: 2026-05-23
Last updated: 2026-05-23
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.