Demo-to-Production Cliff

also known as Pilot-to-Production Failure, Scale-Gap Failure, Die Demo funktioniert — die Produktion nicht

Anti-pattern: ship a demo-validated agent straight into production without a frozen eval, cost ceiling, loop-detector, or named oncall, then act surprised when accuracy drops and cost runs away.

Context

An agent has been built and demoed successfully against a curated set of inputs in a clean environment. Stakeholders are convinced; the model 'works'. The team now wants to ship it to production traffic — variable input distributions, real concurrency, real rate limits, real cost meters, real adversarial inputs.

Problem

Demo conditions hide most of what kills agents in production. Latency at low concurrency does not predict p99 under load. A 95% pass rate on a hand-picked eval does not predict accuracy on the long tail. Token spend on a few demo turns does not predict the cost of an undetected recursive multi-agent conversation running overnight. Industry surveys (88% of agents never reach production; 70–95% failure rate among those that do) consistently attribute the gap to missing evaluation infrastructure, monitoring, dedicated ownership — not to model quality. The t3n analysis names this directly: it is not the model that fails, it is the architecture around it.

Forces

Demos reward speed-to-impressive-output; production rewards stability under load that the demo never sees.
Per-query cost is invisible until traffic scales; recursive loops between agents can drain a budget in days without tripping any classical alert.
Eval suites that worked in development are rarely re-run as the model, tools, or prompt drift; what looked safe at v1 is unmeasured at v17.
Ownership of agent operations sits between the ML, platform, and product teams; without a named owner, monitoring and cost gating fall through the gap.

Example

A four-agent research assistant nails its demo: clean queries, three-agent rounds, ~$0.40 per answer, 8-second latency. It ships. Two weeks later, finance flags a $47k spend over 11 days. Investigation finds one of the agent pairs has been in a self-perpetuating clarification loop on a class of malformed inputs that never appeared in the demo set; no step-budget, no cost-observability dashboard, no oncall. Postmortem conclusion: the model worked; the architecture around it had no production-readiness gates.

Diagram

flowchart TD Demo[Curated demo: 95% pass, 2s latency, $0.40/run] --> Ship[Ship to production] Ship --> Bad{Production-readiness gates present?} Bad -- no --> Cliff[Cliff: accuracy drops, p99 explodes, cost runs away] Cliff --> Inc[Incident — undetected loop, blown budget, missed SLA] Bad -- yes --> Gated[Frozen eval + cost ceiling + loop-detector + oncall] Gated --> Safe[Production with bounded risk] classDef bad fill:#fee,stroke:#c33; class Cliff,Inc bad;

Solution

Therefore:

Treat the demo as the beginning of evaluation, not its conclusion. Stand up an eval harness with a frozen rubric before production traffic; gate deploys on it. Add cost-observability per agent-run and a hard budget ceiling per session. Add loop-detection (typed-tool-loop-detector or step-budget) to catch recursive multi-agent chatter. Replay production traffic in a shadow-canary before promotion. Name an oncall for the agent system the same way as for any other production service.

What this pattern forbids. No useful constraint; the missing constraint is mandatory production-readiness gating (frozen eval, cost ceiling, loop-detector, named oncall) before any agent ships to live traffic.

The smaller patterns that complete this one —

generalisesDemo-Production Cliff (Multi-Agent)✕— Anti-pattern: multi-agent pilot benchmarks at 95% accuracy / 2s latency on a curated demo set, then degrades to ~80% / 40s under realistic 10k-RPD load.

The patterns that counter or replace it —

complementsPerma-Beta✕— Anti-pattern: ship the agent in 'beta' indefinitely so that quality regressions are someone else's problem.
complementsUnbounded Loop✕— Anti-pattern: run the agent loop without a step budget and let model self-termination decide.
alternative-toCost Observability★★— Surface per-request, per-user, and per-feature cost and token consumption to operators in near-real-time.
alternative-toEval as Contract★★— Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
alternative-toShadow Canary★★— Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
complementsErrors Swept Under the Rug✕— Anti-pattern: scrub failed actions, stack traces, and error observations from the agent's own context so the trace looks clean, leaving the model with no evidence of what did not work.
alternative-toStep Budget★★— Cap the number of tool calls or loop iterations the agent is allowed within a single request.
complementsAutomating a Broken Process✕— Anti-pattern: deploy agents on top of a workflow that is already dysfunctional, so the dysfunction is amplified at machine speed instead of resolved.
complementsAgentic Debt✕— Anti-pattern: deploy agents on top of an unconsolidated data foundation, weak governance, or missing MLOps infrastructure, so every subsequent capability — observability, retraining, compliance retrofit — pays compounding interest on the skipped foundational work.
alternative-to[evaluation-driven-development]
alternative-toSilent Pilot-to-Production Promotion★— Anti-pattern: let a well-performing pilot quietly expand in scope until it is a de facto production decision system, while keeping the 'pilot' label so it never trips the go-live governance gate.