Demo-to-Production Cliff
Anti-pattern: ship a demo-validated agent straight into production without a frozen eval, cost ceiling, loop-detector, or named oncall, then act surprised when accuracy drops and cost runs away.
Problem
Demo conditions hide most of what kills agents in production. Latency at low concurrency does not predict p99 under load. A 95% pass rate on a hand-picked eval does not predict accuracy on the long tail. Token spend on a few demo turns does not predict the cost of an undetected recursive multi-agent conversation running overnight. Industry surveys (88% of agents never reach production; 70–95% failure rate among those that do) consistently attribute the gap to missing evaluation infrastructure, monitoring, and dedicated ownership, not to model quality. The t3n analysis names this directly: it is not the model that fails, it is the architecture around it.
Solution
Treat the demo as the beginning of evaluation, not its conclusion. Stand up an eval harness with a frozen rubric before any production traffic arrives, and gate deploys on it. Track cost per agent run and enforce a hard budget ceiling per session. Add loop detection (typed-tool-loop-detector or step-budget) to catch recursive multi-agent chatter. Replay production traffic in a shadow-canary before promotion. Name an oncall for the agent system, the same way you would for any other production service.
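The budget ceiling and loop detection can be combined in one small guard that every tool call passes through. What follows is a minimal sketch, not a reference implementation: `RunGuard`, its field names, and all default limits are hypothetical, and "loop" is approximated here as the same tool being called with identical arguments more than a fixed number of times in one session.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a session crosses its step budget or cost ceiling."""


class LoopDetected(RuntimeError):
    """Raised when an identical tool call repeats too often."""


@dataclass
class RunGuard:
    # All limits are illustrative assumptions; tune them per agent.
    max_steps: int = 20
    max_cost_usd: float = 1.00
    max_repeats: int = 3
    steps: int = 0
    cost_usd: float = 0.0
    _seen: Counter = field(default_factory=Counter, repr=False)

    def record(self, tool: str, args: dict, cost_usd: float) -> None:
        """Account for one tool call; raise if any limit is crossed."""
        self.steps += 1
        self.cost_usd += cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step budget of {self.max_steps} exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost ceiling of ${self.max_cost_usd:.2f} exceeded")
        # Canonicalise arguments so key order does not hide a repeat.
        key = (tool, json.dumps(args, sort_keys=True))
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            raise LoopDetected(f"{tool} called {self._seen[key]} times with identical arguments")
```

In use, the agent runtime would create one guard per session, call `record` before dispatching each tool call, and treat either exception as a reason to end the session and alert the oncall. Exact-match repeat counting will miss loops whose arguments drift slightly, which is why the step budget and cost ceiling remain as backstops.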
When to use
- Never. Cite this anti-pattern when reviewing the deploy plan of any agent that has only been validated against a curated demo set.
- Demand a frozen eval, cost ceiling, loop-detector, and named oncall before sign-off.
- Stage at production-scale shadow traffic, not demo-scale.
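The sign-off demands in the list above can be expressed as a mechanical check on the deploy plan. This is a sketch under stated assumptions: `DeployPlan`, `review`, and the 0.90 pass-rate threshold are hypothetical, and "frozen" is interpreted as "the rubric's SHA-256 still matches a hash pinned when the eval was approved".

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def rubric_fingerprint(rubric_text: str) -> str:
    """Hash the rubric so any later edit to it is detectable."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class DeployPlan:
    # Hypothetical shape of a deploy plan; adapt to your own review template.
    rubric_text: str
    pinned_rubric_sha256: str
    eval_pass_rate: float
    cost_ceiling_usd: Optional[float]
    loop_detector_enabled: bool
    oncall: Optional[str]


def review(plan: DeployPlan, min_pass_rate: float = 0.90) -> List[str]:
    """Return blocking findings; an empty list means the plan may proceed."""
    findings = []
    if rubric_fingerprint(plan.rubric_text) != plan.pinned_rubric_sha256:
        findings.append("eval rubric changed since it was frozen")
    if plan.eval_pass_rate < min_pass_rate:
        findings.append("eval pass rate below deploy threshold")
    if plan.cost_ceiling_usd is None or plan.cost_ceiling_usd <= 0:
        findings.append("no hard cost ceiling per session")
    if not plan.loop_detector_enabled:
        findings.append("no loop detector or step budget")
    if not plan.oncall:
        findings.append("no named oncall")
    return findings
```

A reviewer or CI job could refuse sign-off whenever `review(plan)` returns anything. The check does not cover whether staging ran at production-scale shadow traffic; that still needs evidence from the canary itself.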