Demo-Production Cliff (Multi-Agent)
Anti-pattern: a multi-agent pilot benchmarks at 95% accuracy / 2s latency on a curated demo set, then degrades to ~80% accuracy / ~40s latency under realistic load of more than 10k requests per day (RPD).
Problem
Under realistic load (>10k requests/day), accuracy drops to ~80% and latency rises to ~40s. The curated demo set did not capture the long-tail distribution of real traffic. Multi-agent coordination overhead compounds the problem: each agent's small accuracy loss multiplies across the chain. Engineers cannot debug the result because no single agent is 'wrong'; the system as a whole is just worse. This entry differs from the existing demo-to-production-cliff pattern in two ways: it is specific to multi-agent systems, and its degradation figures are quantified from 2026 reporting by the German publication t3n.
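The compounding effect can be made concrete with a small calculation. The sketch below assumes a strictly sequential chain in which every agent must succeed and errors are independent; real systems may retry, vote, or fail in correlated ways, so treat the result as a rough estimate. The function names, the chain depth of 5, and the per-agent accuracies are illustrative assumptions, not figures from the source.

```python
from math import prod


def chain_accuracy(per_agent_accuracy):
    """Aggregate accuracy of a sequential chain.

    Assumes every agent must succeed and that agent errors are
    independent (a simplifying assumption, not a measured property).
    """
    return prod(per_agent_accuracy)


def uniform_chain_accuracy(per_agent, depth):
    """Same estimate when all agents share one accuracy value."""
    return chain_accuracy([per_agent] * depth)


# Hypothetical 5-agent chain: each agent looks excellent on the demo set
# and only slightly worse on long-tail production traffic.
demo_aggregate = uniform_chain_accuracy(0.99, 5)
prod_aggregate = uniform_chain_accuracy(0.956, 5)

print(f"demo set:   {demo_aggregate:.3f}")
print(f"production: {prod_aggregate:.3f}")
```

Under these assumptions, a per-agent drop of about 3.4 percentage points is enough to move the chain from roughly 95% to roughly 80%, which is why no individual agent looks broken in isolation.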
Solution
Use real production traffic (shadow mode, sampled replay) as the pilot benchmark, not curated demo sets. Track p50, p95, and p99 latency, plus accuracy, by traffic class. Decompose accuracy per agent and analyze chain depth to predict aggregate behavior. Reject rollouts whose tail-latency or accuracy degradation under shadow load exceeds preset thresholds. Pair with demo-to-production-cliff awareness and shadow-canary patterns.
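A rollout gate along these lines could be sketched as follows. This is a minimal illustration, not a reference implementation: the record shape (traffic_class, latency_s, correct), the Thresholds fields, the threshold values, and the nearest-rank percentile method are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


@dataclass
class Thresholds:
    # Hypothetical limits; set them from your own service objectives.
    max_p95_latency_s: float
    max_p99_latency_s: float
    min_accuracy: float


def evaluate_shadow_run(records, thresholds):
    """Group shadow-traffic records by class and check each class.

    records: iterable of (traffic_class, latency_s, correct) tuples.
    Returns {traffic_class: (passed, metrics)}.
    """
    by_class = {}
    for traffic_class, latency_s, correct in records:
        by_class.setdefault(traffic_class, []).append((latency_s, correct))

    report = {}
    for traffic_class, rows in by_class.items():
        latencies = [lat for lat, _ in rows]
        metrics = {
            "p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
            "p99": percentile(latencies, 99),
            "accuracy": sum(1 for _, ok in rows if ok) / len(rows),
        }
        passed = (
            metrics["p95"] <= thresholds.max_p95_latency_s
            and metrics["p99"] <= thresholds.max_p99_latency_s
            and metrics["accuracy"] >= thresholds.min_accuracy
        )
        report[traffic_class] = (passed, metrics)
    return report


def rollout_allowed(report):
    """Reject the rollout if any traffic class misses its thresholds."""
    return all(passed for passed, _ in report.values())


# Tiny synthetic sample standing in for sampled production replay.
records = [("common", 1.0 + i * 0.1, True) for i in range(10)]
records += [("long_tail", 40.0, i % 2 == 0) for i in range(10)]
limits = Thresholds(max_p95_latency_s=5.0, max_p99_latency_s=10.0, min_accuracy=0.9)
report = evaluate_shadow_run(records, limits)
```

Evaluating per traffic class rather than on the pooled sample is the point of the design: a healthy majority class would otherwise hide the long-tail class that fails the gate.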
When to use
- Never apply this anti-pattern. Cite it when reviewing multi-agent pilot results.
- Benchmark against production-shaped (shadow) traffic from day one.
- Decompose per-agent accuracy and predict chain aggregate before rollout.