Perma-Beta
also known as Forever Beta, Eval Vacuum
Anti-pattern: ship the agent in 'beta' indefinitely so that quality regressions are someone else's problem.
Context
A team launches an agent product to real users under a 'beta' label, without building an evaluation harness that can measure quality regressions across releases. Months later, the product is still labelled beta, partly because the team genuinely has not measured quality, partly because removing the label would commit them to a quality bar they have no way to defend. The label has quietly shifted from a signal of active iteration to a shield against accountability.
Problem
Without an evaluation harness, every release is a guess: regressions land invisibly, model upgrades are accepted or rejected on vibe, and customer-facing quality drifts without anyone noticing until churn reveals it. Beta becomes a permanent excuse that costs nothing to keep and absorbs all accountability for unmeasured quality. Eventually a competitor ships a graduated version of a similar product and the beta team discovers, too late, that they never had a measurement story.
Forces
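- An eval harness costs engineering time to build, and that time competes with feature work.
- A GA promise commits the team to a quality bar it must then defend.
- The beta label lets the product team move fast without making that commitment.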
Example
A startup launches its agent product as 'beta' and uses the label as a blanket excuse for any quality complaint. Eighteen months later the agent is still beta, there is no eval harness, and customers have started churning to a competitor that ships GA. The team names the failure perma-beta and forces an exit: build the eval suite, set quality gates, fix the regressions blocking GA, and remove the beta label. The label was hiding the fact that nobody actually knew whether the product was getting better or worse.
Solution
Therefore:
Don't. Build the eval harness and exit beta. See eval-harness, llm-as-judge, shadow-canary.
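As a rough illustration of what "build the eval harness and exit beta" can mean in practice, the sketch below scores two agent versions against the same held-out cases and applies a release gate. Everything in it is hypothetical: `EvalCase`, `pass_rate`, `release_gate`, the toy agents, the exact-match scoring, and the 0.75 GA bar are assumptions chosen for the example, not part of any particular library or of the pattern itself. A real harness would likely use a larger dataset and a richer scorer, such as an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# An agent is modelled here as a plain function from prompt to answer (assumption).
Agent = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    """One held-out example: an input and the answer we expect."""
    prompt: str
    expected: str


def pass_rate(agent: Agent, cases: Sequence[EvalCase]) -> float:
    """Fraction of cases where the agent's output exactly matches the expectation."""
    if not cases:
        raise ValueError("eval set must not be empty")
    passed = sum(1 for case in cases if agent(case.prompt).strip() == case.expected)
    return passed / len(cases)


def release_gate(baseline: float, candidate: float, ga_bar: float,
                 max_drop: float = 0.0) -> dict:
    """Decide whether a candidate may ship.

    regressed:    candidate scores more than max_drop below the baseline.
    meets_ga_bar: candidate reaches the quality bar the team commits to at GA.
    """
    regressed = candidate < baseline - max_drop
    meets_ga_bar = candidate >= ga_bar
    return {
        "regressed": regressed,
        "meets_ga_bar": meets_ga_bar,
        "ship": (not regressed) and meets_ga_bar,
    }


# Toy held-out set and two toy agent versions, for illustration only.
CASES = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "9"),
    EvalCase("opposite of hot", "cold"),
]


def agent_v1(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    return answers.get(prompt, "unknown")


def agent_v2(prompt: str) -> str:
    # The newer version silently lost the multiplication answer.
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


baseline = pass_rate(agent_v1, CASES)
candidate = pass_rate(agent_v2, CASES)
decision = release_gate(baseline, candidate, ga_bar=0.75)
print(baseline, candidate, decision)
```

With these toy agents the second version regresses on one case, so the gate blocks the release instead of letting the drop land unnoticed. The point of the sketch is only that the same numbers answer both questions the beta label was hiding: whether a release got worse, and whether the product meets a bar worth calling GA.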
What this pattern forbids. By definition, this anti-pattern imposes no useful constraint; the missing constraint is the failure mode.
And the patterns that stand alongside it, or against it:
- alternative-to: Eval Harness ★★. Run a held-out dataset against agent versions to detect regressions and measure improvement.
- alternative-to: Shadow Canary ★★. Run a candidate agent version in shadow alongside the champion, comparing outputs without affecting users.
- conflicts-with: Eval as Contract ★★. Treat the eval suite as the contract the agent must satisfy; releases ship only if evals pass.
- complements: Demo-to-Production Cliff ✕. Anti-pattern: ship a demo-validated agent straight into production without a frozen eval, cost ceiling, loop-detector, or named oncall, then act surprised when accuracy drops and cost runs away.
- complements: Automating a Broken Process ✕. Anti-pattern: deploy agents on top of a workflow that is already dysfunctional, so the dysfunction is amplified at machine speed instead of resolved.
- complements: Agentic Skill Atrophy ✕. Anti-pattern: let agents take over routine architectural and debugging decisions in code until developers no longer form the implicit knowledge that lets them review the agent's output or recover when it fails.
- complements: Agentic Debt ✕. Anti-pattern: deploy agents on top of an unconsolidated data foundation, weak governance, or missing MLOps infrastructure, so every subsequent capability (observability, retraining, compliance retrofit) pays compounding interest on the skipped foundational work.
- alternative-to: Rigor Relocation ★. Relocate verification rigor from the model loop to surrounding scaffolding (evals, judges, decision logs, policy gates) so failures are caught by the wrapper rather than the agent.
- complements: Hidden Validation-Work Amplification ✕. Anti-pattern: an agent rollout shifts effort from doing the work to validating, monitoring, and recalibrating the agent; net productivity is negative because the hidden human evaluation burden exceeds the visible automation gain.