The engineering leader who green-lights a 60% pass@1 agent owns the 3am page when an upstream API returns a 429 and the orchestration silently retries into a corrupted cart. Re-run the same task ten times and that 60% can collapse to 25%. The benchmark number was never the SLA. It was a single coin flip in a suit.
Three pieces of early-2026 research now put hard numbers on what practitioners already suspected. This post is what to measure instead, and how to instrument it before the first customer touches the system.
Pass@1 is a capability metric. Production needs a reliability metric.
ReliabilityBench (Gupta, 2026) evaluates tool-using agents under production-like stress: perturbed inputs, injected faults, repeated runs. Perturbations alone drag agent success from 96.9% to 88.1%. Injected rate limits are the single most damaging fault category. The authors are blunt about why existing leaderboards mislead.
> Today's leaderboards miss reliability properties required in production.
> — Gupta, ReliabilityBench (2026)
Khanal et al. formalise the gap with pass^k, the probability that an agent succeeds on all k repeated runs of the same task. If episodes were i.i.d. Bernoulli, pass^k = p^k. Across 23,392 episodes, frontier models hold above 80% pass@1 on short tasks but drop to 52% pass^k on long-horizon ones. The leaderboard cannot see that drop. Your pager will.
Sit with the pass^k math for a moment, because the cliff is steeper than most leaders assume. Assume tasks are independent (they are not, and in practice it is usually worse) and that pass^k = p^k. At a headline pass@1 of p = 0.9, pass^5 is 0.59 and pass^10 is 0.35. At p = 0.8, pass^5 is 0.33 and pass^10 is 0.11. A 90% agent that has to complete the same workflow ten consecutive times for ten different users fails at least once on roughly two-thirds of those days. An 80% agent fails almost nine days in ten. The leaderboard says ship. The arithmetic says do not.
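If you want the cliff on a slide, the arithmetic fits in a few lines. This is only the i.i.d. approximation stated above, nothing more:

```ts
// The i.i.d. approximation from the text: pass^k = p^k.
const passK = (p: number, k: number): number => Math.pow(p, k);

for (const p of [0.9, 0.8]) {
  // Reproduces the numbers in the paragraph: 0.9 -> 0.59 / 0.35, 0.8 -> 0.33 / 0.11.
  console.log(`p=${p}  pass^5=${passK(p, 5).toFixed(2)}  pass^10=${passK(p, 10).toFixed(2)}`);
}
```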
The Princeton group makes the same point from a theoretical angle. Accuracy cannot tell a predictable failure from an unpredictable one.
> The former permits systematic debugging while the latter does not.
> — Rabanser et al., Towards a Science of AI Agent Reliability (2026)
For an engineering leader, that is the difference between a roadmap and a permanent on-call rotation.
The real failure mode is deterministic interfaces meeting probabilistic outputs.
Shah et al. mined 385 faults from 40 open-source agentic repos and validated them against 145 practitioners. Five architectural fault dimensions, mapped to propagation pathways. The central finding: failures arise not only from faulty code or model hallucinations, but also from agent orchestration, evolving internal state, and interactions with environmental feedback.
Three of those dimensions deserve names, because they are the ones your team has probably never written a test for.
- Orchestration faults: the planner picking the wrong tool, calling tools in the wrong order, or losing track of which sub-agent owns which step.
- State-evolution faults: the agent's internal memory drifting away from reality across turns. A cart that quietly de-syncs from inventory. A "user preference" that mutated three steps ago and was never re-read.
- Environmental-feedback faults: the agent acting on tool responses that are valid by schema and wrong by meaning. An empty list interpreted as "no results" when it was actually a rate-limit fallback.
None of these show up as model errors. All of them show up as user-visible bugs.
Translate that for the deck you have to present to the CFO. It is not "the model hallucinated." It is "the model emitted a valid-looking JSON that the downstream tool parsed and acted on." Clean benchmarks never surface this because the harness validates and retries silently. Production does neither.
The downstream blast radius is the part that ends careers. The planner picks tool B when the prior step's state implied tool A, the orchestrator has no contract to catch it, and the side effects are real: a refund issued, a record updated, a ticket closed. By the time the eval suite would have flagged anything, the row in the database is wrong and the customer has already screenshotted it.
AWS arrives at the same conclusion from inside the building: agentic AI systems require a fundamental shift in evaluation methodologies that assess not only the underlying model performance but also the emergent behaviours of the complete system. Score tool selection, multi-step reasoning coherence, and error recovery. Score the system, not the model.
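What "score the system" can look like in the smallest case: a grader that checks tool selection against the intended sequence for a task. The trace shape and the scoring rule here are illustrative assumptions, not AWS's published methodology.

```ts
// Illustrative system-level grader: score tool selection against an expected sequence.
// The expected-sequence shape is an assumption for this sketch, not a published schema.
type ToolStep = { step: number; tool: string; ok: boolean };

function toolSelectionScore(expected: string[], trace: ToolStep[]): number {
  if (expected.length === 0) return 1;
  let correct = 0;
  expected.forEach((tool, i) => {
    // A step counts only if the right tool was called at the right position and it succeeded.
    if (trace[i] && trace[i].tool === tool && trace[i].ok) correct++;
  });
  return correct / expected.length; // 1.0 only when the whole plan executed as intended
}
```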
Three dimensions to instrument: consistency, robustness, fault tolerance.
Consistency. Run every eval task at least ten times. Track pass^k alongside pass@1. Surface variance, not just the mean. Khanal et al. give you four diagnostics that drop into a dashboard.
| Diagnostic | What it tells you |
|---|---|
| Reliability Decay Curve | How fast pass^k falls as k grows. |
| Variance Amplification Factor | Whether failures cluster on the same tasks or scatter across them. |
| Graceful Degradation Score | Whether partial success is partial or catastrophic. |
| Meltdown Onset Point | The k at which the agent stops being a product and starts being a liability. |
Put the four on a dashboard and stop arguing about whether the agent "works."
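The first diagnostic is cheap to compute from the repeated runs you already have. A sketch: it estimates pass^k as "all of the first k trials passed", which is one reasonable estimator, not necessarily the exact formulation Khanal et al. use.

```ts
// Empirical Reliability Decay Curve: for each k, the share of tasks whose first k trials all passed.
// Assumes a non-empty eval set with N ordered trial results per task.
type TrialOutcomes = boolean[];

function decayCurve(tasks: TrialOutcomes[], maxK: number): number[] {
  return Array.from({ length: maxK }, (_, i) => {
    const k = i + 1;
    const allPass = tasks.filter((trials) => trials.slice(0, k).every(Boolean)).length;
    return allPass / tasks.length; // pass^k estimate across the eval set
  });
}
// A flat curve means consistency; a steep one means the agent is a coin flip in a suit.
```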
Robustness. Perturb the inputs the agent sees. Paraphrase user requests. Reorder tool descriptions. Swap synonymous field names in the function-calling schema. ReliabilityBench's ~9-point drop under perturbation is the budget your harness has to beat. A system that survives the canonical wording and collapses on a synonym is not robust; it is overfit to the demo script. Most production traffic does not match the demo script. Most production traffic is the demo script's evil twin.
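The perturbation layer does not need to be clever. A minimal sketch, where the paraphrase lookup and the field-rename rule are assumptions you would maintain per project:

```ts
// Sketch of input perturbations for robustness runs.
type Perturbation = "paraphrase" | "reorder-tools" | "rename-fields";

interface EvalInput {
  request: string;
  tools: { name: string; description: string }[];
}

function perturb(input: EvalInput, kind: Perturbation, paraphrases: Record<string, string>): EvalInput {
  if (kind === "paraphrase") {
    // Swap the canonical wording for a maintained paraphrase where one exists.
    return { ...input, request: paraphrases[input.request] ?? input.request };
  }
  if (kind === "reorder-tools") {
    // Same tools, different presentation order.
    return { ...input, tools: [...input.tools].reverse() };
  }
  // "rename-fields": synonymous field names in the schema text (illustrative rename only).
  return {
    ...input,
    tools: input.tools.map((t) => ({
      ...t,
      description: t.description.replace(/\bquery\b/g, "search_term"),
    })),
  };
}
```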
Fault tolerance. Inject what production will inject: 429s, 5xx, schema drift, partial tool responses, latency spikes, truncated context windows. Score recovery, not survival. Tests that only check whether the agent eventually returned an answer miss the corrupted-cart class entirely.
The taxonomy's propagation pathways tell you which faults cascade and which die quietly, and the cascading ones are the checklist you actually need.
- Orchestration faults propagate furthest, because every subsequent step inherits the wrong plan.
- State-evolution faults are the most expensive to debug, because the visible symptom appears turns after the root cause.
- Environmental-feedback faults are the most likely to silently corrupt downstream systems, because the agent does not know it failed.
A fault-injection harness that does not stress these three pathways is testing for the easy cases. Build the harness so each pathway has at least one canary scenario, and run them every night.
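What a pathway canary looks like varies by system. A minimal sketch of a nightly canary set, with one scenario per pathway; the scenario names and invariants are illustrative, drawn from the failure examples above:

```ts
// One nightly canary per propagation pathway. Scenarios and assertions are illustrative.
type Pathway = "orchestration" | "state-evolution" | "environmental-feedback";

interface CanaryScenario {
  pathway: Pathway;
  description: string;
  inject: string;   // fault injected before the run
  mustHold: string; // invariant checked after the run
}

const nightlyCanaries: CanaryScenario[] = [
  {
    pathway: "orchestration",
    description: "Planner is offered two near-duplicate tools; only one is valid for the prior state.",
    inject: "ambiguous-tool-pair",
    mustHold: "tool choice matches the state contract on every step",
  },
  {
    pathway: "state-evolution",
    description: "Inventory changes mid-conversation after the agent has cached it.",
    inject: "stale-memory",
    mustHold: "agent re-reads inventory before committing the cart",
  },
  {
    pathway: "environmental-feedback",
    description: "Tool returns an empty list caused by a rate-limit fallback, not by zero results.",
    inject: "rate-limit-masquerading-as-empty",
    mustHold: "agent distinguishes 'no results' from 'degraded response' before acting",
  },
];
```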
Fold Anthropic's eval-engineering discipline into the same loop. Combine code-based, model-based, and human graders. Seed eval sets from real production failures. Treat the eval suite as a maintained system, not a one-off.
> Good evaluations help teams ship AI agents more confidently. Without them, it's easy to get stuck in reactive loops, catching issues only in production.
> — Anthropic, Demystifying evals for AI agents (2026)
```ts
// The minimum eval contract per task: enough metadata to reconstruct *why* a run passed or failed.
type EvalRun = {
  taskId: string
  trial: number // 1..N, N >= 10
  perturbation?: "paraphrase" | "reorder-tools" | "rename-fields"
  inject?: "rate-limit" | "5xx" | "schema-drift" | "partial-response"
  passed: boolean
  recoveryPath: "none" | "retry" | "fallback" | "user-handoff"
  toolTrace: { step: number; tool: string; ok: boolean }[]
}

// pass^k for a task = product of `passed` across k trials with identical seed
// pass@1 for a task = mean(passed) across trials
```

The contract is deliberately small. The point is that every run carries enough metadata to reconstruct why it passed or failed: which perturbation, which injected fault, which `recoveryPath`, which tool step. A pass/fail boolean is a leaderboard. The shape above is a debugger.
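Rolling the contract up into the two headline numbers is then mechanical. A sketch that follows the comments in the contract; the grouping helper is ours, not part of any published harness:

```ts
// Aggregate EvalRun records (from the contract above) into pass@1 and pass^k per task.
function byTask(runs: EvalRun[]): Map<string, EvalRun[]> {
  const grouped = new Map<string, EvalRun[]>();
  for (const run of runs) {
    const existing = grouped.get(run.taskId) ?? [];
    existing.push(run);
    grouped.set(run.taskId, existing);
  }
  return grouped;
}

function passAt1(trials: EvalRun[]): number {
  return trials.filter((t) => t.passed).length / trials.length; // mean(passed)
}

function passKForTask(trials: EvalRun[], k: number): number {
  // 1 if the first k trials all passed, else 0; average across tasks for the suite-level number.
  return trials.slice(0, k).every((t) => t.passed) ? 1 : 0;
}
```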
How DAD builds this in: the harness ships before the agent.
We build the eval harness first. Adversarial pipelines with fault injection are the dev loop, not a pre-launch gate. Every PR runs the suite. A regression in pass^k blocks merge even when pass@1 improves. Capability gains that trade away reliability are not gains.
Three concrete defaults we set on every agentic build:
- N=10 minimum per eval task. `pass^k` tracked in CI alongside `pass@1`, both reported in the PR body. A drop in `pass^k` of more than 5 points fails the check, even when `pass@1` ticks up.
- Fault-injection middleware in front of every external tool call. Default failure modes are `429`, `5xx`, schema drift, and latency above the upstream p99. The harness flips them on for a configurable share of trials and labels each run so the failure mode is recoverable from the trace alone (a minimal sketch follows this list).
- Production-seeded eval set. Every real incident becomes a new eval case within a week. The suite grows as the system meets the world. The eval set at month six looks nothing like the eval set at launch, and that is the point: the harness is a record of everything that ever surprised us.
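The middleware itself is small. A sketch; the `callTool` signature and the specific fault payloads are assumptions specific to your tool layer:

```ts
// Fault-injection middleware wrapped around every external tool call.
type InjectedFault = "rate-limit" | "5xx" | "schema-drift" | "partial-response" | "latency";

interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

function withFaultInjection(
  callTool: (call: ToolCall) => Promise<unknown>,
  fault: InjectedFault | undefined, // chosen per trial by the harness and labeled in the trace
) {
  return async (call: ToolCall): Promise<unknown> => {
    switch (fault) {
      case "rate-limit":
        throw Object.assign(new Error("429 Too Many Requests"), { status: 429 });
      case "5xx":
        throw Object.assign(new Error("502 Bad Gateway"), { status: 502 });
      case "latency":
        // Hold the response past the upstream p99 so timeout handling gets exercised.
        await new Promise((resolve) => setTimeout(resolve, 30_000));
        return callTool(call);
      case "schema-drift":
        // Structurally valid but renamed payload, so downstream parsing has to cope.
        return { items: [], page_info: { has_more: false } };
      case "partial-response":
        return { items: null };
      default:
        return callTool(call);
    }
  };
}
```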
This extends the operating discipline we already publish in /blog/production-ai-checklist on security, audit, and monitoring, adding the reliability column that the 2026 research now forces.
What to do on Monday morning.
Pick one agent you are about to ship. Take its top 20 eval tasks. Re-run each one ten times. Compute pass^k. If pass^k sits more than 15 points below pass@1, hold the launch and build the fault-injection harness before the next feature. The number you will lose by waiting is smaller than the number you will lose by paging at 3am for six months.
References
- ReliabilityBench — Gupta (2026)
- Beyond pass@1 — Khanal et al. (2026)
- Towards a Science of AI Agent Reliability — Rabanser et al. (2026)
- Agentic AI fault taxonomy — Shah et al. (2026)
- AWS — Evaluating AI agents: real-world lessons from building agentic systems at Amazon (2026)
- Anthropic — Demystifying evals for AI agents (2026)