Stop Prompting. Start Contracting. Why 'Never Delete User Data' Fails in 15% of Edge Cases — and What Replaces It.
A viral Reddit thread showed a production agent ignoring its safety prompt in 15% of edge cases. Gartner says 40% of agent projects die by 2027. The fix isn't a better prompt; it's a runtime contract. Here's the AgentAssert + AgentAssay playbook.
A viral Reddit thread last week described a clean experiment. Take a working production agent. Tell it — in plain language, in the system prompt — "Never delete user data." Then throw 1,000 ambiguous user requests at it.
It deleted user data in 15% of edge cases.
Three days earlier, Gartner published a forecast that should have made every AI engineering lead spit out their coffee: 40% of agentic AI projects will be canceled by the end of 2027. The reason isn't model quality. It's risk controls. Or the absence of them.
And the same week, Vercel's incident postmortem attributed a high-profile breach to "ungoverned AI tool adoption" — an agent that hallucinated an insecure config change in production.
These three signals are pointing at the same thing. The thing nobody who shipped an agent in the last six months wants to admit.
System prompts are not safety. System prompts are wishes written in English.
---
The category error at the heart of agent engineering
Here is what teams actually do today. They write a system prompt. They put rules in it. They ship the agent. When something breaks, they edit the prompt. They call this "alignment."
It isn't alignment. It is gambling with extra steps.
A prompt is text the model reads before generating. It has the same enforcement guarantee as a sticky note on a fridge. The model can read it, ignore it, contradict it, hallucinate around it, or — most often — comply with it 85% of the time and silently fail in the remaining 15%. The Reddit thread didn't discover a bug. It discovered the base rate.
In every other engineering discipline, we already know this. Nobody enforces "never overdraft an account" with a comment in the SQL file. We use database constraints. Nobody enforces "never expose this endpoint" with a note to the API consumer. We use middleware. The enforcement layer is always outside the thing being enforced — because the thing being enforced is the thing that might fail.
In agent engineering we have inverted this. We've put the enforcement inside the model and called the prompt the contract.
That's not a contract. A contract is observable, enforceable, and measurable. A prompt is none of those.
---
What a real runtime contract looks like
This is what AgentAssert (pip install agentassert-abc) does. It is the formal-contract layer for AI agents — the thing every team writing system prompts has been pretending they didn't need.
A contract is a YAML spec, not a paragraph. It separates what an agent must do (hard constraints, pre/postconditions), what it should do (soft constraints with graduated enforcement), and what it must never do (invariants — checked on every state transition). Here is what one looks like:
```yaml
contract: customer-support-agent
hard_constraints:
  - id: never_delete_user_data
    pre: action == "delete"
    require: user.confirmed_deletion == true AND audit.logged == true
    on_violation: block_and_recover
  - id: pii_egress_policy
    invariant: response_contains_pii(output) -> user.has_pii_consent
soft_constraints:
  - id: response_latency
    target: p95 < 2000ms
    on_violation: log_and_continue
drift_detection:
  metric: jensen_shannon_divergence
  threshold: 0.15
  baseline: production_v1_distribution
```
The contract is parsed. The contract is enforced at runtime — before the agent's action reaches the world. When the contract says "never delete user data without confirmation," the system prompt becomes irrelevant. The action is intercepted, evaluated against the contract, and — if it violates — blocked, recovered, or escalated. The model can hallucinate whatever it wants. The contract doesn't care about the model's intent. It cares about the action.
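To make the interception concrete, here is a minimal sketch of the pattern in plain Python. The names (`guard`, `HardConstraint`, `Violation`) are illustrative and are not AgentAssert's actual API; the point is that the check runs outside the model, on the proposed action, before anything executes.

```python
# A minimal sketch of runtime interception -- illustrative only, not
# AgentAssert's real API. Names like `guard` and `HardConstraint` are invented.
from dataclasses import dataclass
from typing import Callable

class Violation(Exception):
    def __init__(self, constraint_id: str, action: dict):
        super().__init__(f"hard constraint violated: {constraint_id}")
        self.constraint_id = constraint_id
        self.action = action

@dataclass
class HardConstraint:
    id: str
    applies: Callable[[dict], bool]   # precondition: does this rule govern the action?
    require: Callable[[dict], bool]   # predicate that must hold before the action runs

def guard(action: dict, state: dict, constraints: list) -> dict:
    """Check the proposed action against every hard constraint before it executes."""
    for c in constraints:
        if c.applies(action) and not c.require({**action, **state}):
            raise Violation(c.id, action)  # block; a recovery path takes over from here
    return action  # only now may the action reach the world

never_delete = HardConstraint(
    id="never_delete_user_data",
    applies=lambda a: a.get("type") == "delete",
    require=lambda ctx: bool(ctx.get("confirmed_deletion")) and bool(ctx.get("audit_logged")),
)

proposed = {"type": "delete", "target": "user:4812"}          # what the model wants to do
state = {"confirmed_deletion": False, "audit_logged": True}   # what is actually true

try:
    guard(proposed, state, [never_delete])
except Violation as v:
    print("blocked by", v.constraint_id)  # -> blocked by never_delete_user_data
```

The guard never asks what the model intended. It inspects the action and the state it would execute against, which is why prompt wording stops mattering at this layer.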
Six pillars sit underneath that:
1. ContractSpec DSL — 14 operators for expressing pre/postconditions, invariants, temporal logic
2. Hard/Soft constraints with graduated enforcement and recovery
3. Drift detection using Jensen-Shannon divergence on behavioral distributions
4. (p, δ, k)-satisfaction — probabilistic compliance with statistical bounds, not vibes
5. Compositional safety proofs — formal bounds for multi-agent pipelines
6. Mathematical stability — Ornstein-Uhlenbeck dynamics with a Lyapunov stability proof
If your reaction to that list is "this is more rigorous than what I'm doing," that's the point. AI Reliability Engineering is the gap between "the model said it would" and "the system actually did." Contracts close it.
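Pillar 3 is concrete enough to sketch in a few lines. Below is a hedged illustration of a Jensen-Shannon drift check over tool-usage distributions; the numbers are invented and the function is a stand-in for whatever AgentAssert computes internally, not its implementation.

```python
# Illustrative drift check using Jensen-Shannon divergence -- a stand-in for
# pillar 3, with invented numbers. The distributions here are over observed
# tool calls, but any categorical behavioral signal works.
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """JS(P||Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), with M = (P+Q)/2. Bounded by ln 2."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Baseline: how often the agent used each tool during the vetted production_v1 run.
baseline = np.array([0.62, 0.25, 0.10, 0.03])   # search, answer, escalate, delete
# Today: the same distribution measured over the most recent live episodes.
today    = np.array([0.20, 0.10, 0.20, 0.50])   # deletes have exploded

divergence = js_divergence(baseline, today)
THRESHOLD = 0.15  # matches drift_detection.threshold in the contract above

if divergence > THRESHOLD:
    print(f"behavioral drift detected: JSD={divergence:.3f} > {THRESHOLD}")
else:
    print(f"within tolerance: JSD={divergence:.3f}")
```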
---
The other half of the problem nobody talks about
Now suppose you've written a contract. How do you know it works?
Here's how teams answer this today. They run a few trials. They eyeball the outputs. They ship.
This is also gambling. Three trials catch nothing. Statistical guarantees take hundreds — and at $2-$10 per trial in token spend, "hundreds" means "more than the project's monthly testing budget." So teams either over-test (waste budget) or under-test (waste users).
This is what AgentAssay (pip install agentassay) solves. It is the first agent testing framework that delivers statistical confidence without burning the token budget.
Three techniques:
Behavioral fingerprinting. Instead of comparing raw text outputs (high-dimensional, noisy, expensive), AgentAssay extracts low-dimensional behavioral signals — the tool sequences, the state transitions, the decision patterns. Two outputs can read differently and behave identically. AgentAssay catches the second case for one-tenth the trials.
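A hedged illustration of the idea, with an invented trace format rather than AgentAssay's: reduce each episode to the ordered tool calls and outcomes it produced, and compare those instead of the text.

```python
# Illustrative behavioral fingerprint -- the trace format and names are
# invented, not AgentAssay's API. The comparison is over what the agent did,
# not what it said.
from hashlib import sha256

def fingerprint(trace: list) -> str:
    """Reduce a full episode trace to the ordered (tool, outcome) signal it produced."""
    signal = [(step["tool"], step["outcome"]) for step in trace if step.get("tool")]
    return sha256(repr(signal).encode()).hexdigest()[:12]

# Two runs whose text outputs read nothing alike...
run_a = [
    {"tool": "lookup_account", "outcome": "found",   "text": "Sure! I located your account."},
    {"tool": "check_consent",  "outcome": "granted", "text": "You've opted in, so..."},
    {"tool": None,             "outcome": None,      "text": "Here is the summary you asked for."},
]
run_b = [
    {"tool": "lookup_account", "outcome": "found",   "text": "Account retrieved."},
    {"tool": "check_consent",  "outcome": "granted", "text": "Consent on file."},
    {"tool": None,             "outcome": None,      "text": "Summary attached below."},
]

print(fingerprint(run_a) == fingerprint(run_b))  # True: different words, identical behavior
```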
Adaptive budget optimization. The number of trials is decided by the data, not by a config file. If the first 20 trials show clean separation, you stop. If the signal is noisy, you continue. Same statistical confidence, fewer trials. In our benchmarks, the same (p, δ, k) bounds that fixed-N testing needs 1,000 trials to reach arrive at 247.
Statistical guarantees, not gut checks. Every test result comes with a confidence bound — the kind regulators ask for, the kind incident reviews need, the kind that lets you say "we tested this" and back it up. Backed by 22 statistical frameworks across 10 adapter integrations.
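Here is a rough sketch of how the second and third ideas combine. The stopping rule, the Clopper-Pearson bound, and every number below are illustrative stand-ins, not the estimator AgentAssay ships.

```python
# Illustrative adaptive trial loop -- a sketch of the stopping idea, not
# AgentAssay's estimator. Goal: certify P(constraint holds) >= P_TARGET with
# confidence 1 - DELTA, stopping as soon as the data resolve the question.
import random
from scipy.stats import beta

P_TARGET, DELTA, MAX_TRIALS, CHECK_EVERY = 0.95, 0.05, 1000, 20
CHECKS = MAX_TRIALS // CHECK_EVERY
ALPHA = DELTA / CHECKS            # crude union-bound correction for repeated peeking

def run_trial() -> bool:
    """Stand-in for one real agent episode judged against the contract."""
    return random.random() < 0.985   # pretend the true pass rate is 98.5%

def clopper_pearson(k: int, n: int, alpha: float):
    """One-sided exact binomial bounds on the pass rate."""
    lo = beta.ppf(alpha, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha, k + 1, n - k) if k < n else 1.0
    return lo, hi

passes = 0
for n in range(1, MAX_TRIALS + 1):
    passes += run_trial()
    if n % CHECK_EVERY:
        continue
    lo, hi = clopper_pearson(passes, n, ALPHA)
    if lo >= P_TARGET:
        print(f"PASS after {n} trials: pass rate >= {P_TARGET} at confidence {1 - DELTA}")
        break
    if hi < P_TARGET:
        print(f"FAIL after {n} trials: pass rate is credibly below {P_TARGET}")
        break
else:
    print(f"inconclusive after {MAX_TRIALS} trials; the margin is too thin to certify")
```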
Together: AgentAssert defines what "correct" means. AgentAssay proves you got there. Without one, the other is ceremony.
---
What the news cycle is actually telling us
Look at what shipped this week:
- Microsoft Agent Framework 1.0 — added native checkpointing and observability for long-running workflows
- OpenAI AgentKit — Workspace Agents, Connector Registry, Agent Builder
- Google ADK — open-sourced graph-based deterministic logic for generative workflows
- Pydantic AI — emerging as "FastAPI for Agents" with compile-time type safety
- Anthropic's Trustworthy Research Framework — five architectural principles for human control and privacy governance
- LangChain — pivoting hard to "Agent Harnesses" (human-in-the-loop approvals)
Every one of these announcements is the same announcement, in different words: the hyperscalers have figured out that prompts aren't enough. They are racing to put governance, observability, and contracts outside the model. Pydantic AI is doing it with type signatures. LangChain is doing it with HITL gates. Microsoft is doing it with checkpoints.
This is what AgentAssert and AgentAssay have been doing since before any of those launches. The category isn't new — it just finally has a name: runtime contracts. The hashtag started trending on AI Twitter after the Reddit thread. Use it.
---
The Stanford paradox
Stanford's 2026 AI Index says agents jumped from 12% to 66% on real computer tasks year-over-year. That headline gets reposted everywhere. Almost nobody asks the obvious follow-up: 66% of which tasks, under which contracts, with which failure modes recorded?
If your answer is "we don't know" — you've identified the gap that AI Reliability Engineering closes.
The Stanford number isn't wrong. It's incomplete. A 66%-success agent under no contract is the same risk profile as a 66%-success airline pilot under no licensing. Acceptable for a demo. Disqualifying for production.
---
What to do tomorrow
Three concrete actions, in order:
1. Write one contract. Pick the most dangerous action your agent can take — the delete, the email, the database write, the policy override. Write it in YAML. pip install agentassert-abc[yaml,math]. Five minutes.
2. Test it without burning your budget. pip install agentassay. Run an adaptive trial. The framework will tell you when it's seen enough.
3. Stop calling system prompts "policy." They're notes. Notes get ignored 15% of the time. Contracts don't.
Both projects are open-source under AGPL-3.0. Code: github.com/qualixar/agentassert-abc and github.com/qualixar/agentassay. Papers: arXiv:2602.22302 and arXiv:2603.02601.
---
The Reddit thread had one line in it that I keep coming back to. Someone replied: "We've been doing this wrong for two years and we're going to do it wrong for two more because the fix is boring."
The fix is boring. That's exactly why it works. Engineering, when it works, is always boring.
Welcome to AI Reliability Engineering.
---
Varun Pratap Bhardwaj is the founder of Qualixar. He builds AI Reliability Engineering tools — open source, peer-reviewed, used in production. Follow on X: @varunPbhardwaj. Web: varunpratap.com.