Field guide
Guide 01 / 10The core argument4 min

The Gulfs Have Flipped: Why Evaluation Is the New Foundation of Product Design

The core argument. Execution collapsed, evaluation became the bill, and verification design is now the moat.

Don Norman gave interface design its two most durable ideas: the gulf of execution and the gulf of evaluation. For forty years, the first got almost all the attention — and deserved it.

Then AI flipped that balance.

Go deeper: the two gulfs, defined

The gulf of execution is the gap between what a user wants and their ability to make the system do it — every menu, mode, and seventeen-step flow lives here. The gulf of evaluation is the gap between what the system did and the user's ability to tell whether it did it well. Pre-AI software demanded that humans translate intent into machine operations, so designers rightly spent careers shrinking the execution side. Evaluation was usually cheap: you clicked "bold," the text got bold, and the feedback loop confirmed itself.

Execution is collapsing toward zero

When a user can type — or say — what they want and the system produces it, there's little left to execute. No commands to memorize. No interface to master.

Execution design isn't dead. Someone still has to make intent expressible: clear affordances for what the system can and can't do, graceful ways to refine a request. But it's no longer where products live or die.

Evaluation is now the whole bill

Here's the trade nobody priced in: every unit of execution effort the AI absorbs is reissued to the user as evaluation effort.

  • The system writes the code in a second. The user spends an hour deciding whether it's safe to ship.
  • The assistant drafts the email instantly. The user rereads it three times, wondering if the tone lands wrong with a client.
  • The analysis appears in moments. The user has no idea whether the numbers underneath are real.

And when a product gives users no support for that judgment, they don't slow down and verify. They leave. Early distrust doesn't produce complaints — it produces silent, immediate churn.

Example

same capability, opposite outcomes

Before: "Q3 churn was driven primarily by onboarding friction." A paragraph of confident prose, no sources. The user's only options are blind trust or redoing the analysis by hand — and either way, they've stopped trusting the product.

After: The same finding, plus the three data sources it drew from (linked), the assumption it made ("excluding trial accounts"), and a note: "Confidence is high on the friction finding, low on the pricing effect — only 40 accounts in that segment." Now the user can judge in thirty seconds.

That second version is a bridged evaluation gulf. It's slower to build, and it's the one that gets a second session.

The flip, side by side

Pre-AI product designAI-era product design
Ease of use was the moatEase of verification is the moat
Feedback confirmed actionsFeedback must justify outputs
Errors were user errors to preventErrors are system errors to expose and recover
Success = user masters the interfaceSuccess = user can confidently delegate
Trust accrued slowly through reliabilityTrust is won or lost in the first sessions
Go deeper: why probabilistic systems can't earn trust passively

A deterministic interface earns trust passively: it does the same thing every time, and predictability does the work. A probabilistic system cannot earn trust that way, because the same input can produce different output. Trust has to be designed deliberately — into the reasoning the system shows, the sources it cites, the confidence it signals, and the undo it guarantees.

Evaluation is the foundation, not the polish

Evaluation surfaces are not a layer you add after the feature works. They are the feature.

  • A recommendation without visible assumptions asks users to trust judgment they can't inspect.
  • A summary without sources asks them to gamble their reputation on the system's homework.
  • An action without preview or undo asks them to accept irreversible consequences from a system they've known for four minutes.

Each is an evaluation gulf left unbridged — and each one converts a curious trial user into a quiet ex-user.

Try this: take your product's most important AI output and ask five questions. Where did the data come from — can the user see it? What did the system assume? How confident is it, and does the interface say so? What happens if the user says no? How do they take it back if they said yes too fast? Every "they can't" is a gulf you're asking users to cross alone.

Execution design made products usable. Evaluation design makes them trustworthy. In a market where anyone can ship the same capability in a weekend, trustworthy is the only version of usable that still compounds.

Next guide

02Trust Is Not a Feeling. It's a Surface You Design.

The four trust patterns (evidence, honest confidence, reversibility, graceful recovery), progressive trust, and a 20-minute trust audit.