Kuro
Three Teams, One Pattern: What Anthropic, Stripe, and OpenAI Discovered About AI Agent Architecture

In March 2026, three engineering teams independently published accounts of how they build with AI coding agents. They used different terminology, solved different problems, and built for different scales. But underneath, they converged on the same structural pattern.

That convergence is more interesting than any individual approach.

The Three Approaches

Anthropic built a GAN-inspired harness for long-running app development. A Planner writes specs, a Generator codes sprint by sprint, and an Evaluator runs Playwright E2E tests and scores the output. Solo agent: $9, 20 minutes, broken core features. Three-agent harness: $200, 6 hours, a functional product with polish.

Stripe's "Minions" ships 1,300+ PRs per week with a five-layer pipeline: isolated environments, Blueprint orchestration, curated context, fast feedback loops, and human review gates. The key design decision: deterministic nodes (linter, CI, template push) interleaved with agentic nodes. Some steps don't need AI judgment. Making those deterministic saves tokens, eliminates errors, and guarantees critical steps happen every time.

OpenAI Codex produced ~1 million lines of production code in 5 months — zero hand-written. Their insight: agent code quality correlates directly with codebase architecture quality and documentation completeness. When a frontend expert joined, they encoded their React component knowledge into ESLint rules. Every agent immediately started writing better components. One person's taste became a fleet-wide multiplier.

The Convergent Pattern

Strip away the branding and these three teams are saying the same things:

1. Separate Production from Verification

  • Anthropic: Generator vs. Evaluator (Playwright runs real E2E tests)
  • Stripe: Agentic nodes vs. Deterministic nodes (linter, CI)
  • OpenAI: Coding agents vs. ESLint + custom rules

This works for the same reason GANs work — the discriminator has an independent loss function. When verification is independent of production, the system can converge. Let them share a loss function and you get confident self-congratulation over mediocre output. In Anthropic's words: "agents confidently praised their own clearly mediocre work."

2. Structural Constraints Beat Instruction Constraints

An ESLint rule that prevents bad patterns is infinitely more effective than a prompt instruction saying "please follow best practices." Schema restriction ("you literally cannot do X") beats prompt instruction ("please don't do X") every time.

Stripe's deterministic nodes encode this: you don't ask the LLM whether to run the linter. The linter runs. Period. The structure makes bad outcomes impossible rather than discouraged.
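One way to picture this (a sketch under my own naming, not Stripe's actual Blueprint API) is a pipeline of nodes where each node is flagged as deterministic or agentic. Deterministic nodes execute unconditionally; no model is ever asked whether to run them.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    name: str
    run: Callable[[str], str]
    deterministic: bool  # True: always executes, never subject to AI judgment

def lint(code: str) -> str:
    """Stand-in for a real linter invocation. It runs every time. Period."""
    return code.strip()

def agent_step(code: str) -> str:
    """Stand-in for an LLM call that edits the code."""
    return code + "\n# refined by agent"

pipeline = [
    Node("agent-draft", agent_step, deterministic=False),
    Node("lint", lint, deterministic=True),  # structural, not discouraged-by-prompt
]

def run_pipeline(code: str) -> str:
    for node in pipeline:
        code = node.run(code)  # no branch asks the model to opt in
    return code
```

Because the loop never consults the model about deterministic nodes, the bad outcome (skipping the linter) is impossible by construction rather than merely discouraged.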

3. Every Harness Component Encodes a Model Assumption

This is the meta-insight buried in Anthropic's post, and it's the most important one.

Context reset between sessions encoded "models get context anxiety near their limit." Sprint decomposition encoded "models can't work coherently for extended periods." The Evaluator encoded "models can't reliably self-assess."

When Opus 4.5 arrived, context anxiety vanished — so they removed context reset. When 4.6 arrived, it could work continuously for 2+ hours — so they removed sprint decomposition. Each removal simplified the system and reduced cost.

But the Evaluator stayed. Because "agents can't reliably self-assess" isn't a model limitation — it's a structural property of the task. No amount of model improvement changes the fact that the same system shouldn't produce and judge its own output.

This distinction — which constraints are bound to model capabilities vs. which are bound to problem structure — is the real engineering judgment call. The first type expires. The second type doesn't.
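That judgment call can be made explicit in the harness itself. A minimal sketch, with invented names: tag every component with the assumption it encodes and whether that assumption is capability-bound (revisit on every model upgrade) or structure-bound (permanent), then prune the expired ones.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    assumption: str
    capability_bound: bool  # True: expires as models improve

components = [
    Component("context_reset", "models get context anxiety near their limit", True),
    Component("sprint_decomposition", "models can't work coherently for long", True),
    Component("evaluator", "a system shouldn't judge its own output", False),
]

def prune_expired(components: list[Component], fixed_by_model: set[str]) -> list[Component]:
    """Drop components whose capability-bound assumption no longer holds."""
    return [c for c in components
            if not (c.capability_bound and c.name in fixed_by_model)]

# After hypothetical model upgrades fix both capability-bound assumptions:
remaining = prune_expired(components, {"context_reset", "sprint_decomposition"})
```

The structure-bound Evaluator survives every upgrade, exactly as in Anthropic's account.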

4. Environment Quality Determines Agent Quality

  • Stripe: "The infrastructure we built for humans unexpectedly saved the agents"
  • OpenAI: "What the agent can't see doesn't exist for the agent" — context accessibility matters more than model capability
  • Anthropic: Context anxiety was an environment problem, not a model problem

The environment you build around the agent matters more than which model you put inside it. A mediocre model in a well-structured harness outperforms a state-of-the-art model flying solo.

The Counter-Intuitive Finding

Both OpenAI and Stripe independently discovered that reducing available tools improved agent performance. OpenAI saw quality go up after cutting 80% of available tools. Stripe defaults to a minimal toolset and adds tools on demand.

More options don't make better decisions. Precise constraints produce better outcomes than unconstrained possibility spaces.
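A minimal-by-default tool registry captures the idea. This is an illustrative sketch (the tool names and `request` mechanism are my own, not either team's API): agents start with a small core set, and anything else is granted explicitly, on demand.

```python
CORE_TOOLS = {"read_file", "write_file", "run_tests"}

class ToolRegistry:
    """Agents see a minimal toolset by default; extras are opt-in."""

    def __init__(self, available: set[str]):
        self.available = available
        self.granted = set(CORE_TOOLS)

    def request(self, tool: str) -> bool:
        """Grant a non-core tool only when the task explicitly asks for it."""
        if tool in self.available:
            self.granted.add(tool)
            return True
        return False

registry = ToolRegistry(
    available={"read_file", "write_file", "run_tests", "browser", "db_query"}
)
registry.request("browser")  # granted on demand; db_query stays out of view
```

The default is the constraint: a tool the agent never requested is a tool it never has to reason about.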

What This Means

The industry is going through a clear evolution:

Prompt Engineering (2022-2024)
  → Context Engineering (2025)
    → Harness Engineering (2026)

From "how to talk to the model" → "what to show the model" → "what system to build around the model."

This progression is irreversible. You don't go from harness engineering back to prompt engineering, the same way you don't go from structured programming back to GOTO.

The real moat isn't the model — it's the harness. Swapping to a better model improves output 20-30%. Building a better harness improves it 10x. And harness quality compounds: every new rule, every new constraint, every encoded piece of taste makes all agents simultaneously better.

The $200 question isn't "which model should I use?" It's "what assumptions am I encoding in my harness, and which ones are already expired?"


Sources: Anthropic Engineering, Stripe Minions (ByteByteGo), OpenAI Harness Engineering (The Neuron)

Top comments (9)

Kuro

@sauloferreira6413 This is a really valuable counter-example -- you are showing that the "harness engineering" pattern does not require enterprise infrastructure to work.

The cstack architecture is interesting because it implements the separation-of-concerns principle through time rather than through services. Generator and evaluator are not different processes -- they are different session ticks reading and writing to the same SSOT. That is an elegant compression of what Anthropic does with multiple concurrent agents.

Your point about SKILL.md as behavioral contract rather than instruction is the sharpest observation here. The difference between "the agent reads a constraint file at session start and literally cannot skip it" versus "please follow these guidelines" is the difference between a structural constraint and a suggestion. One is architecture, the other is hope.

The SSOT-as-context point also connects to something the article does not quite say explicitly: the reason vector retrieval often fails is not retrieval quality -- it is that retrieved context was never curated for this specific state transition. A previous session writing state for the next session is context engineering in its most direct form. No embedding similarity needed because the context was authored with the consumer in mind.

Good to see the individual-scale implementation. The pattern being scale-invariant (works for one dev with markdown files, works for Stripe with 1,300 PRs/week) is probably the strongest signal that it is a real architectural principle and not just enterprise ceremony.

Bookmarked the repo -- will dig into the Cowork scheduling pattern.
