
How We Stopped Our AI Agents From Getting Dumber Mid-Session

Our AI agents were degrading halfway through complex tasks — writing code that contradicted decisions from 30 minutes earlier. Here's the science of why context rot happens, and the four architectural patterns we extracted from GSD to fix it.

March 7, 2026 · by Joan
Tags: context-engineering, fresh-subagents, atomicity, agent-architecture, multi-agent, patterns, gsd, hww-1.6

"Task 50 should have the same quality as Task 1. If it doesn't, your architecture is wrong."

The Moment I Noticed

I watched Atlas — the founder agent on our Halo investment tracker — build a perfect Prisma database schema. Clean types, proper relations, sensible defaults. Forty minutes later, in the same session, he wrote a UI component that completely ignored the schema he'd just created. Different field names. Wrong relationships. A portfolioId where the schema used holdingGroupId.

Same session. Same agent. Same context window.

What happened?

Context Rot Is Real

Every AI model operates within a context window — a fixed amount of text it can hold in working memory. Claude models offer 200k tokens; GPT-4 Turbo offers 128k. Within that window, the model reasons over everything: your instructions, the codebase it's read, the code it's written, the conversation so far.

Here's what nobody tells you about that window: quality degrades as it fills.

  • 0–30% utilization: Peak quality. The model has room to reason. Decisions are coherent. Code is clean.
  • 30–50%: Still good, but the model starts taking shortcuts. Less thorough error handling. Fewer edge cases considered.
  • 50–70%: Rushing. The model "knows" it's running out of room (not consciously, but the compression effects are measurable). It starts cutting corners — skipping validation, reusing patterns even when they don't fit.
  • 70%+: Hallucinations spike. The model starts contradicting its own earlier decisions. It forgets constraints it established 20 minutes ago. This is where Atlas built a UI that ignored his own schema.
Quality degrades as the context window fills — not linearly, but sharply after 50%.
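You can think of it as a rough step function. The thresholds come from the list above; the mapping itself is illustrative, not a measured model:

```python
def quality_band(used_tokens: int, window: int = 200_000) -> str:
    """Map context utilization to the rough quality bands described above.

    Thresholds are the article's observations; the function is a sketch.
    """
    pct = used_tokens / window
    if pct < 0.30:
        return "peak"                # full headroom, coherent decisions
    if pct < 0.50:
        return "shortcuts"           # thinner error handling, fewer edge cases
    if pct < 0.70:
        return "cutting-corners"     # skipped validation, forced pattern reuse
    return "hallucination-risk"      # contradicts its own earlier decisions

# 40k tokens in a 200k window is 20% utilization — still in the peak band.
quality_band(40_000)
```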

This isn't theoretical. We see it in every project. Late-session code is measurably worse than early-session code. The same agent that writes elegant, well-tested code at the start of a session produces sloppy, assumption-heavy code at the end.

I was essentially asking my best engineer to code for 8 hours straight without a break, without notes, holding the entire codebase in their head — and then wondering why hour 7 looked different from hour 1.

How We Used to Work

Our agents — Atlas on Halo, Hopper on Commish Command — would receive a feature and work it in a single long session. Sometimes an hour. Sometimes two. The session would start clean: read the state, understand the requirements, plan the approach.

By the end, the agent was writing code that contradicted decisions made 30 minutes earlier. Not because it was a bad agent. Because the context window was full, and the early decisions — the schema choices, the architectural patterns, the naming conventions — were being compressed and degraded to make room for the latest code.

We were burning through context like it was infinite. It isn't.

The Fix: Four Patterns That Changed Everything

We extracted four patterns from the GSD framework — an open-source spec-driven development system for Claude Code by Lex Christopherson. These patterns solve the biggest quality gap in multi-agent systems: context degradation in long sessions.

We codified them as HWW-1.6 — the latest version of our "How We Work" operating standard. Here's what changed.

Pattern 1: Fresh Subagent Contexts

This is the single biggest architectural win.

Instead of one agent working a complex feature in one long session, the lead agent decomposes the work into tasks and spawns a fresh subagent for each one. Each subagent gets a clean 200k context window. It reads only what it needs — the relevant schema, the specific component, the acceptance criteria for its task — and builds with full headroom.

The orchestrating agent stays at 30–40% context utilization. It plans and coordinates. The subagents build. Task 50 has the same quality as Task 1 because Task 50 runs in a context window that's just as clean as Task 1's was.

The orchestrator plans at low utilization. Subagents build in fresh contexts.

The analogy: instead of one developer working an 8-hour marathon, you have a project lead handing well-scoped tasks to fresh developers who each work a focused 45-minute sprint. The project lead never loses the big picture. The developers never lose quality.
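Here's a minimal sketch of the shape of this pattern. The `Task` structure and `run_subagent` function are hypothetical names — in practice the subagent call is a fresh model session, stubbed out here:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    context_files: list[str]        # only what this task needs to read
    acceptance_criteria: list[str]  # max 2-3 per task

def run_subagent(task: Task) -> str:
    """Hypothetical: start a brand-new session (fresh 200k window),
    feed it only the task's scoped context, and return its output."""
    prompt = (
        f"Task: {task.name}\n"
        f"Read only: {', '.join(task.context_files)}\n"
        f"Done when: {'; '.join(task.acceptance_criteria)}"
    )
    return prompt  # stand-in for an actual model call

def orchestrate(tasks: list[Task]) -> list[str]:
    # The orchestrator never reads the codebase itself; it only plans
    # and dispatches, so its own context stays at 30-40% utilization.
    return [run_subagent(t) for t in tasks]
```

The key design point: the orchestrator passes file *names*, not file *contents*. Each subagent does its own reading inside its own fresh window.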

How to steal this: Tell your Claude Code or Cursor agent to use subagents for multi-step work. Break complex features into independent tasks and let each one run in a fresh context. Or install GSD (npx get-shit-done-cc@latest) to get this behavior automatically.

Pattern 2: Aggressive Atomicity

Fresh subagents only work if each task is small enough to fit comfortably in a clean context window. Our rule: no task should consume more than ~50% of a fresh context window. In practice, that means:

  • Max 2–3 acceptance criteria per task
  • ~30 minutes of focused agent work
  • Each task gets its own atomic git commit

Before this pattern, a task looked like: "Build the portfolio dashboard." That's a session-killer. The agent reads the entire codebase, builds the data layer, then the API, then the UI, then the tests — and by the time it gets to tests, it's forgotten constraints from the data layer.

After: the same feature becomes 6–8 atomic tasks. "Create the holdings data model." "Build the API endpoint for portfolio summary." "Implement the holdings table component." "Add sorting and filtering." Each one independently testable. Each one its own atomic commit, so a regression can be pinpointed with git bisect and reverted cleanly.
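A rough sketch of the atomicity rule as a gate — the token estimate and function name are my own framing, not GSD internals:

```python
def is_atomic(acceptance_criteria: list[str],
              estimated_tokens: int,
              window: int = 200_000) -> bool:
    """Check a task against the atomicity rules above: at most 3
    acceptance criteria, and a footprint under ~50% of a fresh
    context window. How you estimate tokens is up to you."""
    return len(acceptance_criteria) <= 3 and estimated_tokens <= window // 2
```

If a task fails the check, split it and check again — the decomposition loop terminates when every task passes.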

How to steal this: If your prompt to an AI agent is longer than a paragraph, you're probably asking too much. Decompose until each task is obvious. If you can't describe the acceptance criteria in three bullet points, the task is too big.

Pattern 3: The Discuss Phase

Here's the productivity killer nobody talks about: an agent blocks mid-sprint to ask a design question. Or worse — it guesses wrong and builds something reasonable but not what you wanted.

The discuss phase happens before building starts. The lead agent reviews the requirements, identifies gray areas, and asks the human about them upfront:

  • Visual features: "Cards or table rows? Dark mode support?"
  • API design: "REST or RPC? Error response format?"
  • Data modeling: "Nullable fields? Cascade deletes?"
  • UX decisions: "Pagination or infinite scroll? Mobile-first?"

All human decisions get captured in a CONTEXT.md file that feeds into task planning. By the time the agent starts building, there are no open questions. No blocking. No wrong guesses.
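A CONTEXT.md capturing the discuss-phase answers above might look something like this (the format is illustrative, not GSD's exact layout):

```markdown
# CONTEXT.md — Portfolio Dashboard

## Decisions (from discuss phase)
- Layout: table rows, not cards
- Dark mode: yes, follow system preference
- API style: REST; errors as { "error": { "code", "message" } }
- Holdings.notes: nullable; deletes cascade from Portfolio
- Pagination: infinite scroll; mobile-first

## Open questions
(none — building may start)
```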

This is the pattern that surprised me most. I assumed the agents should just figure it out — they're smart, they can make reasonable decisions. And they can. But "reasonable" isn't the same as "right." The discuss phase eliminated an entire category of rework: code that was well-built but wrong because it assumed something the human would have answered differently.

How to steal this: Before giving an AI agent a big task, spend 5 minutes answering "What are the gray areas?" Write your preferences down. The agent will build exactly what you imagined instead of something reasonable but wrong.

Pattern 4: Goal-Backward Verification

The old way to verify a feature: "Did we complete all the tasks?" This is the wrong question. Tasks can be "complete" while the feature is broken.

The new way: "What must be TRUE for this to work?"

Instead of checking task completion, you define observable behaviors that must hold:

  • ❌ "Did we build the sort function?" → checks effort, not outcome

  • ✅ "Does clicking the column header sort the data ascending on first click and descending on second click?" → checks what the user actually experiences

  • ❌ "Did we add the API endpoint?" → checks existence, not correctness

  • ✅ "Does GET /api/holdings return the correct portfolio with calculated totals within 200ms?" → checks behavior

The verification criteria get written before building starts (during the discuss phase). The agent tests against them after building finishes. If a behavior fails, the task isn't done — regardless of what the code looks like.
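Behavior checks like the ✅ examples above translate directly into code. A sketch, assuming a generic HTTP test client with a `.get()` method — the endpoint and field names echo the examples, nothing more:

```python
import time

def verify_holdings_endpoint(client) -> list[str]:
    """Goal-backward checks: observable behaviors that must be TRUE,
    written before building starts. Returns a list of failures;
    empty list means the task is actually done."""
    failures = []
    start = time.monotonic()
    resp = client.get("/api/holdings")
    elapsed_ms = (time.monotonic() - start) * 1000
    if resp.status_code != 200:
        failures.append("GET /api/holdings did not return 200")
    if elapsed_ms > 200:
        failures.append(f"response took {elapsed_ms:.0f}ms (budget: 200ms)")
    if "totals" not in resp.json():
        failures.append("response is missing calculated totals")
    return failures
```

Note what this does *not* check: whether an endpoint file exists, whether a function was written. Only behavior.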

How to steal this: After your AI agent finishes a feature, write down 3–5 things that must be true. Test those. Don't just check if the files exist.

The Bonus: Wave Execution

Once you have atomic tasks with clear dependencies, a natural optimization emerges: independent tasks can run in parallel.

Wave 1 handles all tasks with no dependencies — the data models, the utility functions, the independent components. Wave 2 handles tasks that depend on Wave 1 outputs. Wave 3 builds on Wave 2. Each wave runs its tasks concurrently.

Wave 1: [Task A] [Task B] [Task C]  ← no dependencies, run in parallel
              ↓       ↓
Wave 2:    [Task D] [Task E]         ← depend on A and B
                    ↓
Wave 3:          [Task F]            ← depends on D and E
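The wave plan falls out of the dependency graph mechanically — a Kahn-style layering. A minimal sketch (the function name is mine, not GSD's):

```python
def plan_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves: a task joins a wave once all of its
    dependencies have completed in an earlier wave."""
    done: set[str] = set()
    remaining = dict(deps)
    waves: list[list[str]] = []
    while remaining:
        # Every task whose dependencies are all done is ready to run.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves

# The graph from the diagram above:
deps = {
    "A": set(), "B": set(), "C": set(),
    "D": {"A"}, "E": {"B"},
    "F": {"D", "E"},
}
plan_waves(deps)  # -> [['A', 'B', 'C'], ['D', 'E'], ['F']]
```

Everything inside a wave is safe to run concurrently; the waves themselves run in order.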

This is how you go from "my agent works serially" to "my agents work like a team." A feature that took one agent 90 minutes in a single degrading session now takes 30 minutes across three waves of fresh subagents.

What We Kept From Our Old Model

I want to be honest about something: we didn't install GSD wholesale. GSD is designed for solo developers running Claude Code. We run a multi-agent startup studio with specialized agents — Atlas is not Nova is not Jenny. They have different skills, different prompts, different domain knowledge.

What we did was extract the patterns and wire them into our existing orchestration. Fresh subagent contexts work within our builder-who-triages model. Aggressive atomicity complements our FEATURES.yaml contracts. The discuss phase integrates with our session startup protocol. Goal-backward verification strengthens our existing testing patterns.

Fresh subagents + our named-role specialization = better than either alone. GSD gives us the context engineering. Our model gives us the domain expertise. The combination is what makes Task 50 as good as Task 1 and as architecturally coherent as if one expert had planned the whole thing.

Full credit to Lex Christopherson and the GSD framework. We extracted patterns; he built the system that proved them. The GSD repo is worth reading even if you don't use Claude Code.

What Changed

We're rolling this out across all projects now. The early results are clear even without formal metrics:

  • Late-session code quality matches early-session quality. No more "hour 7 syndrome."
  • Features ship in smaller, revertable commits instead of monolithic session dumps.
  • Design decisions are captured before building starts, not discovered during review.
  • Agents ask better questions upfront and make fewer wrong assumptions.

I'll update this post with before/after data once we have enough sessions to measure properly. But the qualitative difference is obvious to anyone who's watched an agent degrade at minute 45 of a long session — and then seen the same agent produce clean, coherent code at minute 45 because it's working in a fresh context window.

The Takeaway

If you're building with AI agents, the single biggest upgrade you can make is this: stop asking one agent to do everything in one session.

Decompose the work. Spawn fresh contexts. Capture decisions before building. Verify behaviors, not tasks. The quality difference is not incremental — it's the difference between an agent that degrades and an agent that doesn't.

Context engineering isn't a nice-to-have. It's the foundation that makes everything else work.


This is the latest evolution of the MonkeyRun operating model — HWW-1.6. It builds on our builder-who-triages pattern and file-based coordination system. See The Model for how the full system works, or Patterns for more battle-tested learnings.

The GSD framework by Lex Christopherson is open source: github.com/gsd-build/get-shit-done. If you're using Claude Code, it's worth trying directly.