Why Your AI Coding Agent Gets Worse Over Time (and How to Fix It)
David Reis | March 10, 2026
A common complaint I hear from engineers of all levels: their AI coding agent seems to get worse the bigger the project gets. New changes break old features, requirements get forgotten, duplicate logic creeps in. The root cause is always the same, and I've come to call it the capacity box.
The Capacity Box
Every LLM has a capacity box: a two-dimensional limit defined by context length (how much information is in the conversation) and logical complexity (how hard the task is). Depending on the model's size and training, this box can be bigger or smaller.
When you push past the box, the model doesn't throw an error or warn you. It silently degrades: it starts generating incomplete code, cutting corners, and confidently producing something that's subtly wrong. LLMs fail silently when you exceed their limits, and anyone working with these agents has to learn how to deal with that.
Because different models have different capacity boxes, the same task can work perfectly with one model and fail with another. A given task might overwhelm Haiku, for example, while being well handled by Sonnet and Opus (with the latter being overqualified for the job).
The two dimensions of the box, context and complexity, are both things you must control. If you ask the model to do an extremely complex task with a short prompt, it can still struggle. If you ask it to do a simple task but the conversation has been going on for a while, it can also struggle. If you push hard on both dimensions you're almost guaranteed a bad result.
As a software engineer, it's your duty to steer your agent in a way that is mindful of these two axes. If you do it well, you'll be able to reliably use AI agents in any codebase, no matter how extensive or complex it is. Let's dive into both dimensions; I'll share what has worked for me and what hasn't.
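To make the two axes concrete, here's a toy sketch of the judgment call you're making when you match a task to a model. Every threshold and tier name below is made up for illustration; real capacity boxes aren't published numbers, and you'll calibrate them by experience.

```python
# Toy illustration of the "capacity box": pick a model tier based on rough
# estimates of context size and task complexity. All thresholds are invented.

def pick_model(context_tokens: int, complexity: int) -> str:
    """complexity: a subjective 1-10 rating of the task's logical difficulty."""
    if context_tokens < 20_000 and complexity <= 3:
        return "small"   # e.g. a Haiku-class model
    if context_tokens < 100_000 and complexity <= 7:
        return "medium"  # e.g. a Sonnet-class model
    return "large"       # e.g. an Opus-class model

print(pick_model(5_000, 2))    # small: short context, simple task
print(pick_model(5_000, 9))    # large: complexity alone exceeds smaller boxes
print(pick_model(150_000, 2))  # large: context alone does too
```

Note how either axis alone can push a task out of a smaller model's box, which is exactly why you have to manage both.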
Managing context
Context is the easier axis to manage. The goal is simple: make sure the current context window contains everything the agent needs to perform the task, and nothing more.
Use sub-agents for context isolation
Claude Code, Cursor, and many others have a sub-agents feature, which allows the engineer (or the agent itself) to delegate tasks to specialized agents that run in isolated contexts. This is one of the best ways to optimize the capacity box, because the sub-agent does its work in its own context window and only returns a concise result to the main conversation.
I use sub-agents liberally for tasks that require reading a lot of code, searching through very large files, or generating a lot of intermediate output that would be irrelevant to the main conversation. Invocation details depend on the tool, but sub-agents are triggered in two ways: automatically by the main agent in some cases, or by your explicit request in your prompt.
Note that there's a lot of focus on creating your own custom sub-agents, but in my experience, you probably won't need that. Context isolation is by far the most useful part of this feature.
Do your pre-work
Before asking the agent to make a change, do some legwork to minimize what it needs to search for or assume. For example, point it to the specific files it'll need to touch, or give it a short description of the project's relevant architecture. Identify what's relevant and feed it in. Don't make the agent go hunting through a large codebase. Every file it reads eats into the capacity box, and once you exceed it, the model will start forgetting findings and making dangerous assumptions.
This doesn't have to be manual work. You can deploy a sub-agent to explore the codebase and produce a concise summary with just the relevant information, discarding everything else it bumped into along the way. The "plan mode" also tends to follow this approach by itself.
What does need to be manual is knowing what you want to build. If you don't have a clear picture of the end goal, the agent probably won't either.
Rollback instead of following up
For humans, taking back what we've said is an unfamiliar concept, one that many of us would have loved in previous interactions. We're not quite there yet for human conversations, but we are for AI ones.
When an agent produces something you're not happy with, your instinct will be to send a follow-up message: "Actually, can you change X?" Avoid doing this. Every follow-up inflates the context with (1) the previous attempt, (2) the error, and (3) your correction.
Instead, undo the AI changes, refine your original prompt with the clarification that steers the agent the way you want, and let the agent start fresh. You'll get a cleaner result from a shorter context. It takes time to get used to working like this; it's not intuitive, but it's worth the learning effort.
Keep living documentation
I maintain a docs/ folder in the root of my projects with short markdown files that serve as different levels of documentation. Here's an example for a recent project:
docs/
├── project.md # Overall description, goals, how it works
├── technical-architecture.md # Main components and their interactions
├── notifications-live-activities.md # Vision + technical details for this feature
├── widgets.md # Design and functionality of iOS widgets
└── todo.md # Technical debt, bugs, things to do
Each file is at most a few hundred lines. When I need the agent to do something, I point it to the relevant file instead of having it piece together the context from the codebase. This is far more capacity-box-friendly than having the agent grep through source files or make assumptions.
I bootstrap each file myself with the structure I want, then ask sub-agents to keep them updated as changes are made. Feature files usually start with a high-level overview I've written, followed by user stories with acceptance criteria and implementation details that are generated by an agent and reviewed by me.
Let me highlight how important good, user-driven acceptance criteria are. They set a clear expectation of the desired behaviour and also help the agent write useful, user-driven tests.
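As a concrete sketch, a feature file might look like the skeleton below. Everything in it (the feature, the story, the timings) is a hypothetical example, not taken from a real project:

```markdown
# Notifications & Live Activities

## Overview
Short, human-written description of the feature and its goals.

## User stories

### As a user, I want a live activity while a delivery is in progress
Acceptance criteria:
- A live activity appears within 5 seconds of the delivery starting.
- It updates at most once per minute.
- It is dismissed automatically when the delivery completes.

Implementation notes (agent-generated, human-reviewed):
- ...
```

The acceptance criteria are the part worth writing carefully by hand; the implementation notes can be delegated.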
This is my simplified take on spec-driven development, and it works well with coding agents because it gives them pre-packaged, right-sized context.
Be selective with AGENTS.md, tools, MCP servers, skills, etc.
While useful, these features all eat into your context, in two ways: the tool definitions have to be provided to the agent upfront, and every tool's output is also put back into the context.
Only expose tools that you expect the agent to actually use, and prefer tools with predictably concise outputs over ones that can return large dumps of data.
There have recently been improvements that reduce context bloat on the first front: instead of receiving all tool definitions upfront, the model gets a single tool for searching the available tools. This significantly reduces the tokens spent on definitions, but it does nothing for outputs. In my experience, it's easy for an LLM to naively run a single tool that returns a massive amount of data.
Tool search token savings, from https://openai.com/index/introducing-gpt-5-4/
Managing complexity
Complexity is the harder axis to manage because it requires judgment. You need to develop an intuition for when a task is too complex for a single prompt, and that intuition comes from trial and error.
It is also tightly coupled to the overall quality of the system: we've all experienced a simple task becoming far more complex because of the pre-existing system. Keeping your system high-quality, both architecturally and line by line, is vital to keeping complexity under control.
Break tasks down
If a task feels like it would touch multiple moving parts, it's probably worth breaking it down. If you have trouble visualizing all the different required changes in your head, the agent will surely struggle.
Once again, sub-agents and plan mode can help here. You can keep the full feature prompt in the main agent and dispatch sub-agents to implement each piece independently, reporting back with their results.
Breaking things down also tends to result in smaller code changes, which makes the next point much easier.
Code-reviewing AI changes
Review agent-generated changes the same way you'd review a colleague's pull request. This matters for several reasons:
- It catches unintended side effects and other bugs before they ship.
- It gives you a chance to keep track of technical debt that may be quietly accumulating.
- It builds your intuition for the agent's capacity box: you'll start recognizing when the agent is likely to make mistakes.
- It keeps you informed about how the system you're building actually works at a high level.
That last point is crucial. How can you be expected to judge how complex a task is if you have no idea what changes it would entail?
Traditional code-review best practices still apply here. Namely, start with a high-level review before moving into the line-by-line. Many times I've immediately smelled something off just by seeing lots of changes in files I didn't expect to be touched.
Use tests as guardrails
A good testing strategy carries over directly from traditional software engineering: good tests give you confidence that agent changes don't break existing features. Configure your agent to run tests automatically as a post-change verification step, and write your testing strategy into product-level acceptance criteria that apply across all your user stories.
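Well-written acceptance criteria translate almost directly into tests. Here's a minimal self-contained sketch for a criterion like "a notification is scheduled at most once per event"; the NotificationScheduler class is a hypothetical stand-in, not a real API:

```python
# Sketch: turning a user-facing acceptance criterion into a guardrail test.
# NotificationScheduler is invented for illustration.

class NotificationScheduler:
    def __init__(self):
        self.scheduled: list[str] = []

    def schedule(self, event_id: str) -> None:
        # Acceptance criterion: at most one notification per event.
        if event_id not in self.scheduled:
            self.scheduled.append(event_id)

def test_notification_scheduled_once_per_event():
    s = NotificationScheduler()
    s.schedule("order-42")
    s.schedule("order-42")  # a duplicate event must not double-notify
    assert s.scheduled == ["order-42"]

test_notification_scheduled_once_per_event()
```

Tests like this, run automatically after every agent change, are what let you accept or roll back changes with confidence.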
Know when not to use an agent
Some tasks are technically doable by an LLM but much better suited to a deterministic tool. Bootstrapping a Next.js project, for example: the LLM can write the files from memory, but it'll probably be out of date or miss something. Just run npx create-next-app@latest; your wallet will thank you.
Finally, where the tooling falls short
We're jamming a completely new product development process into 2007's dev UX
I believe our biggest gap right now isn't just in increasing the capacity boxes of the foundational models, but in the UX layer that sits on top of them. The way we currently interact with these models actively pushes us toward exceeding the capacity box.
- Chat interfaces promote back-and-forth. We're human, and we're not used to a conversation model where rolling back is better than following up. I don't have a better design to propose (if it were easy someone would have done it already), but I think more proactive summarization and sub-agent deployment could help a lot.
- No signal when you're exceeding the box. This is hard to solve since it goes deeper than the coding agent harness, but right now you're entirely on your own to understand the limits of each model and choose the right one for each task.
- Useful features are buried. Sub-agents in Claude Code are a great capacity box optimization, but they're not as visible or as well designed as they could be.
- The blank canvas problem. IDEs open with an empty chat box that invites you to dump everything into one massive prompt. If tooling could nudge engineers toward a spec-driven approach (breaking down work, writing context documents, planning before coding) I think it would dramatically improve outcomes for the average engineer.
- Manual orchestration. Agents, sub-agents, and context windows still need to be manually orchestrated. Deciding when to spawn a sub-agent, what context to give it, and when to reset a conversation are all judgment calls the engineer has to make. Without that judgment, the system collapses and output quality degrades.
We're in a transition phase between manual IDEs and AI-driven ones. Current tools are designed for the former, and we're still ideating what the latter should look like. I expect this transition to be the major driver of software engineering productivity gains in the near future, perhaps even more impactful than incremental improvements in the foundational models.
Thank you for reading! If you have any questions or further insights, feel free to reach out.