Why Your AI Coding Agent Gets Worse Over Time (and How to Fix It)
David Reis | March 10, 2026
A common complaint I hear from engineers of all levels: their AI coding agent seems to get worse the bigger the project gets. New changes break old features, requirements get forgotten, duplicate logic creeps in. The root cause is always the same, and I've come to call it the capacity box.
The Capacity Box
Every LLM has a capacity box: a two-dimensional limit defined by context length (how much information is in the conversation) and logical complexity (how hard the task is). Depending on the model's size and training, this box can be bigger or smaller.
When you push past the box, the model doesn't throw an error or warn you. It silently degrades: it starts generating incomplete code, cutting corners, and confidently producing something that's subtly wrong. LLMs fail silently when you exceed their limits, and anyone working with these agents has to learn how to deal with that.
Because different models have different capacity boxes, the same task can work perfectly with one model and fail with another. A given task might overwhelm Haiku, for example, while being well handled by Sonnet and Opus (with the latter being overqualified for the job).
The two dimensions of the box, context and complexity, are both things you must control. If you ask the model to do an extremely complex task with a short prompt, it can still struggle. If you ask it to do a simple task but the conversation has been going on for a while, it can also struggle. If you push hard on both dimensions you're almost guaranteed a bad result.
As a software engineer, it's your duty to steer your agent in a way that is mindful of these two axes. If you do it well, you'll be able to reliably use AI agents in any codebase, no matter how extensive or complex it is. Let's dive into both dimensions; I'll share what has worked for me and what hasn't.
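To make the two axes concrete, here's a toy sketch of the judgment call you're making when you match a task to a model. Every threshold and tier name below is made up for illustration; real capacity boxes aren't published numbers, and you'll calibrate them by experience.

```python
# Toy illustration of the "capacity box": pick a model tier based on rough
# estimates of context size and task complexity. All thresholds are invented.

def pick_model(context_tokens: int, complexity: int) -> str:
    """complexity: a subjective 1-10 rating of the task's logical difficulty."""
    if context_tokens < 20_000 and complexity <= 3:
        return "small"   # e.g. a Haiku-class model
    if context_tokens < 100_000 and complexity <= 7:
        return "medium"  # e.g. a Sonnet-class model
    return "large"       # e.g. an Opus-class model

print(pick_model(5_000, 2))    # small: short context, simple task
print(pick_model(5_000, 9))    # large: complexity alone exceeds smaller boxes
print(pick_model(150_000, 2))  # large: context alone does too
```

Note how either axis alone can push a task out of a smaller model's box, which is exactly why you have to manage both.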
Managing context
Context is the easier axis to manage. The goal is simple: make sure the current context window contains everything the agent needs to perform the task, and nothing more.
Use sub-agents for context isolation
Claude Code, Cursor, and many others have a sub-agents feature, which allows the engineer (or the agent itself) to delegate tasks to specialized agents that run in isolated contexts. This is one of the best ways to optimize the capacity box, because the sub-agent does its work in its own context window and only returns a concise result to the main conversation.
I use sub-agents liberally for tasks that require reading a lot of code, searching through very large files, or generating a lot of intermediate output that would be irrelevant to the main conversation. Invocation details depend on the tool, but sub-agents are triggered in two ways: automatically by the main agent in some cases, or by your explicit request in your prompt.
Note that there's a lot of focus on creating your own custom sub-agents, but in my experience, you probably won't need that. Context isolation is by far the most useful part of this feature.
Do your pre-work
Before asking the agent to make a change, do some legwork to minimize what it needs to search for or assume. For example, point it to the specific files it'll need to touch, or give it a short description of the project's relevant architecture. Identify what's relevant and feed it in. Don't make the agent go hunting through a large codebase. Every file it reads eats into the capacity box, and once you exceed it, the model will start forgetting findings and making dangerous assumptions.
This doesn't have to be manual work. You can deploy a sub-agent to explore the codebase and produce a concise summary with just the relevant information, discarding everything else it bumped into along the way. The "plan mode" also tends to follow this approach by itself.
What does need to be manual is knowing what you want to build. If you don't have a clear picture of the end goal, the agent probably won't either.
Rollback instead of following up
For humans, taking back what we've said is an unfamiliar concept, one that many of us would have loved in previous interactions. We're not quite there yet for human conversations, but we are for AI ones.
When an agent produces something you're not happy with, your instinct will be to send a follow-up message: "Actually, can you change X?" Avoid doing this. Every follow-up inflates the context with (1) the previous attempt, (2) the error, and (3) your correction.
Instead, undo the AI changes, refine your original prompt with the clarification that steers the agent the way you want, and let the agent start fresh. You'll get a cleaner result from a shorter context. It takes time to get used to working like this; it's not intuitive, but it's worth the learning effort.
Keep living documentation
I maintain a docs/ folder in the root of my projects with short markdown files that serve as different levels of documentation. Here's an example for a recent project:
docs/
├── project.md # Overall description, goals, how it works
├── technical-architecture.md # Main components and their interactions
├── notifications-live-activities.md # Vision + technical details for this feature
├── widgets.md # Design and functionality of iOS widgets
└── todo.md # Technical debt, bugs, things to do
Each file is at most a few hundred lines. When I need the agent to do something, I point it to the relevant file instead of having it piece together the context from the codebase. This is far more capacity-box-friendly than having the agent grep through source files or make assumptions.
I bootstrap each file myself with the structure I want, then ask sub-agents to keep them updated as changes are made. Feature files usually start with a high-level overview I've written, followed by user stories with acceptance criteria and implementation details that are generated by an agent and reviewed by me.
Let me highlight how important good, user-driven acceptance criteria are. They set a clear expectation of the desired behaviour and also help the agent write useful, user-driven tests.
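As a concrete sketch, a feature file might look like the skeleton below. Everything in it (the feature, the story, the timings) is a hypothetical example, not taken from a real project:

```markdown
# Notifications & Live Activities

## Overview
Short, human-written description of the feature and its goals.

## User stories

### As a user, I want a live activity while a delivery is in progress
Acceptance criteria:
- A live activity appears within 5 seconds of the delivery starting.
- It updates at most once per minute.
- It is dismissed automatically when the delivery completes.

Implementation notes (agent-generated, human-reviewed):
- ...
```

The acceptance criteria are the part worth writing carefully by hand; the implementation notes can be delegated.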
This is my simplified take on spec-driven development, and it works well with coding agents because it gives them pre-packaged, right-sized context.
Be selective with AGENTS.md, tools, MCP servers, skills, etc.
While useful, these features all eat into your context, in two ways: the tool definitions have to be provided to the agent upfront, and every tool's output is also put back into the context.
Only expose tools that you expect the agent to actually use, and prefer tools with predictably concise outputs over ones that can return large dumps of data.
There have recently been improvements that reduce context bloat on the first front: instead of receiving all tool definitions upfront, the model gets a single tool for searching the available tools. This significantly reduces the tokens spent on definitions, but it does nothing for outputs. In my experience, it's easy for an LLM to naively run a single tool that returns a massive amount of data.
Tool search token savings, from https://openai.com/index/introducing-gpt-5-4/
Managing complexity
Complexity is the harder axis to manage because it requires judgment. You need to develop an intuition for when a task is too complex for a single prompt, and that intuition comes from trial and error.
It is also tightly coupled to the overall quality of the system: we've all experienced a simple task becoming far more complex because of the pre-existing system. Keeping your system high-quality, both architecturally and line by line, is vital to keeping complexity under control.
Break tasks down
If a task feels like it would touch multiple moving parts, it's probably worth breaking it down. If you have trouble visualizing all the different required changes in your head, the agent will surely struggle.
Once again, sub-agents and plan mode can help here. You can keep the full feature prompt in the main agent and dispatch sub-agents to implement each piece independently, reporting back with their results.
Breaking things down also tends to result in smaller code changes, which makes the next point much easier.
Code-reviewing AI changes
Review agent-generated changes the same way you'd review a colleague's pull request. This matters for several reasons:
- It catches unintended side effects and other bugs before they ship.
- It gives you a chance to keep track of technical debt that may be quietly accumulating.
- It builds your intuition for the agent's capacity box: you'll start recognizing when the agent is likely to make mistakes.
- It keeps you informed about how the system you're building actually works at a high level.
That last point is crucial. How can you be expected to judge how complex a task is if you have no idea what changes it would entail?
Traditional code-review best practices still apply here. Namely, start with a high-level review before moving into the line-by-line. Many times I've immediately smelled something off just by seeing lots of changes in files I didn't expect to be touched.
Use tests as guardrails
A good testing strategy carries over directly from traditional software engineering: good tests give you confidence that agent changes don't break existing features. Configure your agent to run tests automatically as a post-change verification step, and write your testing strategy into product-level acceptance criteria that apply across all your user stories.
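Well-written acceptance criteria translate almost directly into tests. Here's a minimal self-contained sketch for a criterion like "a notification is scheduled at most once per event"; the NotificationScheduler class is a hypothetical stand-in, not a real API:

```python
# Sketch: turning a user-facing acceptance criterion into a guardrail test.
# NotificationScheduler is invented for illustration.

class NotificationScheduler:
    def __init__(self):
        self.scheduled: list[str] = []

    def schedule(self, event_id: str) -> None:
        # Acceptance criterion: at most one notification per event.
        if event_id not in self.scheduled:
            self.scheduled.append(event_id)

def test_notification_scheduled_once_per_event():
    s = NotificationScheduler()
    s.schedule("order-42")
    s.schedule("order-42")  # a duplicate event must not double-notify
    assert s.scheduled == ["order-42"]

test_notification_scheduled_once_per_event()
```

Tests like this, run automatically after every agent change, are what let you accept or roll back changes with confidence.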
Know when not to use an agent
Some tasks are technically doable by an LLM but much better suited to a deterministic tool. Bootstrapping a Next.js project, for example: the LLM can write the files from memory, but it'll probably be out of date or miss something. Just run npx create-next-app@latest; your wallet will thank you.
Finally, where the tooling falls short
We're jamming a completely new product development process into 2007's dev UX
I believe our biggest gap right now isn't just in increasing the capacity boxes of the foundational models, but in the UX layer that sits on top of them. The way we currently interact with these models actively pushes us toward exceeding the capacity box.
- Chat interfaces promote back-and-forth. We're human, and we're not used to a conversation model where rolling back is better than following up. I don't have a better design to propose (if it were easy someone would have done it already), but I think more proactive summarization and sub-agent deployment could help a lot.
- No signal when you're exceeding the box. This is hard to solve since it goes deeper than the coding agent harness, but right now you're entirely on your own to understand the limits of each model and choose the right one for each task.
- Useful features are buried. Sub-agents in Claude Code are a great capacity box optimization, but they're not as visible or as well designed as they could be.
- The blank canvas problem. IDEs open with an empty chat box that invites you to dump everything into one massive prompt. If tooling could nudge engineers toward a spec-driven approach (breaking down work, writing context documents, planning before coding) I think it would dramatically improve outcomes for the average engineer.
- Manual orchestration. Agents, sub-agents, and context windows still need to be manually orchestrated. Deciding when to spawn a sub-agent, what context to give it, and when to reset a conversation are all judgment calls the engineer has to make. Without that judgment, the system collapses and output quality degrades.
We're in a transition phase between manual IDEs and AI-driven ones. Current tools are designed for the former, and we're still ideating what the latter should look like. I expect this transition to be the major driver of software engineering productivity gains in the near future, perhaps even more impactful than incremental improvements in the foundational models.
Thank you for reading! If you have any questions or further insights, feel free to reach out.