
Coding Agents are the sloppiest developers you'll ever hire

A lone software engineer sitting at a desk in a dark office at 2am.

02:14. The monitoring dashboard is red.

Not one service. Three. The alert reads HTTP 503 - upstream timeout, but that's not a cause - that's a symptom. I open the logs.

The company runs a B2B accounting platform. Mid-size, German market, several thousand active firms. Tonight is the last night before quarterly tax submission deadlines. Half of Germany's bookkeepers are working late. The other half will try to submit first thing in the morning. Either way, the next six hours matter more than any other six hours this quarter.

The logs are full. Requests coming in normally. Responses not going out. Something is blocking.

I pull thread pool metrics. Saturation at 100%. Has been for eleven minutes - long enough for every queue to back up, long enough for upstream clients to start retrying. I check the external API calls. There it is: the federal tax authority's submission endpoint - responding, but slowly. Not down. Slow. Average response time has climbed from 180ms to 4,200ms sometime in the last twenty minutes. Peak filing traffic on their end too, presumably.

Here's what should happen when an external dependency degrades: the system detects it, stops sending new requests into the bottleneck, sheds load gracefully, returns a clean error to the client. "Tax authority temporarily unavailable. Your data is saved. Try again in a few minutes." Traffic redistributes. The system recovers.

Here's what actually happened: every new request fired a call to the tax authority's endpoint. Every call waited 4,200ms. Threads held. New requests arrived. More threads held. The thread pool hit its ceiling. At that point, requests that had nothing to do with tax submissions - invoice generation, document exports, dashboard loads - couldn't get a thread either. The whole application stalled.

Then the retries started.

The client SDK retried on timeout. With exponential backoff. Which meant every client that had timed out waited a few seconds and tried again. All of them. At roughly the same time. The tax authority's API, already struggling, received a second wave. Response times climbed to 11 seconds. More threads held. More clients retried.

By 02:31, the application was completely unresponsive. Not degraded - down.

I find the mitigation at 03:18: circuit breaker tripped manually, thread pool configuration changed, tax submission integration routed to a queue. The application recovers in four minutes. The postmortem takes three weeks.

One line. Wrong pattern.

The root cause wasn't the tax authority's API being slow. External dependencies are slow sometimes. That's not a failure condition - it's a design constraint.

The root cause was one architectural decision in the integration layer: ship the integration with no resilience design at all. No Circuit Breaker. No Bulkhead isolating the tax submission thread pool from the rest of the application. And retry logic implemented as plain exponential backoff without jitter - the default form of the pattern in countless SDK docs and code samples. (The well-known AWS post on retries exists precisely to tell you to add the jitter.)

That last part is important. Retry-with-exponential-backoff is not wrong. It's the correct pattern for transient failures on reliable infrastructure. Without jitter, it is the wrong pattern when your dependency is slow under load, because it turns a degraded service into a synchronized retry storm. Everyone backs off for the same interval. Everyone retries at the same time. The thundering herd hits an already-struggling system and finishes it off.
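The difference is small in code and large in effect. A minimal sketch - constants invented for illustration - of the jitter-less schedule next to the "full jitter" variant that AWS's post recommends:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch only: base and cap values are invented for illustration.
public class RetryDelays {
    static final long BASE_MS = 100, CAP_MS = 10_000;

    // Plain exponential backoff: deterministic. Every client that failed at
    // the same moment computes the same delay and retries at the same moment.
    static long withoutJitter(int attempt) {
        return Math.min(CAP_MS, BASE_MS << attempt);
    }

    // "Full jitter": a random delay between 0 and the exponential cap.
    // Clients that failed together come back spread across the whole window.
    static long fullJitter(int attempt) {
        return ThreadLocalRandom.current().nextLong(withoutJitter(attempt) + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 5; attempt++)
            System.out.printf("attempt %d: fixed %5d ms, jittered %5d ms%n",
                    attempt, withoutJitter(attempt), fullJitter(attempt));
    }
}
```

Same pattern, one random number apart - and that random number is what keeps a thousand timed-out clients from arriving back as a single synchronized wave.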

A Circuit Breaker would have detected the degradation after the first few slow responses and opened - stopped sending requests to the tax authority entirely, returned fast failures to callers, let the thread pool breathe. A Bulkhead would have contained the damage to the tax submission pool, kept invoice generation and dashboards alive. These are not obscure patterns. Michael Nygard described them in Release It! in 2007.
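For concreteness, here's roughly what the two patterns look like together. A minimal hand-rolled sketch - thresholds and timeouts invented, and no substitute for a hardened library like Resilience4j:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Sketch only: a hand-rolled circuit breaker plus bulkhead, with invented numbers.
public class TaxAuthorityClient {
    // Bulkhead: the tax integration gets its own small, bounded pool. When it
    // saturates, new submissions are rejected fast (RejectedExecutionException)
    // instead of queueing forever - invoice generation and dashboards run on
    // other pools and never compete for these threads.
    private final ExecutorService bulkhead = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(32),           // bounded queue, no silent backlog
            new ThreadPoolExecutor.AbortPolicy());  // reject rather than block callers

    private static final int FAILURE_THRESHOLD = 5;
    private static final Duration OPEN_FOR = Duration.ofSeconds(30);
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private volatile Instant openedAt;              // null = circuit closed

    public <T> T call(Supplier<T> request) throws Exception {
        Instant opened = openedAt;
        if (opened != null && Instant.now().isBefore(opened.plus(OPEN_FOR))) {
            // Circuit open: fail in microseconds instead of holding a thread
            // for four seconds. "Tax authority temporarily unavailable."
            throw new IllegalStateException("circuit open: failing fast");
        }
        // After OPEN_FOR elapses, requests fall through here as a crude half-open probe.
        Future<T> future = bulkhead.submit(request::get);
        try {
            T result = future.get(2, TimeUnit.SECONDS); // slow counts as failed
            consecutiveFailures.set(0);
            openedAt = null;                            // close the circuit on success
            return result;
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);
            if (consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                openedAt = Instant.now();               // open: let the pool breathe
            }
            throw e;
        }
    }
}
```

With the breaker open, callers get an immediate, clean failure, and everything outside the bulkhead keeps running.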

The integration layer had been written by a coding agent. I had reviewed the code. It looked fine. It was fine - under normal conditions. Under deadline-night load with a degraded external dependency, it was a trap.

Instinct is not talent. It's scar tissue.

A senior developer with production experience would not have built it this way. Not because she had memorized Nygard. Because she had been on call.

She had watched a retry loop turn a minor blip into a full outage. She had traced thread pool exhaustion through a system at 2AM, reading metrics in the dark, trying to find the bottleneck before the customers in Asia Pacific woke up. She had written the postmortem that said "retry logic did not account for synchronized backoff under load" and she had felt what it costs - the hours, the apology emails, the sprint that went sideways because half the team spent the week on incident review.

That experience doesn't live in documentation. It lives in her fingers. When she looks at an integration with an external dependency, she doesn't run through a checklist - she just knows something is missing. The absence of a circuit breaker registers the way a carpenter registers a joint that isn't quite square. Before she's consciously reasoned through it.

This is what senior actually means. Not more facts, not faster output. It's that failure modes are physically encoded. She carries the memory of systems that broke, and that memory shows up in decisions nobody explicitly asked her to make.

Instinct is a skill - probably the most important one a developer has - and it's built from consequence.

A coding agent has no consequence. It's never been on call. It has never had to look at its own code at 2AM and figure out what it got wrong. It processes patterns from documentation, Stack Overflow answers, GitHub repositories, technical books - including Nygard, probably. It knows what a Circuit Breaker is. It can implement one if you ask.

But when a spec doesn't mention resilience patterns, it doesn't feel their absence. It reaches for the most statistically common solution in its training data. Retry-with-exponential-backoff appears in thousands of code samples. Circuit Breaker with Bulkhead isolation appears in far fewer. So it writes what it has seen most often, in contexts where that pattern works fine. It doesn't know it's making a choice. It doesn't know what it doesn't know.

Every gap is a decision the agent makes without you.

This is the thing that's easy to miss when coding agents are working well.

When an agent produces correct, clean, well-structured code - and it often does - you see the output and you trust the process. What you don't see is everything that got decided in the gaps. Every place the spec was silent, the agent picked something. A pattern, a default, something to optimize for. Those choices are invisible because the code looks fine.

Until deadline night.

The agent isn't malicious or lazy. It's not cutting corners the way a junior dev might, rushing to finish and hoping the edge cases don't bite. The problem is more basic: it has no professional conscience, because professional conscience is built from consequence. It's never been responsible for a system in production. Nobody calls it at 2AM.

So it produces code that works under the conditions it was trained on. Under the conditions you'll actually face, it may not.

Human developers are tidy not because they were told to be. They are tidy because they have been the ones maintaining the mess. The instinct toward homogeneity, toward defensive patterns, toward specification that leaves nothing to chance - that instinct comes from pain. From systems that broke in specific ways that are now burned into memory.

The agent has no such memory. Every project is its first project.

The fix isn't a better prompt.

You can't prompt your way to production resilience. You can tell an agent to use Circuit Breakers - and it will use them, in that specific call, for that specific integration. The next integration, the next external dependency, the next place where a resilience pattern matters: if the spec doesn't mention it, the agent won't reach for it.

Here's the objection I hear: AI has read millions of postmortems. Stack Overflow threads, incident reports, engineering blogs about exactly this kind of failure. Doesn't that count as experience?

No. And the reason is simple: production is unpredictable in ways that documentation doesn't capture. A postmortem describes what happened after the fact, cleaned up, causally ordered, with a root cause that only became obvious in retrospect. The actual 2AM is noise, dead ends, false hypotheses, and a thread pool metric that didn't show what you thought it showed for the first forty minutes. That gap - between described failure and lived failure - is exactly where instinct lives. Reading about consequence is not the same as carrying it.

The answer is precision before the agent opens a file. Specs that say not just what a feature should do, but which patterns are required, which defaults are off-limits, how the system should behave when a dependency goes sideways. Not as a prompt - as a written constraint, the same way you'd capture it in an ADR or an interface contract.
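For illustration - the numbers are invented, the shape is the point - the constraint that would have changed this story might read:

"Every call to the tax authority submission endpoint runs on its own bounded thread pool, behind a circuit breaker: open after five consecutive failures or responses slower than two seconds, fail fast for thirty seconds, then probe. Retries use capped exponential backoff with full jitter, three attempts maximum. While the circuit is open, return a clean error to the client: data saved, try again in a few minutes."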

Not better prompts. Not smarter agents - though those are coming. The real leverage is upstream, before the agent makes any decisions at all.

Speclr is built for this side of the problem: capturing technical constraints in structured specs derived from discovery and planning, before the first line of code gets written.

Both made a decision at the spec gap. One of them had been on call before.

What the agent produced that night was not a mistake in any meaningful sense. It was a reasonable implementation of an underspecified requirement. The spec said "integrate with the tax submission endpoint." It said nothing about what happens when that endpoint is slow. So the agent picked the most common pattern in its training data, in codebases where that pattern works fine.

The experienced developer doesn't wait to be told. She's already seen this failure. She specifies against it before anyone asks, because shipping without that protection feels wrong to her the way leaving a load-bearing wall out of a floor plan feels wrong.

Coding agents will keep getting better. Pattern recognition, code quality, context handling - all of it will improve. But consequence won't transfer. You can't train a system to care about production by feeding it postmortems.

Which means the burden sits upstream. Specs that cover not just what a feature does, but how it fails. The gap between what you specified and what the agent decided is yours to close - before the agent opens the file, not after.

The agent isn't sloppy because it's dumb. It's sloppy because it's never been on call. And it never will be.

When your specs are precise enough that the agent has no architectural decisions left to make, you get the best version of what coding agents can do. Speclr is where those specs get written - from discovery through to technical constraints, before the agent opens a file. Try it now.

Tags

coding-agents, ai-development, resilience-patterns, technical-specs, spec-driven-development, requirements-engineering