
Three levels. One flat list. That's why refinement never ends.

Milica
Intermediate · 7 min read · Before the Backlog

There is a question that shows up in every refinement session I have ever been part of. Not the same words every time - but the same shape. "What should happen when the user doesn't have an account?" Or: "Does this apply to guest users too?" Or, the classic: "How do we actually know when this story is done?"

The team discusses. Someone makes a call. The story goes into the sprint. Four weeks later, the same type of question surfaces on a different story.

The pattern is familiar. What most teams don't see: the problem isn't missing detail. It's missing layers.

The question that never leaves

I've watched this happen in teams that are genuinely good at their work. Experienced developers. A product owner who knows the domain. Someone on the team who actually cares about completeness. And still, refinement keeps producing the same categories of questions, session after session.

Not because people come unprepared. Because the answers to those questions are systematically absent from the backlog - in every story, on every project, regardless of how much effort the team puts in.

The pattern I see most often: it's never random questions. It's always questions of the same kind. Questions trying to establish what "done" actually looks like for this specific situation.

Not a communication problem. A structure problem.

The fix that doesn't fix anything

The obvious diagnosis is missing detail. More acceptance criteria. Stricter templates. A required field in Jira that has to be filled before a story can move to the sprint.

I understand that reflex. It feels productive. And it improves things - slightly, temporarily.

But it doesn't solve the problem. Because the problem isn't depth. It's layering.

An epic is not a large user story. A scenario is not a poorly written acceptance criterion. They are different things at different levels - and teams that don't separate them end up mixing three distinct languages in one text field: planning, specification, testing. All collapsed into a single Jira description.

More text in the same flat structure doesn't help. It just makes the right questions harder to find.

Three languages that look like one

Requirements have three natural levels, and each answers a different question.

The first level describes the user goal: what does the user want to achieve? This is the epic. It's intentional, not functional. "User can sign in to the system" is a user goal. It describes the value - not how the system delivers it. This is planning language. At this level, you make strategic decisions: what belongs in the product, what doesn't, what comes first.

The second level describes system behavior: what should the system do? This is the user story. It's the contract between product and engineering. "As a user, I want to sign in with my email and password so that I can access my account." It names a specific interaction path - but not yet what success looks like for every individual case.

The third level describes the observable outcome: how do you know it works? This is the scenario. It's the only level that is testable. "Given the user enters valid credentials - When they click Sign In - Then they are redirected to their dashboard." No interpretation required. No implicit assumptions. An observation that is either true or it isn't.
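Written out in Gherkin, the plain-text format most BDD tools read, that scenario is nothing more than the sentence above with keywords made explicit (a minimal sketch; the step wording is taken from the example, not from any real project):

```gherkin
Scenario: Sign in with valid credentials
  Given the user enters valid credentials
  When they click Sign In
  Then they are redirected to their dashboard
```

Note that there is exactly one observable outcome per scenario. The moment a second "Then" about a different behavior sneaks in, you're back at level two.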

All three levels describe the same feature - but each answers a different question. Keeping them in one text field doesn't produce bad requirements. It produces a structure that makes the right questions unaskable.

You already know this model

Here's what I find interesting about the three-level model: it isn't new. Most teams working seriously with requirements already understand it - they just haven't named it.

If you've used User Story Mapping, you've seen it. Jeff Patton's map starts with Activities - the goals users try to accomplish. Underneath each activity are Steps, the specific interactions that support it. And below those: Details, the variations, edge cases, the "what happens when." Level one, level two, level three.

If you've worked with BDD, you've seen it there too. An epic sits above the file structure - it groups a set of Feature files around one user goal. Each Feature file then covers a coherent piece of system behavior: one capability, one interaction area. That's Level 2. Within each file, individual Scenarios are the testable units - Given/When/Then, a single observable outcome, no interpretation required. That's Level 3. BDD just names its layers differently. The structure is the same.
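A sketch of what that layering looks like on disk, assuming a sign-in capability (the file name, message text, and second scenario are illustrative assumptions, not a prescribed structure):

```gherkin
# sign_in.feature - one Feature file = Level 2: one coherent capability.
# The epic ("Authentication") sits above this file, grouping it with
# sign_out.feature, password_reset.feature, and so on.
Feature: Sign in with email and password
  As a user, I want to sign in with my email and password
  so that I can access my account.

  # Each Scenario = Level 3: one observable, testable outcome.
  Scenario: Valid credentials
    Given the user enters valid credentials
    When they click Sign In
    Then they are redirected to their dashboard

  Scenario: Unrecognized email address
    Given the user enters an email address that doesn't exist in the system
    When they submit the form
    Then they see the message "We don't recognize that email address"
```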

And if you've ever run a Three Amigos session, you've had the three levels in the room - you just didn't label them. The product owner speaks in goals: why the feature exists, what user value it's supposed to deliver. The developer speaks in behavior: what the system would actually need to do. The QA engineer speaks in scenarios: what would have to happen for this to be verifiable. Three people, three levels, one conversation. The friction in a bad Three Amigos session is almost always that everyone is talking at a different level without anyone noticing.

The three-level model isn't a methodology sitting alongside User Story Mapping, BDD, and Three Amigos. It's the structure all three already assume. Teams that work with any of these practices have the model in their hands - they've just never seen it written down as a model.

And that's where the trouble starts. Without a shared model, there's no shared language. Without shared language, each person writes at whatever level feels natural to them - the product owner writes at level one, the tech lead writes at level two, the QA engineer writes at level three. Everything lands in the same field. The result looks like a requirement. It isn't one.

What happens when the layers collapse

Take a concrete example. A team is building authentication. In the backlog: an epic called "Authentication." Beneath it, a user story: "User can log in." Beneath that, three acceptance criteria:

  • The user can sign in with email and password
  • Invalid input is reported
  • The system is secure

All three bullets sit at the same level, nested under one ticket. But they aren't the same thing.

The first is a user story written as an acceptance criterion - it describes system behavior (level two), not an observable outcome. The second gestures at a scenario but defines nothing testable: which invalid input? What does "reported" mean - a toast, an inline error, a redirect? The third is a non-functional requirement. It doesn't belong in a user story at all.

What happens in refinement? "Does 'invalid input' include an unknown email address, or just a wrong password?" - "What about rate limiting - is that in scope?" - "Where does the error message appear?"

Every one of these questions is legitimate. All of them could have been answered before the sprint. A scenario covering "user enters an unrecognized email address" makes the first question unnecessary. A separate NFR document pulls the security requirement out of the ticket. An explicit third level separates testability from plannability.
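Re-sorted into levels, the same three bullets might look like this. This is a sketch under stated assumptions: the error placement and message wording are decisions the team would still have to make explicitly, which is exactly the point.

```gherkin
# Level 1 (epic, planning language): Authentication
# Level 2 (story): As a user, I want to sign in with my email and
#                  password so that I can access my account.
# "The system is secure" leaves the ticket entirely - it belongs in
# a separate NFR document, not in a user story.

# Level 3 (scenarios, test language) - wording below is illustrative:
Scenario: Wrong password for a known account
  Given the user enters a recognized email with the wrong password
  When they submit the form
  Then they see an inline error next to the password field

Scenario: Unrecognized email address
  Given the user enters an email address that doesn't exist in the system
  When they submit the form
  Then they see the message "We don't recognize that email address"
```

Each refinement question from above now has an address: "which invalid input?" is answered by the list of scenarios, and "where does the error appear?" is pinned down inside each one.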

Instead, everything lives in one field. The developer interprets what was meant during implementation. Sometimes correctly. Sometimes not.

Planning language is not test language

The distinction that matters is this one.

Planning language describes intent. "User can sign in" is intent - it says why the feature exists. You cannot test an epic. You can only decide whether it has been fully addressed.

Test language describes observation. "Given the user enters an email address that doesn't exist in the system - When they submit the form - Then they see the message: We don't recognize that email address" is an observation. It's either true or it isn't. Nothing in between.

When you write both languages into the same field, you get neither a good plan nor a testable specification. You get a document that looks like both and functions as neither.

Refinement questions don't decrease when you write more. They decrease when you start writing at the right level - when epics describe intent, stories describe behavior, and scenarios pin down what you can actually observe. The structure doesn't add bureaucracy. It removes the ambiguity that refinement has been compensating for all along.

Tools like Speclr treat these three levels as separate objects - epics, stories, and scenarios each live in their own space, with their own fields, not as nested bullet points inside a ticket. The separation isn't a feature. It's what makes the model actually work instead of staying a concept on a whiteboard.

If you want to see what a three-level backlog looks like in practice, try Speclr - no setup, no prior requirements document needed.

Tags

requirements-engineering · user-story-mapping · bdd · backlog-structure · refinement · three-amigos