Laminar

The Cost of Long Queues
Why queues matter in software development: Part II of a series

This is part two of a series. Part 1, Two Weird Things, aims to give you a better intuitive feel for how randomness makes queues likely to grow. In this part, you’ll explore the surprising and escalating costs of unmanaged queues. In the next part, you’ll see how to avoid long queues in software development processes.

In part one of this series, we showed you that queues have a tendency to grow much longer than we intuitively expect. Queues are everywhere in software development processes: any time an item of work is waiting somewhere and not actively being worked on, it is in a queue. Any time we move work from one state or activity to another, and especially when the next activity is performed by another person or another team, there is an opportunity for a queue to form. Examples of queues in software development processes include: the list of pending pull requests; the product and sprint backlogs; the list of changes in a given release or deployment.

Why is it a problem if queues get long? Let’s revisit our airport check-in queue. The longer the queue gets, the more dissatisfied our customers get. Our queue may become long enough that people cannot get checked in and through security to their gate in time; we need additional staff to walk through the queue, calling forward all the passengers for flights soon to depart for expedited check-in. If the queue grows long enough, even expediting won’t help us, and people will start missing their flights. If this happens too often, the airport ceases to function. So there is a cost to having people stand in line for longer, and that cost escalates and has inflection points where additional costs arise.

In software development, there are also direct costs to queues. The longer a product backlog becomes, the more maintenance it requires: Is this story still relevant? What does this comment from Eliza 18 months ago mean? Is that market research still valid? The longer feature requests wait, the more people ask for status updates, and then someone has to find out and report back the status. The more stuff we’re confronted with, the more effort we have to expend on determining what that stuff is, and what of that stuff is the best thing for us to pick up next1. Those are some of the direct costs of having long queues. Some people call these the ‘carrying costs’.

But there’s also a significant indirect cost. We call it cost of delay2. In the case of a product backlog item, its cost of delay is the amount of value that could be realized per day [or hour, or minute], for each day [hour, minute] that it is delayed and not moving through our process.

To take a concrete example: we think that allowing users of our website to ‘like’ each others’ posts will lead to a significant increase in their engagement, and hence (e.g. via advertising impressions) to our profits. After the feature is finally deployed, our analysts tell us that we have made an additional €1000 per week as a result of launching it. Unfortunately, there were various points in the development of this feature where it was in progress, but waiting. After the PM, designer, and tech lead had drafted the 5 constituent stories, the engineers weren’t available to start working on them for another 4 weeks. For each story, a pull request waited an average of 4 hours for code review while potential reviewers worked on other things. After code review, rework waited for an average of 2 hours per change while the engineer responsible was working on the next code change. And so on. Before long, we’ve amassed a comparatively huge amount of wait time. Maybe we only worked on this feature for one-person week, but it was waiting somewhere in our system for 5 weeks.

The more items there are waiting in a queue, the longer it takes for a new item entering the queue to receive attention and be actively worked on. So the longer the queue, the more delay cost there is associated with it. Cost of delay can quickly become a huge deal, and quickly outweigh all other costs.

Let’s run some numbers to see why. Imagine we pay a sole developer €50 per hour, i.e. €2,000 per 40-hour work week3. Let’s also assume that the ongoing value of each feature they deliver, once released, is €2,000 per week. Then let’s imagine there are 10 features in the development backlog, each requiring 1 week of that developer’s time to complete. A new item entering the queue will need to wait 10 weeks for the features ahead of it to be completed before it receives attention, so the cost of delay we pay on that new item is €20,000. In those weeks, we’re also paying the cost of delay on every other feature waiting in the queue – €18,000 for the 10th feature that waited 9 weeks, €16,000 for the 8th feature, and so on. So in the time it takes for that new feature to reach the front of the queue, we have paid €90,000 in delay cost for the other features ahead of it in this queue. Meanwhile, 10 weeks of developer time have passed, costing us €20,000.

Below, you can insert your own estimates for how much features cost, how many items are in the queue, and how long they take to complete, so that you can get a feel for how cost of delay will mount up.

Backlog contains 10 items
Each item is expected to earn €2,000 / week
Each item costs €2,000 to develop.
This illustration shows how items waiting in a queue accumulate cost of delay during the time they are waiting. Items further back in the queue accumulate higher of cost of delay. The cost of development becomes smaller in comparison as the cost of delay accumulates. This idealized scenario assumes that the queue is worked on until it is cleared and no other items join the queue during that time.

Maybe you ran the numbers and weren’t too concerned about cost of delay. Perhaps you only had 5 items in your queue, and they were worth something like €100 per week. In that case, your staffing costs might have been higher, or comparable, and you’re wondering why I’m trying to make you care about cost of delay. That’s fair, but remember: queues grow if left uncontrolled, so 5 items is not a state you can expect to maintain for long without intervention.

Also note that we’re only talking about one queue here, and a typical development process has multiple stages, with multiple opportunities for a queue to form and delay cost to be paid4. Rather than thinking about a single-step process with only three states (waiting, being worked on, complete), we need to think about a typical development process, with multiple stages, multiple hand-offs, and multiple queues. When that’s our situation, it’s not uncommon for the time an item is being worked on to be a fraction of the total time it’s in the process — it spends far more time waiting. And since uncontrolled queues have a tendency to grow, this problem will get worse over time even if you have the most productive team in the world.

The other crucial thing that’s missing in the simulation is the fact that while we’re working on features, new feature requests can come in and be added to the queue. Below, we’ve paired our queue growth simulation from part 1 with a calculation of cost of delay. Observe how the queue’s uncontrolled growth — even if we have similar rates of arrival and service — causes cost of delay to increase, and how even if we recover and our queue size drops to zero, we have still paid a high cost of delay. And we’re not even estimating the direct costs of the queue that we mentioned at the start of the article.

On average, 5 items arrive per week.
On average, the team can complete 5 items per week.
Each item is expected to earn €2,000 / week
Development time costs €10,000 / week.
This illustration shows the behaviour and costs of a backlog operating over time, where items enter the queue at random intervals and development velocity is subject to random variablity. Such queues operating at nearly 100% utilization have a tendency to grow, which means the cost of delay per item grows. For simplicity, development times and costs have been applied uniformly, not accounting for working hours, and cumulative costs are computed per item completed.

Note that it is possible to address the problem of growing queues by making teams more productive. If we can add people, or train the existing people, we can increase the amount of stuff the team can handle in a given time period. If we don’t correspondingly increase demand on the team (and that’s a pretty big ‘if’, in my experience) we would expect to see shorter queues.

Two problems with this: first, it’s expensive. Training costs money. Extra people cost money. Retaining good people costs money. Second, there are limits to how much you can increase capacity — even if you have the best programmers in the world, there is a limit to the speed at which they can solve problems until a new programming paradigm comes along; there is a limit to the intensity at which even the most resilient, dedicated and time-rich people can work; and there is a limit to the size of a team before it becomes too difficult to coordinate. If there are cheaper ways to achieve the same goal of shorter queues, which are not subject to limits, we should explore them.

Meanwhile, demand on the team is potentially unlimited. There is always something else we could be working on. Worse than that: as we add capacity to the team, demand is likely to increase, because stakeholders ask “we’ve added people to this team, shouldn’t we be able to do more now?” Adding capacity works only in its relation to demand — if, after training, our team can now deliver 6 rather than 5 items per week, but our stakeholders are also asking for an extra item per week, our training has done nothing to control the size of the queue.

I’d also like to note, in brief, the human costs of processes with long queues. Our focus so far has been on directly measurable bottom-line costs, but there are significant downsides for the humans working in a process with long wait times. If developers are working on features which the designers worked on months ago, then both sides are robbed of an opportunity to work together closely. If anything, the developer coming to the designer wanting to revisit the design because it doesn’t work well with the technical solution — theoretically an opportunity to produce a better design together — will result in frustration for the designer, who struggles to remember their work from all that time ago, and hasn’t planned the time for rework. This pattern repeats anywhere that work is handed off from one person to another, but waits in between: the waiting time causes frustration on both sides, incentivizes looser collaboration and lower quality, and robs people of the opportunity to work closely together and learn from each other.

We need to avoid the scenario where our queues continue to grow and our costs eventually (but sooner than we intuitively expect) reach the point where they outweigh all other costs. The loss of revenue does not show up as a negative item on our balance sheet. Instead, you see a decreasing ability to innovate, to satisfy customer demand, or to compete with a brand-new entrant to our market, eventually posing existential risks to the organization. If we want to avoid it, we must do something to prevent queues from getting out of control. That’s what we’ll cover in part 3.

Tom Stuart and Tiffany Conroy are consultants with a combined 35+ years of experience developing software and leading teams. They help software teams change their processes to eliminate or reduce queues in their development cycle. Hire them! Get in touch at hello@golaminar.de.

1: I’m quite certain that by now there are readers who are practically screaming at me: “but queues in product development aren’t like airport check-in queues! They’re not first-in, first-out. We prioritize them.” This is absolutely correct, and a full response to this critique is beyond the scope of this article. In lieu of a full response, I’ll say some provocative things here and promise to follow up with a full examination later. First, it is unlikely you are prioritizing optimally. Second, prioritizing optimally only mitigates the cost of queues. It doesn’t solve it. Third, there is a cost to prioritizing which scales non-linearly with the size of the queue.

2: The good people at Black Swan Farming have explained cost of delay far better than I do here, and even have a handy video explainer. I recommend you check out that explanation too.

3: €50 per hour, 40 hours a week, translates to €92,000 for a full year (subtracting 6 weeks for annual leave and public holidays). Taking into account other costs of having engineers on staff, this seems like a pretty reasonable ballpark figure.

4: In that typical development process with dependent activities, our problems can become even worse, because even if step 4 is running well and able to work at high capacity, a slowdown at an earlier step can starve the process at step 4 of work. An example to illustrate: Assume you have a process with 6 steps. Each step can process between 1 and 6 items per day. Each day, we throw a dice for each step to determine how many items they could process that day. How many items will the process deliver per day, on average? 3? 3.5? It’s 1.436!