Avoiding Long Queues

Avoiding Long Queues
Why queues matter in software development: Part III of a series

Tom & Tiffany 17th January 2023

This is part three of a series. Part 1, Two Weird Things, aims to give you a better intuitive feel for how randomness makes queues likely to grow. Part 2, The Cost of Long Queues, outlines how uncontrolled queues lead to escalating costs.

In previous posts, we saw that queues which we expected to remain under control in fact tended to get very long over time. Now that we understand cost of delay, we can see the danger to an organization that delivers value by releasing software. As the queues grow, our delay costs escalate.

Without intervention, a process that we intuitively expected to deliver value in a timely fashion, and continue to do so sustainably, actually leads us over time to a position where we cannot release enough value to cover our costs, or where our competitors beat us to the punch every time.

How can we avoid such a situation? Three ways:

Remove opportunities for queues to form, if possible;
Make hand-offs synchronous;
Limit capacity utilization.

Remove opportunities for queues to form

Queues can form throughout your development process wherever there is a hand-off from one person to another, or from one type of work phase to another. If there is an opportunity to remove such a hand-off, you should seriously consider taking it. For example: do you really need to manually test every change before it can be released to beta testers, or would it suffice to manually test in the beta version?

Take an inventory of all the points in your process where a change waits before someone carries it to the next stage, and evaluate what benefits that activity brings, and whether there is an way to achieve the same benefit without preventing a change from being integrated and deployed.

Make hand-offs synchronous

If a hand-off is unavoidable, consider making it a synchronous, blocking action rather than asynchronous. For example, if your compliance regulations require that a second person signs off on a code change, make it someone’s priority to perform such sign-offs ahead of any other work, so that they are aways available to do so, and make sure that the author of the code change doesn’t start on another task (adding to the queue) while waiting for the signoff.

Other examples of unavoidable hand-offs that can benefit from being made synchronous include: security reviews; deployment to production; code review; and acceptance testing with the product owner.

Limit capacity utilization

If the hand-off is unavoidable and it’s not feasible to make it a synchronous, blocking action, you should limit capacity utilization.

“Capacity utilization? What’s that? Just when you were starting to get practical and make some concrete recommendations, you throw in another theory term?”

Yes, sorry about that. We need to take another quick dive into queueing theory to understand what I’m talking about here, and then I promise it will get practical again.

In queueing theory — and maybe you saw this when playing with the airline check-in interactive illustration in Part 1 — there is a relation between the expected size of a queue and the extent to which the server’s capacity is used. If a server can serve a customer in 10 seconds on average, and customers arrive once every minute on average, the capacity utilization is 1/6 or 16.7%. The server will be idle most of the time, and the queue will stay under control and stay very short. At the other extreme, perhaps obviously, if customers arrive every 5 seconds on average and the server still takes 10 seconds per customer (a capacity utilization of 200%), the queue will grow over time with no upper bound. The exact shape of the relationship between capacity utilization is an exponential relation, as the graph below shows.

The graph shows the steady-state expected length of a queue against its capacity utilization. As capacity utilization approaches 100%, the length of the queue extends to infinity. The expected length of an M/M/1 queue is given by p²/1-p, where p is capacity utilization.

What we can see in the graph is that when capacity utilization is at or below roughly 80% (i.e. the server takes 8 seconds per customer, and customers arrive every 10 seconds) the queue size is fairly stable. Once we pass that point, we start to be on the steep part of the exponential curve, and the queue size quickly gets to unsustainably high levels, with correspondingly high cost of delay. This accounts for those surprising numbers from part 1, where a server with an average service time of 120 seconds, serving arrivals with an average interval of 121 seconds (capacity utilization: 120/121 = 99.17%) had an expected average queue size of 118, whereas two servers working the same queue at the same rate (capacity utilization: 60/121 = 49.59%) had an expected average queue size of 0.3.

Since capacity utilization has an exponential relationship with queue size, anything on the right of that roughly 80% mark looks pretty scary and anything on the left looks pretty reasonable. Comparing the queue sizes at 65% and 80% capacity utilization, there isn’t a huge difference. Comparing between 75% and 90%, there is. So it seems a good rule of thumb to shoot for capacity utilization for our processes of roughly 80%, but not to worry too much about being exact.

How to limit capacity utilization?

So we agree on the value of limiting capacity utilization, and we have a target of 80%. How do we achieve that? Maybe we can try to control it directly. Capacity utilization is arrival rate divided by service rate — demand divided by supply. Assuming a fixed team size, we can’t do much to control supply, but we might try to control demand directly — by planning in buffers to only accept 80% of our forecasted capacity in any given period. This approach has drawbacks, or breaks down entirely, when applied at both the team level and the individual process level.

Let’s take a look at the team level: a Scrum team could decide they want to control capacity utilization by only taking on 80% of the story points their velocity suggests they can handle in an average sprint. And indeed, if they could do this accurately and not take on any unplanned work during the duration of their sprint, they would be controlling capacity utilization at 80%. For this approach to work, we have to spend a lot of time forecasting and estimating, and every time the composition of the team changes, our velocity measurement is no longer valid. If there is a simpler approach with less overhead and more ability to respond to change, we should explore it.

When we look at the lower level of the constituent processes which make up software delivery, we find that we do have some ability to control supply. Take code review as an example. The maximum capacity for this process is achieved when all developers exclusively work on code review and do nothing else, and the minimum capacity is when no one ever performs a code review. We could theoretically mandate that developers spend exactly the right amount of time on code review so that there would be 80% demand on this process. Unfortunately, that would require us to know in advance how long code reviews would take — some function of the amount of code written, its complexity, and the level of alignment in the team about what’s important in code. I am not aware of any techniques, and I struggle to imagine any, which would enable us to forecast this before the code is written. In this case, it seems impossible to control either supply or demand to achieve a capacity utilization of 80%. If we want to optimize for throughput at each constituent stage of our process, we will find that a direct approach to controlling capacity utilization is not feasible.

With multiple queues throughout our processes and the inherent unpredictability of software development (you might say the inherent unpredictability of life) direct approaches seem doomed to failure at first contact with reality¹. Fortunately, there is a way to control capacity utilization indirectly.

Controlling capacity utilization with WIP limits

Below is a simulation of a single-step process where we apply a work-in-progress (WIP) limit to the queue. You can change the WIP limit to be applied, and let the process run its course. What you’ll see is that over time, a WIP limit creates situations where there is excess capacity available, but there is nothing in the queue waiting to be served. So instead of the team being occupied 100% of the time (100% capacity utilization) they are sometimes waiting for short periods of time (less than 100% capacity utilization).

Start of Iteration 1

Backlog has no WIP limit All items wait in the backlog until they are eventually completed.

Backlog

Done

The team is nearly always working at capacity, but the backlog grows uncontrolled.

Backlog is limited to 6 items Excess items are rejected when the backlog is above its limit.

Backlog

Done

The backlog is never allowed to grow. Occasionally, the team has excess capacity.

This illustration shows an example of how a backlog grows over time when it is unbounded compared to one that is limited in length. When a backlog is left to grow unbounded, the team will always be working at full capacity, and might never have the opportunity to catch up. Under these conditions, the backlog grows, causing the newest items to be left waiting longer and longer, accumulating high cost of delay.

How does this happen? Again, it’s all about randomness. The team deals with 5 items per cycle on average, but not the same amount every cycle. That means that in some cycles, they are able to deal with 7 items. Other cycles, they can only deal with 3. With no WIP limit on the queue and an arrival rate close to the service rate, there will almost always be something left over at the end of a cycle for them to work on and their capacity utilization will be at or near 100%, so the queue will grow. With a WIP limit on the queue — let’s say it’s 6, as we set it by default — there will be times when more items come in and we reject demand, capping the queue size at 6. It’s possible that those times coincide with the times when the team is able to work quicker than average, leading to them running out of work before new items come into the queue and therefore having to wait (less than 100% capacity utilization.)²

Rejecting work and idle hands?

“OK,” you might say, “but how does this look in practice? It sounds a bit scary when we say that we ‘reject demand’ and that the team has to wait. The whole point of this is supposed to be that we improve our process so that we can meet our users’ and stakeholders’ demands and that we work effectively, isn’t it?”

That’s a very reasonable question. First, it’s important to remember that while we’re illustrating the workings of WIP limits with a single queue, in reality we are dealing with multiple adjacent queues, and they are being worked on by multi-talented people. So when, for example, the ‘actively being programmed’ column is at its WIP limit, and no new code can be written, this does not mean that our developers have nothing they can do and are forced to sit idle, rejecting any new work that comes their way. Instead, they can look at the adjacent columns (ideally the columns to the right) and find work needing to be completed there. It helps the team shift their focus away from being busy and starting lots of things to being productive and making sure items get finished.

Second, although it can happen that the team (or an individual) runs out of work and is made to wait because of WIP limits, this is not a situation that you can expect to occur for prolonged periods of time. 80% capacity utilization doesn’t manifest as people working for four days per week and then sitting idle on a Friday, waiting for new work to be offered. Instead, you would expect to see people waiting ten minutes at a time while the process they depend on clears its queue so that they can move on, and multiple short waits like this add up to 20% of their time. The team is focussed on finishing things, not starting them.

Avoid the queues you can, control the queues you can’t

So in summary, what do I suggest you do to avoid long queues, and hence avoid high costs of delay while staying nimble and responsive to your market demands?

Recognize the massive impact of cost of delay. Build it into your financial reporting if you can.
Whenever you see an opportunity for a queue to form, ask: is it possible to avoid this queue? If we can’t avoid the queue, can we make the asynchronous hand-off into a synchronous and blocking operation?
Where queues exist, introduce slack into the system by reducing capacity utilization. Don’t try to do it directly — consider using WIP limits instead.

Establishing good WIP limits

So, you are convinced. Now you need to establish WIP limits for you team. How do you go about that?

Much has been written on this topic, and we had hoped to be able to link to a ‘ready-made’ article outlining a recommended to setting and monitoring work-in-progress limits. Unfortunately, we found that we could not link to any of the articles we found without some additional commentary and/or caveats.

So instead, we would like to invite you participate in a live reading and discussion session where we can digest the articles we found as a group, share our learnings, and discuss how the articles’ guidance matches up with the goals of establishing WIP limits, and how the suggested approach would land with your teams. The session will take place on the 2nd of February, and we’d love to see you there. We will share the list of articles ahead of time, as well as discussion prompts, to ensure a lively and thoughtful conversation. Sign up now to attend!

Join us for a reading and discussion session about introducing WIP limits.

If you are looking for support on these topics in your workplace — from promoting these ideas and conducting investigations, to hands-on training and change management — we’re available. Book a time to chat with us!

Acknowledgements

We were very grateful for feedback, comments, and other input provided by our friends and colleagues: Esther Weidauer, Justine Lera, Denis Defreyne, Fronx Wurmus, James Coglan, Alicia Hickey, Benjamin Schumann, and Michele Guido.

Tom Stuart and Tiffany Conroy are consultants with a combined 35+ years of experience developing software and leading teams. They help software teams change their processes to eliminate or reduce queues in their development cycle. Hire them! Get in touch at hello@golaminar.de.

1: Some of you might be looking at the recommended 80% capacity utilization and be thinking of schemes like 20% time, in which team members are allowed to use their Fridays to work on self-directed projects outside of the main backlog, and wondering if this is a good way to control capacity utilization. I don’t recommend this. We want to control capacity utilization so that, in the times when there is unexpected extra demand on the team, the team has a higher chance of being available to work on that demand. If we make extra headroom available by giving them 20% time, we can only get this benefit by expecting them to sacrifice the 20% time when it’s a busy time.

2: What about controlling and reducing randomness? Six Sigma is a thing, right? Unfortunately, we aren’t manufacturing bolts or performing repetitive service tasks. Software developement and operation is inherently unpredictable. The value of what we do is often in direct proportion to how novel — and hence how unpredictable — it is. It’s therefore not desirable to reduce randomness. Even if we could reduce randomness, that wouldn’t help us at high capacity utilization levels; lowering the variability will shift the inflection point of our curve, but so long as there is any variablity, the queue length will always be unbounded as capacity utilization approaches 100%.

Laminar