Want To Ship Features Faster? Fix All Your Bugs
How fixing all bugs with an SLA can increase feature velocity
For many companies, launching new features can mean the difference between survival and insolvency. Doing anything outside the critical path to revenue can put you out of business.
In this scenario, the natural instinct is to only fix bugs that are critical right now.
However, doing this continually is like taking out a payday loan each week to repay the last; the cost will quickly dwarf whatever benefit came from having money early.
The problem is that bugs very quickly get a lot more expensive to fix, easily doubling or more within weeks.
There is an alternative: Fix all bugs within one development iteration (i.e., sprint) or mark them as “won’t fix.”
I do this now. At first it felt wrong to work on lower-priority fixes before important new features. However, the benefits of bug-free development soon emerged:
Roadmap planning becomes easier because there are fewer interruptions and no need to allocate time for backlogged bugs.
You stop having to make daily decisions about when to fix each bug.
You don’t have to worry about customer emergencies caused by bugs, which reduces stress for everyone and improves the customer experience.
Ultimately, the rate of new bugs goes down as developers invest in better test automation.
Fixing all your bugs quickly is like making your bed in the morning. If you’re going to do it, it’s best to get it done right away so there is one less task hanging over your head.
In the rest of this article, we look at why bug cost escalates so quickly. We then share actionable strategies for classifying, prioritizing, and tracking bugs with an SLA to minimize their total cost and increase feature velocity.
Why is waiting to fix bugs so expensive?
Emergencies are costly
If you only fix bugs that are having an immediate impact, then every bug fix will be urgent. If a key customer is calling you about a bug, then they expect a quick resolution.
In the previous article on Calculating Your Interruption Tax, we explored why increasing levels of urgency amplify the cost of a task. The same bug may cost 4x as much to fix if someone is paged in the middle of the night, or 2x as much if it has to be done the same business day.
When you defer bug fixes until they become urgent, you make each one a lot more painful.
People lose knowledge over time
If you’re writing code, the best type of bug is one where your editor underlines it in red the second you finish typing. You know what you’re trying to accomplish, and you can quickly address the issue without interrupting your flow.
As time goes by, it becomes increasingly difficult to isolate and fix bugs. First, you lose your working memory and have to re-familiarize yourself with the surrounding code to understand the problem. Then, you forget more and more as weeks go by.
Once enough time elapses, it can be hard to even know who the best person is to fix a bug, or the person with the best knowledge may no longer be around. Either case can turn a bug fix that would have taken ten minutes into a week-long affair.
Even bugs gain dependencies
For some bugs, the size of the fix might be the same whether you fix it now or later. For others, you may build a lot of other functionality on top of a flawed architecture that you have to rip out to fix the bug.
The problem is that it’s hard to tell which is which until you fix the bug.
If you wait a long time, then some bugs become really nasty. This creates a risk that if a bad bug becomes an emergency, you won’t be able to fix it fast enough and might lose a customer.
If you fix bugs right away, not only is each fix easier, but you eliminate this deeper business risk.
Customers suffer
While we’re mainly concerned with the impact on feature development, even minor bugs chip away at the customer experience.
It’s frustrating, but I’ve heard customers say that my software was “buggy” after encountering only a few issues I considered minor, like unusually long text overflowing off the screen.
The more small bugs you have, the lower the overall perception of quality, which can have a real but difficult-to-measure effect on revenue.
Though it’s hard to quantify the customer impact of having many minor bugs, the cost of zero bugs is easy to calculate: it’s zero.
Employees suffer
Having open bugs also places a burden on employees. At the very least it increases time spent receiving bug reports and triaging them to determine if they are duplicates.
Certain bugs also waste time for developers and others in the organization. Anything that causes alarm noise, internal system failures, manual workarounds, etc. takes a toll on people and ultimately slows down value-adding work.
I once interviewed an engineer whose 100-person company had an entire team of developers just working on scripts to patch over data corruption issues caused by unfixed bugs. Don’t let this be you!
Backlog management overhead is substantial
If you regularly defer bugs, then you have to spend time managing the bug backlog. The more issues in that backlog, the longer it takes.
You pay this cost each time you look at the backlog, and it gets even harder as bugs age and you lose context.
Not having a bug backlog avoids this.
Deciding whether to fix each bug takes time
If you fix (or decide not to fix) each bug immediately, then your decisions are easy.
If you don’t fix your bugs right away, then you have to decide when to fix each one. To do this, you have to analyze the impact and compare it to the value of new feature work.
This cross-comparison between bug and feature value adds another layer of planning complexity for product managers. Not doing it frees them up to spend more time solving problems for customers.
Not fixing bugs creates a moral hazard for developers and managers
In addition to its direct cost, deferring bugs has an indirect cost that compounds over time: it reduces the incentive to test code well in the first place.
If developers know they have to fix bugs right away, it’s easy for them to decide how much effort to invest in test automation and manual QA.
On the other hand, if the cost of fixing bugs won’t occur until some time in the future, it’s harder to decide what testing is worthwhile right now because you don’t have regular feedback about the cost.
This moral hazard falls just as much or more on management. If they are used to timelines not including bug fixes, they’re liable to pressure developers to sacrifice test automation, which has even less visibility than bugs.
What to consider when prioritizing bugs
So far we’ve talked about bugs in an abstract sense, but concrete guidelines are necessary for putting abstract ideas into practice.
In reality, bugs have different severity levels, and fixing something that’s hurting customers right now is more important than addressing a latent issue.
Also, even if you accept the general idea of fixing bugs within one sprint, there will be varying priority levels within that time window.
You also still have to draw the line between the bugs you fix and the ones you mark as won’t fix.
With this in mind, an effective strategy for prioritizing bugs should consider the following principles.
Urgency avoidance
The current customer impact of a bug is usually pretty clear. What people often don’t think about, however, is the risk of a bug becoming urgent in the future if circumstances change.
A useful thought exercise is to think about what would happen if the bug came up during an important customer demo. What would that customer think? Would the demo be totally derailed? Would it give the customer a bad impression? Or, would they not care even if they noticed?
This is similar to making other ethical decisions. A lawyer friend of mine always advises his clients by asking: what would this look like if it were on the front page of the New York Times?
By the urgency avoidance principle, you should prioritize latent bugs just below how you’d prioritize them if they were actively affecting important customers.
This minimizes the number of bugs that become urgent in the future or resurface after being marked won’t fix.
Would you ship new code with this bug?
To avoid the moral hazard problem, you should ask yourself whether new code having the same bug would fail your organization’s quality standards. Is it something you’d fix if you knew about it before launching a new feature?
If the answer is yes, then you should fix it. Otherwise, whatever quality standard you claim to have will deteriorate because you are applying a double standard to new and old code.
Consider internal costs
People usually account for customer impact when prioritizing bugs, but it’s important to look at impact on employees too.
Issues with internal systems like alarm noise, user tracking inaccuracy, or build system failures often take a back seat to customer problems because they don’t affect revenue.
However, internal problems can have a major impact on velocity, and development teams should be empowered to prioritize them accordingly.
Don’t mark a bug “won’t fix” unless you really won’t fix it
One way to follow the approach suggested here by the letter but not in spirit is to just mark bugs that aren’t having an immediate impact as won’t fix and wait for them to pop up again.
This defeats the purpose of a “fix all bugs” strategy. Use the won’t-fix option judiciously: before deciding not to fix a bug, ask whether you would ever want to fix it without a fundamental change to your standards or resources. If you would, it is not a won’t-fix bug.
Bug priority levels
So far we’ve focused on the decision about whether to fix a bug or not, but it’s also important to have different priority levels for bugs that you do decide to fix.
When establishing priority level guidelines, the goal is to balance urgency avoidance (since urgently fixing a bug is more disruptive) with addressing important issues quickly.
It’s also important to have clear and simple guidelines for priority levels so that everyone agrees about what qualifies for each priority level and how to handle each one.
I’ve had success with the following levels. You might use different names, but the important thing is what the priority levels mean.
Critical - Someone will be paged and start working on the bug immediately. Example: site outage.
Urgent - Stop working on whatever else you’re doing and fix it right away, but during business hours. Example: one customer is locked out of their account.
High - Start on it next after your current task, but within one business day at the latest. Example: a small group of customers can’t use a minor function of the software.
Medium - Complete it within one development iteration (i.e., sprint). Example: everything else you plan to ever fix.
Low - This is the won’t-fix status and you may want to close bugs with this priority level. If you do leave them open, everyone should have the understanding that they will not be fixed unless the opportunity arises to address them easily as part of another change, or if there is a major change in quality standards, resources, or business strategy.
The key thing to notice here is that there’s no priority level between “fix it within one sprint” and “won’t fix.”
There is no “gee, this could really bite us but maybe we can get away with putting it off for a few months” priority level, which is in line with the strategy of fixing all bugs.
Measuring results with a bug SLA
It’s one thing to talk about fixing all bugs, but it’s another to put the strategy into practice.
Reality is never absolute, nor should it be.
Processes are designed to handle the common case well, but there are always exceptions where processes don’t make sense. People need latitude to bend the rules sometimes. This provides the benefit of the process without imposing excessive rigidity.
Things also aren’t going to change overnight if you’re adopting a new process like fixing all bugs. Instead, you want to see consistent progress toward a goal.
A helpful metric for tracking bug-fix performance is an SLA (service level agreement). With an SLA, you define what portion of the time (e.g., 95%) you plan to meet the SLA target (e.g., fix a bug within 3 days).
You may choose different SLAs, but here is what we use for minware:
Critical and Urgent - 95% of bugs fixed within 24 hours
High - 95% of bugs fixed within 3 days
Medium - 95% of bugs fixed within 2 weeks
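As a rough illustration, the targets above can be written down as data and checked per bug. This is a minimal sketch, not a description of any particular tool: the names `SLA_TARGETS` and `met_sla` are hypothetical, and it assumes the time limits are measured in calendar time rather than business hours.

```python
from datetime import datetime, timedelta

# SLA targets from the list above:
# priority -> (time allowed, share of bugs expected to meet it).
# Assumption: limits are calendar time, not business hours.
SLA_TARGETS = {
    "Critical": (timedelta(hours=24), 0.95),
    "Urgent": (timedelta(hours=24), 0.95),
    "High": (timedelta(days=3), 0.95),
    "Medium": (timedelta(weeks=2), 0.95),
}


def met_sla(priority, created, resolved):
    """Return True if the bug was resolved within its priority's time limit.

    Priorities without a target (e.g., Low / won't fix) return None.
    """
    if priority not in SLA_TARGETS:
        return None
    limit, _goal = SLA_TARGETS[priority]
    return resolved - created <= limit


created = datetime(2024, 1, 1, 9, 0)
print(met_sla("High", created, datetime(2024, 1, 3, 9, 0)))  # True: 2 days is within 3
```

Keeping the goal ratio next to each time limit makes it easy to use a different percentage per priority level later if you decide one is needed.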
Calculating the bug SLA resolution (BSLAR) metric in a spreadsheet
Once you have established SLA levels, it’s time to start tracking your bug SLA resolution (BSLAR) metrics.
If you’re using Jira, you can export all of your bug issues, including the time each one was created and the time it was resolved.
Once you have this data, you can create a pivot table that shows the percentage of bugs for each priority level that met the SLA target (the bug SLA resolution metric) and compare this to your goal ratio.
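If you would rather script this than build a pivot table, the same calculation can be sketched in a few lines. The column names (`Priority`, `Created`, `Resolved`) and the ISO-style timestamp format are assumptions for illustration; a real export may use different headers and date formats. The function name `bslar_by_priority` is also hypothetical.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta

# Time limits per priority (assumed to be calendar time).
SLA_LIMITS = {
    "Critical": timedelta(hours=24),
    "Urgent": timedelta(hours=24),
    "High": timedelta(days=3),
    "Medium": timedelta(weeks=2),
}


def bslar_by_priority(csv_text, limits=SLA_LIMITS):
    """Return {priority: fraction of resolved bugs that met the SLA limit}.

    Simplification: unresolved bugs and priorities without a limit are
    skipped, so open bugs that are already overdue are not counted as misses.
    """
    met = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        priority = row["Priority"]
        if priority not in limits or not row["Resolved"]:
            continue
        created = datetime.fromisoformat(row["Created"])
        resolved = datetime.fromisoformat(row["Resolved"])
        total[priority] += 1
        if resolved - created <= limits[priority]:
            met[priority] += 1
    return {p: met[p] / total[p] for p in total}


sample = """Priority,Created,Resolved
High,2024-01-01 09:00:00,2024-01-02 09:00:00
High,2024-01-01 09:00:00,2024-01-06 09:00:00
Medium,2024-01-01 09:00:00,2024-01-10 09:00:00
Medium,2024-01-01 09:00:00,
Low,2024-01-01 09:00:00,2024-03-01 09:00:00
"""
for priority, ratio in bslar_by_priority(sample).items():
    print(priority, ratio, "meets goal" if ratio >= 0.95 else "misses goal")
```

Because unresolved bugs are skipped here, this sketch can flatter the numbers; counting open bugs that are already past their limit as misses would be a reasonable refinement.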
Another useful way to look at the data is to display percentiles, like median, 75%, 90%, 95%, 99%. This provides more detailed insight into how well you’re meeting the SLA and which actions you should take.
For example, if the 90th percentile is way under but the 95th misses your SLA, then you may want to focus on outliers that take the most time. On the other hand, if the median is really close to your SLA target, then perhaps there are broader issues like having too many bugs assigned to one person.
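The percentile view can be sketched as follows. This uses the nearest-rank definition of a percentile, which is one of several common definitions, so a spreadsheet's built-in percentile function may give slightly different values. The function names and sample numbers are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the values are less than or equal to it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def resolution_percentiles(hours, pcts=(50, 75, 90, 95, 99)):
    """Return {percentile: resolution time in hours} for the given bugs."""
    return {p: percentile(hours, p) for p in pcts}


# Illustrative data: 18 bugs fixed in 10 hours, plus two slow outliers.
hours_to_fix = [10] * 18 + [100, 200]
print(resolution_percentiles(hours_to_fix))
```

With a 72-hour target, this made-up data set is the outlier case described above: the 90th percentile is well under the target, while the 95th percentile misses it.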
Additionally, you may want to create a board showing open bugs with swimlanes for each priority level. Looking at this board regularly (e.g., during each stand-up) helps you stay on top of open bugs rather than waiting until they show up as SLA misses in the report.
Automating the bug SLA resolution (BSLAR) metric
Calculating bug SLA metrics from exported Jira data in a spreadsheet has some limitations, and of course takes time.
In particular, the set of fields is limited so you can only see when the bug is created or resolved.
However, you may want to use different criteria for the start and end of the “open” time window. For example, bugs may sit in a post mortem status prior to being officially resolved. Or, you might want to start the clock when an issue is escalated to a high priority, not when it is filed.
It also might look bad if there’s a regression and a high-priority bug is reopened weeks after originally being fixed, so you may want to reset the timer for reopened issues.
We’ve created a bug SLA report in minware to automatically compute bug SLA resolution metrics. To avoid inaccuracies, it measures each time window during which a bug was both open and set to a high priority, and it treats those windows separately even when they belong to the same bug.
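To make the idea of separate time windows concrete, here is a minimal sketch of how such windows could be derived from a bug's change history. It is not minware's implementation. It assumes you can obtain a chronologically sorted list of state changes as `(timestamp, is_open, priority)` tuples, and the name `sla_windows` is hypothetical.

```python
from datetime import datetime


def sla_windows(events, tracked_priorities, now):
    """Return (start, end) windows during which the bug was both open
    and set to one of the tracked priorities.

    events: chronologically sorted (timestamp, is_open, priority) tuples,
    one per state change. A window still in progress ends at `now`.
    """
    windows = []
    start = None
    for timestamp, is_open, priority in events:
        active = is_open and priority in tracked_priorities
        if active and start is None:
            start = timestamp  # clock starts (or restarts after a reopen)
        elif not active and start is not None:
            windows.append((start, timestamp))  # clock stops
            start = None
    if start is not None:
        windows.append((start, now))
    return windows


history = [
    (datetime(2024, 1, 1, 9), True, "Medium"),   # filed at Medium
    (datetime(2024, 1, 3, 9), True, "High"),     # escalated: clock starts
    (datetime(2024, 1, 4, 9), False, "High"),    # resolved: 24 hours
    (datetime(2024, 1, 20, 9), True, "High"),    # reopened: clock restarts
    (datetime(2024, 1, 20, 15), False, "High"),  # resolved again: 6 hours
]
for start, end in sla_windows(history, {"High"}, now=datetime(2024, 2, 1)):
    print(start, end, end - start)
```

Each window can then be checked against the SLA limit on its own, so a regression weeks later is measured from the reopen time rather than from the original filing date.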
Conclusion
It may seem crazy at first for a team with limited resources to fix all their bugs before working on new features.
I have done it and it can be painful at first, but now I wouldn’t work any other way. All the headaches I experienced balancing priorities and planning work with a bug backlog have just gone away, and sometimes it’s easy to forget what it was like before.
If you have these headaches too, I recommend giving bug-free development a try for a few months, even if only for new bugs – I’d love to hear how it goes.
100% agree. I'm sure I read it somewhere but the anecdote I've gone back to is "What's the fastest way to be done with dinner?" Well the fastest way is to pile up the dirty dishes in the sink. That's great until you need to make dinner again, then you're trying to find the skillet and wishing you had washed it properly. Bugs are exactly the same way. Treating them like dirty dishes piled up is not a good plan. Further, software is abstract so it's hard for the business to see the reality of buggy code, and hard for engineering leaders to make the case for fixing alongside the hypothetical dollars assigned to new feature X.