Software, like any complex product, will have defects. Plain and simple, there is no point in denying the facts. The question is how we handle them. First and foremost we should obviously strive to have as few as possible, but what I want to talk about here is what we do with those that we didn’t avoid.
Classifying bugs
Let’s take a step back first and see what a defect really is. From an end user’s point of view, anything that is present but doesn’t work as the user wants or expects is a problem. Some of those will be cleared up by an explanation, others by declaring them a change or new feature, and some will be proper problems with how the code behaves. The second and third options share a rather large grey area, and at times flipping a coin is as good a tool as any to decide, so I won’t go into that here. Suffice it to say that after some deliberation you’ll end up with problems that you deem related to unintended behaviour of the code itself. It’s those that I want to classify further into 3 categories:
1. Legacy bugs
Legacy bugs are the result of legacy code. Legacy code is all code that has been around long enough to not be considered new anymore (either developed in the project or inherited from previous projects). Generally I put a 1 year age limit on that. If functionality has been in the software for a year, I find that it is no longer relevant for gathering statistics. This irrelevance is generally down to team changes and the fact that I would hardly ever go back beyond 12 months when calculating metrics like velocity.
2. New bugs
As the name suggests, these are bugs that have been found recently, in code that is not legacy. A bug will remain in this ‘new’ state for some amount of time. I find that 2 to 3 iterations is a good working average (with 2 week iterations), but that is very context sensitive.
3. Deferred bugs
These are new bugs that have “expired”. So it’s essentially a new bug that didn’t get fixed while it was still considered new.
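To make the three classes concrete, here is a minimal sketch of the rules above. The names (classify_bug, BugClass) are made up for illustration, the 6 week window assumes 3 iterations of 2 weeks, and measuring the age of the code at the moment the bug is reported is my own reading of the 1 year limit. It only covers bugs that expire by age, not bugs the product owner defers explicitly.

```python
from datetime import date, timedelta
from enum import Enum


class BugClass(Enum):
    LEGACY = "legacy"
    NEW = "new"
    DEFERRED = "deferred"


# The 1 year limit from the text; tune it to your own context.
LEGACY_CODE_AGE = timedelta(days=365)
# Assumption: 3 iterations of 2 weeks each before a new bug "expires".
NEW_BUG_WINDOW = timedelta(weeks=6)


def classify_bug(code_shipped: date, bug_reported: date, today: date) -> BugClass:
    """Classify a bug by the age of the code and the age of the report."""
    # Code that was at least a year old when the bug surfaced counts as legacy.
    if bug_reported - code_shipped >= LEGACY_CODE_AGE:
        return BugClass.LEGACY
    # A non-legacy bug that has outlived its window counts as deferred.
    if today - bug_reported > NEW_BUG_WINDOW:
        return BugClass.DEFERRED
    return BugClass.NEW
```

For example, a bug reported ten days ago in code shipped five months before the report comes out as NEW, and the same bug three months later comes out as DEFERRED.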
During a sprint
In this model a sprint’s content is conceptually divided into 2 parts. The first part is items from the backlog. These could be stories, research tasks, legacy bugs or deferred bugs: basically anything that has a story point estimate on it and that the product owner has prioritized highly enough to fit within the available capacity for the sprint. The other part is a percentage of slack that is built into the system to absorb new bugs (and, depending on your situation, perhaps even critical support calls). I find that a good way to manage slack is by applying some elements of Kanban to it.
As bugs come in they enter an input queue, where the product owner can give them a priority (by ordering, by classes of service, or by a combination of both). Generally the first step is analyzing the bug, after which the PO can decide to either defer the bug or keep it in the flow for “immediate” fixing.
The other element in this story is that a bug that sits in the input lane for longer than a set time (2 to 3 sprints sounds good, depending on your sprint length) is automatically deferred. The logic is that if it were important enough it would have been fixed in that time, and you don’t want your system being dragged down by low value work. As you want bugs to get fixed rather than sit in limbo, I advise keeping WIP limits extremely low in this system.
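The flow described above (a prioritized input queue, a low WIP limit, automatic deferral) can be sketched roughly as follows. Everything here is hypothetical naming on my part (Bug, BugLane, end_sprint), and I assume that a lower priority number means more urgent and that a bug is deferred once it has waited through a fixed number of sprint boundaries.

```python
from dataclasses import dataclass, field


@dataclass
class Bug:
    title: str
    priority: int              # assumption: lower number = more urgent
    sprints_waiting: int = 0   # sprint boundaries spent in the input queue


@dataclass
class BugLane:
    wip_limit: int = 1              # keep this extremely low
    max_sprints_in_queue: int = 3   # the "set time" before automatic deferral
    input_queue: list = field(default_factory=list)
    in_progress: list = field(default_factory=list)
    deferred: list = field(default_factory=list)

    def add(self, bug: Bug) -> None:
        """A new bug enters the input queue, ordered by the PO's priority."""
        self.input_queue.append(bug)
        self.input_queue.sort(key=lambda b: b.priority)

    def pull(self):
        """Pull the top bug, or return None if the WIP limit is reached."""
        if len(self.in_progress) >= self.wip_limit or not self.input_queue:
            return None
        bug = self.input_queue.pop(0)
        self.in_progress.append(bug)
        return bug

    def finish(self, bug: Bug) -> None:
        """A fixed bug leaves the lane and frees up a WIP slot."""
        self.in_progress.remove(bug)

    def end_sprint(self) -> None:
        """Age the waiting bugs and defer the ones that waited too long."""
        still_waiting = []
        for bug in self.input_queue:
            bug.sprints_waiting += 1
            if bug.sprints_waiting >= self.max_sprints_in_queue:
                self.deferred.append(bug)   # goes to the backlog for estimation
            else:
                still_waiting.append(bug)
        self.input_queue = still_waiting
```

Bugs that end up in deferred are the ones that get estimated and prioritized on the regular backlog; bugs already in progress are never deferred by the timer in this sketch.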
So, summing it up, you could say that the team actually pulls work from 2 systems. On the one hand there is the Scrum process, which runs as usual, but within a designated section of the sprint there is actually a Kanban system working behind the scenes.
Getting on with planning
From a planning point of view the Scrum rules remain. All work on the backlog needs to be estimated and prioritized, so legacy and deferred bugs are estimated just the same as other items on the backlog. They are just another input, no different from customer requirements in the way they are handled. During sprint planning they are scheduled as a normal part of the sprint. So don’t go scheduling them in the slack, as that is specifically designed to handle new bugs.
Estimating bugs is generally not easy, so allowing the team to analyze enough to be able to estimate is a good idea. For legacy bugs and bugs that were deferred because they weren’t picked up, this falls under the normal backlog grooming activities. Bugs that were explicitly deferred after analysis shouldn’t need much more research. The idea is to get just enough info so the bug can be estimated, not to go off and fix it right then and there.
Another point is that bugs vary much more in size and complexity than a typical story, so particularly for very small bugs it can be good to just bunch up a few and call it 1 story point rather than trying to estimate each one. This will keep your velocity a lot more representative of the work that is actually done.
The amount of slack that you build in is a trade-off that only the PO can decide on, on a per sprint basis, as it clearly represents an amount of work that the team has to do. It will always come down to a choice between making forward progress (i.e. doing ‘new’ stuff from the backlog) and keeping the amount of technical debt in check. Should the PO decide to not have any slack at all, it stands to reason that all bugs will automatically be deferred.
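As a small illustration of that trade-off, the split between backlog work and slack is simple arithmetic. This is only a sketch: split_capacity is a made-up name, and expressing slack in story points (as a fraction of velocity) is an assumption; you could just as well reserve hours or people.

```python
def split_capacity(velocity: float, slack_fraction: float) -> tuple[float, float]:
    """Return (backlog_points, slack_points) for one sprint.

    velocity       -- story points the team usually completes in a sprint
    slack_fraction -- share of the sprint the PO reserves for new bugs (0.0 to 1.0)
    """
    if not 0.0 <= slack_fraction <= 1.0:
        raise ValueError("slack_fraction must be between 0 and 1")
    slack_points = velocity * slack_fraction
    return velocity - slack_points, slack_points
```

With a velocity of 40 and a quarter of the sprint reserved, 30 points are planned from the backlog and 10 stay free for new bugs. A slack fraction of 0 leaves nothing for new bugs, which is the case where every incoming bug is deferred.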
Another word on the classification
There is one last element missing in the above classification, which is bug recurrence. Understanding what kinds of bugs you get and which ones return (or in which areas of the code if you take it a bit broader) is very helpful in driving quality focus but also in planning.
A simplified example: suppose you measure that about 5% of bugs return after each round of QA and/or user acceptance. That means that if you have a backlog of 100 bugs you need 2 passes. There will be 5 left after the first pass and 0.25 after the second; with less than one bug expected to remain, the product can be deployed. On the other hand, if your rate were 30% you’d be looking at 4 rounds (30, 9, 2.7 and finally 0.81 bugs left). Understanding this pattern is a big help in avoiding those bug squashing weekends right before a release.
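The arithmetic in that example is easy to turn into a small planning helper. A minimal sketch, assuming a constant recurrence rate per pass and assuming that "done" means fewer than one bug is expected to remain (the function name passes_needed is mine):

```python
def passes_needed(open_bugs: int, recurrence_rate: float) -> int:
    """Count the QA passes until fewer than one bug is expected to remain.

    Each pass fixes everything, but a fraction `recurrence_rate` of the
    bugs comes back and has to go through the next pass.
    """
    if open_bugs < 0:
        raise ValueError("open_bugs cannot be negative")
    if not 0.0 <= recurrence_rate < 1.0:
        raise ValueError("recurrence_rate must be at least 0 and below 1")
    remaining = float(open_bugs)
    passes = 0
    while remaining >= 1.0:
        remaining *= recurrence_rate   # only the returning bugs are left
        passes += 1
    return passes


print(passes_needed(100, 0.05))  # 2
print(passes_needed(100, 0.30))  # 4
```

In reality the rate will differ per area of the code and per round, so treat the result as a rough planning number rather than a prediction.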
The last word is for the user of your software
From the user’s point of view it doesn’t really matter what type of bug you have or how you prioritized it; all that matters is that their problem is solved. That is a very important thing to keep in mind, as in my opinion it is the primary measure for defect handling. How many defects are you getting, how many are recurring, and particularly how fast do you get the solution deployed to your users? People who forget to measure that last one (actual deployment, not some earlier intermediate state) are missing the main point of handling defects.