Thoughts on Bugs - Thoughts on Software Development

This is part of our discussion about waste. Previously, we talked about feature waste.

Now, let’s talk about another kind of waste right there at the top of my list: product defects.

Defects Are Wasteful

They are so obviously wasteful. They drain a massive amount of time to understand and investigate.

During investigation, they require the engineer to drop and delay everything else they are working on, which directly delays or decreases the value the customers will receive.

Also, the most productive engineers in the organization will be exactly the ones dragged into the investigation and fix, and difficult bugs often need several of them at once.

Then you have downstream effects in the deal pipeline, with customers delaying their buying decisions, or buying a competitor instead.

Finally, depending on how widespread the issues are, they damage the image of your company, sometimes irreversibly. Those Reddit posts complaining about your product won’t erase themselves.

Defects Are Inevitable

Yet, bugs are impossible to avoid.

Defects normally happen due to unexpected behaviors that the engineer did not take into consideration when initially building the product.

It is very common, for example, that a bug will happen at the interface between modules inside the product, or between the product and its external integrations, or between two modules that share the same resource pool. These are places that can constitute “blind spots” for a developer.

Bugs will then show up whenever the engineer who built the code did not have a correct or complete understanding of how the product would be used, or did not know how different modules of the product would interact.

Since the defect is born from missing information at the time the code was produced, the same information is often also missing at the time the code is tested. As a result, tests often miss the most pernicious bugs.

There are several techniques to reduce the probability of a bug escaping into a high impact environment. We can go over some of them in subsequent posts.

The probability is never zero, however. A complex code base will have very complex internal and external interactions. The most stringent test strategy will still allow some defects to leak.

Defects Are Necessary

When used properly, defects are incredible learning tools.

A bug will not only uncover a specific gap in the understanding of the engineer who wrote the code in the first place. It will likely also reveal a series of missing or incorrect control processes in your software production pipeline.

It will expose missing tests, incorrect code review procedures and guidelines, incorrect product design and architecture, and even incorrect management practices.

With the right feedback loops, a defect can trigger the improvement of control processes, which will prevent not only this one specific defect from happening again, but an entire class of defects.

This is essential for a company to quickly sharpen its tools, especially if it is still growing and evolving its product.

Dealing With the Paradox

Defects are very wasteful, yet they are necessary.

What is the right balance that allows your engineering organization to continue to learn without either getting buried under defect waste or getting stifled by an overly cautious development pipeline?

Since there is no such thing as defect eradication, we will start by embracing the fact that bugs will always exist. We will learn to co-exist with them, and, most importantly, we will learn from them.

In our industry, this is best done using blameless postmortem processes that can loop back into every process of the organization, ideally even beyond engineering itself.

(If you had to support a product that was not ready to be widely marketed, you know what I mean.)

This must be something sponsored by the very top of the organization, and it should be embraced bottom-up as a central part of the culture.

In my experience, postmortem meetings are much more effective in aligning the team culture than all hands or team outings. People really get engaged when they have a real problem in their hands and are given the opportunity to speak up.

Next, since we cannot control whether a bug will happen, the second best thing we can do is to control where and when it will happen.

If we (correctly) assume that every single software release will carry hidden bugs, despite all of our efforts to stop them, then it is natural to increase our control over who will see the new software, when, and under what expectations.

We talked about this when discussing feature waste:

the perceived quality of your product is the number of issues it has times the probability that someone will see them

For example, I am sure that the majority of your customers don’t really need that new shiny feature immediately. Every single customer will have a different balance of value generated by the new code vs. the negative impact of a defect.
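
As a rough illustration of the relationship quoted above, here is a minimal Python sketch. The function name and the numbers are invented for the example; the only thing taken from the text is the idea of multiplying the number of issues by the probability that someone sees them.

```python
def perceived_issue_load(issue_count: int, exposure_probability: float) -> float:
    """Number of issues times the probability that someone will see them.

    Hypothetical helper: a lower result means higher perceived quality.
    """
    if issue_count < 0:
        raise ValueError("issue_count must be non-negative")
    if not 0.0 <= exposure_probability <= 1.0:
        raise ValueError("exposure_probability must be between 0 and 1")
    return issue_count * exposure_probability


# Same release, same 20 hidden issues, two different audiences (made-up numbers).
print(perceived_issue_load(20, 0.5))    # everyone gets the new code: 10.0
print(perceived_issue_load(20, 0.125))  # only an early-adopter group: 2.5
```

The code did not get any better between the two calls. Only the exposure changed, and that is the lever a rollout strategy gives you.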

Also, not all defects are born equal. Some defects are very hard to recover from, such as persisted corruptions. These will require not only the code to be patched, but all of the persisted data to be fixed as well.

Given this very nuanced benefit vs. risk balance, the impact of a defect will depend on a number of dimensions. In most cases, these are dimensions you can use to partition your rollout strategy, so that you can decide when a specific piece of code will land in a specific location.
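
To make the idea of partitioning a rollout more concrete, here is a small sketch. The dimensions it uses (whether the customer wants the feature, how much risk they tolerate, whether the change touches persisted data) and the wave arithmetic are assumptions chosen for illustration, not a recommended policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    name: str
    wants_feature: bool   # gets direct value from the new code
    risk_tolerance: int   # 0 = very conservative, 2 = happy to run early builds


@dataclass(frozen=True)
class Change:
    name: str
    touches_persisted_data: bool  # defects here are much harder to recover from


def rollout_wave(customer: Customer, change: Change) -> int:
    """Return the wave (0 = earliest) in which this customer receives the change."""
    wave = 2 - customer.risk_tolerance  # tolerant customers go first
    if not customer.wants_feature:
        wave += 1  # little benefit for them, so no reason to go early
    if change.touches_persisted_data and customer.risk_tolerance < 2:
        wave += 1  # hard-to-recover risk: hold back everyone but the most tolerant
    return wave


early_adopter = Customer("early-adopter", wants_feature=True, risk_tolerance=2)
conservative = Customer("conservative", wants_feature=False, risk_tolerance=0)
migration = Change("new-storage-format", touches_persisted_data=True)

print(rollout_wave(early_adopter, migration))  # 0
print(rollout_wave(conservative, migration))   # 4
```

Each customer ends up with their own landing time for the same piece of code, based on their own balance of benefit and risk.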

Handling the Inevitable

If we assume bugs are inevitable, it follows that we should develop not only mechanisms to prevent bugs from happening and from being seen, but also mechanisms to quickly recover when they inevitably happen.

There are numerous mechanisms we can use to quickly recover, such as feature flags, configurations, rollbacks and other quick patch releases.
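
As one example, a feature flag can act as a kill switch for new code. The sketch below uses a hypothetical in-memory `FeatureFlags` class and made-up function names; a real system would more likely read its flags from configuration or a flag service.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""

    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code never runs by accident.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def total_legacy(prices):
    # The code path that has been running in production for a while.
    return ("legacy", sum(prices))


def total_new(prices):
    # The freshly released code path, which may carry hidden defects.
    return ("new", sum(prices))


def checkout_total(prices, flags: FeatureFlags):
    """Route to the new code only while its flag is on."""
    if flags.is_enabled("new_checkout_total"):
        return total_new(prices)
    return total_legacy(prices)


flags = FeatureFlags({"new_checkout_total": True})
print(checkout_total([10, 5], flags))   # ('new', 15)

# A defect is reported: turn the flag off and traffic returns to the old path.
flags.set("new_checkout_total", False)
print(checkout_total([10, 5], flags))   # ('legacy', 15)
```

The recovery step here is a configuration change, not a new build, which is what makes it quick.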

In this context, persisted changes are especially concerning, since they might cause a bug, such as a corruption, to become persisted into the solution. This can cause a long recovery process, or even permanent loss of data.

In order to quickly recover from them we often need to take advantage of backward compatibility of the persisted data structure. Depending on the risk, it might even make sense to store the data twice for some period of time, once using the previous code, and once using the new code.
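
The idea of storing the data twice can be sketched as a dual-write store. Everything below is hypothetical: the two formats, the class name and the in-memory dictionaries stand in for whatever the previous and new code actually persist.

```python
import json


def encode_v1(record: dict) -> str:
    # Previous format: "id|name" (assumed for the example).
    return f"{record['id']}|{record['name']}"


def decode_v1(raw: str) -> dict:
    record_id, name = raw.split("|", 1)
    return {"id": int(record_id), "name": name}


def encode_v2(record: dict) -> str:
    # New format: JSON (assumed for the example).
    return json.dumps(record, sort_keys=True)


def decode_v2(raw: str) -> dict:
    return json.loads(raw)


class DualWriteStore:
    """Writes every record in both formats during the transition period."""

    def __init__(self):
        self.v1 = {}  # data as the previous code would store it
        self.v2 = {}  # data as the new code stores it

    def save(self, record: dict) -> None:
        self.v1[record["id"]] = encode_v1(record)
        self.v2[record["id"]] = encode_v2(record)

    def load(self, record_id: int, use_new_format: bool = True) -> dict:
        if use_new_format:
            return decode_v2(self.v2[record_id])
        # Fallback path: the old copy is untouched by any bug in the new encoder.
        return decode_v1(self.v1[record_id])


store = DualWriteStore()
store.save({"id": 7, "name": "Ada"})
print(store.load(7))                        # {'id': 7, 'name': 'Ada'}
print(store.load(7, use_new_format=False))  # {'id': 7, 'name': 'Ada'}
```

If the new format turns out to corrupt data, the old copy is still there to read from while the fix is prepared. The cost is extra storage and write time, which is why this usually only makes sense for a limited period.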

Also, the software architecture can be designed in a way that contains the impact of specific failures within specific modules. A correct compartmentalization of the software allows, for example, micro-deployments and micro-rollbacks of one specific module, which gives you control over the impact, flexibility and speed of releases.
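
A very small sketch of containment at the code level follows. The `contained` decorator, the module names and the fallback value are all invented for the example, and real compartmentalization usually also involves process, service or deployment boundaries that a decorator cannot show.

```python
import functools
import logging

logger = logging.getLogger("containment")


def contained(fallback):
    """Keep a failure inside one module by returning a fallback value instead."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Record the failure for the postmortem, but do not propagate it.
                logger.exception("module %s failed; using fallback", func.__name__)
                return fallback

        return wrapper

    return decorator


@contained(fallback=[])
def recommendations(user_id):
    # Hypothetical optional module with a simulated defect.
    raise RuntimeError("simulated defect in the recommendations module")


def render_home_page(user_id):
    # The core page still renders even though one module is broken.
    return {"user": user_id, "recommendations": recommendations(user_id)}


print(render_home_page(42))  # {'user': 42, 'recommendations': []}
```

The defect is still there and still logged, but the customer sees a page with an empty section instead of an error.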

Transparency and Expectations

Finally, the way the product is released, marketed and presented can make a huge difference in what customers expect whenever they receive a new piece of code.

There should be transparency in the benefit vs. risk tradeoff across departments of a company. While the benefits of a release are very easy to gauge and communicate across product, marketing and sales, risks are often hidden within engineering and support organizations.

From the point of view of a salesperson, good engineers should be somewhat depressing to talk to. This means that they are treating you like an adult. They should be able to properly understand and communicate the risks of a piece of code that is being deployed.

Moreover, the correct expectations also need to be set with customers. Unstable or experimental features should be communicated as such. The product itself should be built in a way that makes that status clear.

Keys

If you want to increase the learning rate of your company, achieve higher quality, agility and higher customer value, try:

Set up a proper blameless postmortem process that can map defects to learnings;
Shape the software development control processes, the product architecture and management practices based on the postmortem feedback loops;
Focus less on specific failures, and more on the class of similar failures produced by the flawed development pipeline;
Instead of only controlling whether a bug will happen, control when, where and under what expectations it will happen;
Focus less on trying to zero out bugs, and more on the contingency plans for when they inevitably happen;
Communicate risks correctly across product, marketing, sales and customers;
Treat reducing the impact of defects as being as important as reducing their number.