Bugs at Zoosk from the VIP point of view!

Consider the following three simple facts.

  1. Continuous Improvement is good. People love it. It’s simply build, fail, and learn. And in order to learn faster, you have to build faster. But you might also fail faster—even faster than expected.
  2. Everybody is born with a passion for learning, and also a level of perfectionism. They will carry these things with them throughout their lives. They will want to prevent bad things from happening, as much as possible. And bugs are bad. The existence of bugs means things have gone wrong.
  3. Life is short.

As a developer who deals with these facts, you can make a mistake and tell everyone that you are learning. (You are learning so fast and it’s only natural to make a few mistakes!) But when you’re honest with yourself, you can see you are surrounded by bugs. So many bugs! So many nasty bugs that you think even taking a shower three times a day is not enough. So many bugs that it feels like you’ll never have enough time to fix them all.

You will never reach your destination if you stop and throw stones at every dog that barks.

– Winston S. Churchill

Bug #81550 at Zoosk

At Zoosk, like many tech companies, we have bugs. Many that we have to live with. Sometimes we live with them to the extent that we’re cohabitating and we even know them by their Bug ID.

The other day I was chatting with Jay Kremer, Zoosk’s Director of Quality Assurance, in the parking lot behind the office when I asked him, “Why does every new QA tester get so excited when they find Bug #81550 in production?” Bug #81550 is a bug that’s well-known at Zoosk. We’ve had it for a while and have yet to fix it.

It wasn’t a surprise that in his reply Jay didn’t ask me, “Why doesn’t your team fix it every time we file a ticket for Bug #81550?” It wasn’t a surprise because I know that Jay is more familiar with Zoosk’s principles for addressing bugs than anyone. Finding Bug #81550 might be a good performance indicator for people on his team. I’m not sure. But what I am sure about is that his team finds the bugs and the engineering team addresses them with help from our VIP guidelines.

VIP guidelines and how we paint our Bug

Life is short. So we should prioritize everything, including bugs. There are prioritizing models similar to P1 – Blocker, P2 – Critical, P3 – Major, and P4 – Minor. But should we always fix every blocker before sending something out? Should we sort the rest of the bugs based on their priority order and then go through them linearly? What if a bug isn’t that severe and is only happening during an extreme edge case, or during an imaginary scenario? What if it’s a Heisenbug and it takes our whole engineering team just to study its behavior?

At Zoosk, our bugs live in a three-dimensional world. We take three dimensions into account when we look into a bug and classify it: Volume, Impact, and Price. That’s VIP. Volume is how frequently the bug happens in production. Impact is how much the user’s experience is affected. Price is the cost of eliminating the issue.

The destiny of a bug at Zoosk is determined by how high or low it is in each of these three dimensions.
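
Rating each dimension as simply high or low gives eight classes, numbered by reading the three ratings as a binary number with Volume as the top bit. Here is a minimal Python sketch of that numbering. The `Bug` type and the `vip_class` function are hypothetical names used for illustration, not Zoosk’s actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bug:
    """A bug rated high (True) or low (False) on each VIP dimension."""
    high_volume: bool
    high_impact: bool
    high_price: bool


def vip_class(bug: Bug) -> int:
    """Pack the three ratings into a 3-bit class number (Volume is the top bit)."""
    return (int(bug.high_volume) << 2) | (int(bug.high_impact) << 1) | int(bug.high_price)


# High Volume, High Impact, Low Price reads as binary 110, which is Class 6.
easy_win = Bug(high_volume=True, high_impact=True, high_price=False)
print(vip_class(easy_win), format(vip_class(easy_win), "03b"))
```

Deciding where "high" starts on each dimension is a team judgment call that this sketch does not attempt to model.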

Class 7 = (111)₂ – High Volume, High Impact, High Price:
Whitelist, Duct Tape, or Compromise

Since there is no book entitled How to Fight with a T-Rex in JavaScript, we have to get creative here. Our first effort is usually to pick one of the dimensions and lower it so that the bug falls into one of the other classes.

  1. To lower the volume, the feature can be rolled back to a limited number of users (whitelisted) while we keep investigating.
  2. To lower the impact, one option is to use the nastiest duct tape available for most programming languages: Pokémon Exception Handling (aka the Diaper Pattern).
  3. To lower the price and cost of the fix, we can get together as a group to come up with alternatives. This can be really helpful, but people have to be ready to compromise on something. Engineers might need to compromise on the quality of the code, show some flexibility, and let the anti-patterns flow in. Product Managers (PMs), on the other hand, might need to simplify some aspects of the product or feature so that it’s less complicated. Think about it: if you have a super complicated login splash screen that causes your app to crash on the majority of devices and no one knows exactly how it works, then it might be best to simplify it! Keep in mind, teamwork plus simplicity always wins.
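
Pokémon Exception Handling, mentioned in option 2 above, means catching every exception ("gotta catch ’em all") so that the user sees a fallback instead of a crash. The Python sketch below shows the shape of that duct tape. The splash-screen functions and the `"legacy"` device name are made up for illustration.

```python
import logging

logger = logging.getLogger("duct_tape")


def render_splash_screen(device: str) -> str:
    """Hypothetical flaky feature that fails on some devices."""
    if device == "legacy":
        raise RuntimeError("splash animation not supported")
    return "fancy splash"


def render_with_duct_tape(device: str) -> str:
    """Catch everything so the user gets a plain fallback instead of a crash."""
    try:
        return render_splash_screen(device)
    except Exception:  # deliberately broad: this is the anti-pattern itself
        # Log the full traceback so the hidden failure can still be investigated.
        logger.exception("splash screen failed on %s; using fallback", device)
        return "plain splash"
```

This only lowers the impact. The underlying bug is still there, so the logging call is what keeps it visible while the real fix is worked out.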

Class 6 = (110)₂ – High Volume, High Impact, Low Price:
Keep Calm and Just Fix It

These are the best bugs in the world. They have all the traits that a bug needs to die fast. They are high volume, so they will get immediate visibility; they have a high impact, so everyone agrees they should be addressed immediately; and they have a low price, so you don’t need to hire a professional killer to shut them down.

Keep in mind that because these bugs are high in the first two dimensions, they have a high potential to bring up the third dimension as well. Junior developers may start ordering their Zombie Apocalypse Preparedness Kits. Other departments may start judging the quality of the deliverables, and consequently people start blaming each other and shifting blame. Managers might think the bug needs three superheroes working on it at the same time and, ultimately, it may freak out all the decision makers in the hierarchy. Don’t let that happen. Don’t let the first two highs fool you. Keep Calm and Just Fix It!

One good practice that we have at Zoosk is that we hold a Continuous Improvement (aka blameless retrospective) meeting after any high volume and/or high impact bug. This meeting usually involves the developers, managers, and QA team members who worked on the bug, along with veteran engineers. The goal of the meeting is first to review what we did to identify, scope, and fix the bug (it’s joyful to review our great accomplishments). Then we look at how we could do better to prevent the issue from happening at all. Outcomes of these meetings are sent to the team members and engineering leads. Also, we document all valuable and common findings of these CI meetings in an internal Best Practices wiki that we not only share with our new hires, but also use to create checklists for future processes.

Class 5 = (101)₂ – High Volume, Low Impact, High Price:
Classify the Noise

This reminds me of living in downtown San Francisco, where our office is: It’s noisy, but you get used to it. Volume is high, and installing noise barriers everywhere is expensive. However, the impact (or to be precise, immediate impact) is low.

Living with these bugs is quite feasible. Some people even develop life skills based on this mindset, such as remaining focused in a busy environment, or accepting the small imperfections in life and focusing on what’s really important. However, the caveat here is that when you get in the habit of tuning out some unimportant noise you may prevent yourself from hearing important noise too. That’s why this class of bug is the biggest predator of Class 2 and Class 3 bugs in the ecosystem. So you should put these loud wolves on an isolated island and save the rabbits!

As an example, let’s say we have an average of 10 errors per minute from random sources. Since Zoosk is a global service, the traffic and volume of all our interactions go up and down during the day by about 20%. So, as long as the error rate is between eight and twelve per minute, we are good. However, when a new noisy low-impact error shows up with a frequency of 100 instances per minute, our expected range becomes 88 to 132. (That is 100 ± 20% for the new error in addition to 10 ± 20% for the existing ones.) That makes it very difficult for us to detect a new error (especially a low volume, high impact error) that shows up at a rate of seven per hour.
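
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the numbers quoted above. The `expected_band` helper is a hypothetical name, and the ±20% swing is the figure from the paragraph.

```python
def expected_band(rate_per_minute: float, swing_percent: int = 20) -> tuple[float, float]:
    """Return the (low, high) error rate expected from daily traffic swings."""
    low = rate_per_minute * (100 - swing_percent) / 100
    high = rate_per_minute * (100 + swing_percent) / 100
    return low, high


quiet_low, quiet_high = expected_band(10)        # background errors only
noisy_low, noisy_high = expected_band(10 + 100)  # background plus the noisy error

# A rare, high-impact error arriving seven times per hour.
rare_per_minute = 7 / 60

print(f"quiet band: {quiet_low:g} to {quiet_high:g} errors/minute")
print(f"noisy band: {noisy_low:g} to {noisy_high:g} errors/minute")
print(f"rare error: {rare_per_minute:.3f} errors/minute")
```

The rare error adds roughly 0.12 errors per minute. Against the quiet band, which is 4 errors per minute wide, that is already a small signal. Against the noisy band, which is 44 wide, it all but disappears.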

The way we handle these cases at Zoosk is to classify the errors. Most error tracking systems let you label known issues and separate them from what you watch every hour in your dashboards. We also have a quick and easy way to classify our errors ourselves: we catch the exception/error in the narrowest possible way and re-throw a new clone of that error object, piggybacked with a unique identifier for it. With this information, we are able to reclassify these bugs and deal with them appropriately.
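
The catch-narrowly-and-re-throw idea can be sketched in Python as follows. Note the assumptions: this version wraps the original error in a new exception type rather than cloning it, and `ClassifiedError`, `load_profile_photo`, and the `KNOWN-NOISE-1` identifier are all invented for illustration.

```python
class ClassifiedError(Exception):
    """An error re-raised with an identifier that dashboards can filter on."""

    def __init__(self, bug_id: str, original: BaseException):
        super().__init__(f"[{bug_id}] {original}")
        self.bug_id = bug_id
        self.original = original


def load_profile_photo(user_id: int) -> bytes:
    """Hypothetical call that is known to fail noisily."""
    raise TimeoutError("thumbnail service timed out")


def load_photo_classified(user_id: int) -> bytes:
    try:
        return load_profile_photo(user_id)
    except TimeoutError as exc:  # the narrowest exception type expected here
        # Re-raise with the identifier attached, keeping the original as the cause.
        raise ClassifiedError("KNOWN-NOISE-1", exc) from exc
```

Catching only `TimeoutError` matters here. Any other failure in the same call passes through unlabeled, so a new and possibly important error is not filed under the known noise.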

Class 4 = (100)₂ – High Volume, Low Impact, Low Price:
Let the Ninja Dance

Class 4 bugs aren’t as important as Class 6 (High Volume, High Impact, Low Price) bugs and aren’t as expensive as Class 5 (High Volume, Low Impact, High Price) bugs. Because of this, they can wait for a few days.

Since the price is low, chances are the effort it takes to classify the bug would be even more than what it takes to fix it. That’s why we usually schedule these bugs to be fixed in the next iteration of the project or feature. 

Also, you may find out that there is an opportunity here. If you have some ninja coders, this is the time to challenge them. You can assess the expertise they have built since joining your organization and see if they can quickly fix the bug. You may even want to make it a challenge for the team and inspire some ninja-wannabes! See who can shut this down in less than an hour!

Class 3 = (011)₂ – Low Volume, High Impact, High Price:
Loch Ness Monster Exists

It’s a Loch Ness Monster Bug or Bugfoot! Believe it or not, these bugs do exist. An example of one such bug we had at Zoosk was Bug #90870: the tutorials that we show during the initial sessions of our features were showing up twice. And guess who filed it? Our co-founder/CEO! After putting a lot of effort into investigating it, our developers couldn’t reproduce the bug and thought it may have been due to the fact that our founder’s account was super old (it was created in 2008 when Zoosk was founded). They thought it may have had some weird conditions and settings that were not applicable to our new users. OK, fair enough. Not to mention that our server-side logging showed that the bug happened only once every 10,000 instances.

But then the monster showed us its next gimmick! It happened to a new CEO who had a new account registered in December 2014. It happened again! Yes, Loch Ness Monster Bugs can be dancing cats, and sometimes only show up for the C-level people of a company!

Our practice for such monsters is to install a radar system with lasers and infrared all over our Loch Ness area, so the next time it shows up again all of our developers will jump to take a selfie with it!

Side Note: You’re going to have a bad day if your Loch Ness Monster Bug learns how to become a Heisenbug!

Class 2 = (010)₂ – Low Volume, High Impact, Low Price:
Snacks

Snacks.

They can wait because they are low volume. They are challenging and have enough of an impact that they give a high level of satisfaction back to the developers when they are eliminated. And they are low price so developers are not going to get stuck on them for a long time. That’s why we call them snacks.

We assign a couple of snacks to everyone every now and again whenever we have a gap between our projects. It works great. Shuffling the Class 2 bugs among developers is especially useful because it helps us spread domain knowledge and improve expertise. For example, if Alice has completed and shipped project A and Bob has completed and shipped project B, it’s a great practice to let Bob fix Class 2 bugs for project A, and vice versa.

Class 1 = (001)₂ – Low Volume, Low Impact, High Price:
Higgs-bugsons

Don’t let your most awesome developers know you have a Higgs-bugson! Otherwise, both the developer and the bug will be half-faded for a long, long time.

Supporting fancy UIs on legacy devices/browsers usually falls in this class. (Didn’t we say life is short?) Won’t fix is our reason for closing these bugs. We also add a note to let the bug opener know that sprinting with 98% quality is better than walking slowly with 98.002% quality. That’s a numerical version of “Done is better than perfect.”

Yogi Berra also put this in a great way: “If the world were perfect, it wouldn’t be.” And that’s all you need to tell people when explaining why it’s okay for some bugs to be immortal in your product. Chances are that over time they will turn into features that users accept, or even bloombugs!

Side Note: These bugs are dangerous. If a passionate and skilled developer finds one of them, they will cross their arms, upload a challenge-accepted meme to the ticket, and get to work. Chances are you will find that same developer staying late and wrestling with the bug for the next couple of sprints, which could affect morale if the person can’t fix it. Don’t let that happen! Ever.

Class 0 = (000)₂ – Low Volume, Low Impact, Low Price:
Dumbbells

Fixing a minor punctuation typo in one of the less frequently used languages, out of the 26 that we support, is a good example of this kind of bug.

Should we respond Won’t fix again? Nope! Why should we spoil the best way to train new hires?! We label these bugs as “#newhire” and use them as 5 lb dumbbells to help our future great developers learn the product and how we work. They need to start from 5 lb and work hard, so eventually they can lift heavy bugs on the half a million lines of JavaScript and HTML in each of our Single Page Applications!

If Knowing the Bug Is 50% of the Fix, Identifying Its Class Is 90% of the Other Half!

Panic, stress, embarrassment, fear, and anger are common symptoms of realizing a bug exists in your system. Take a cold shower to cool off and then make the right decision. You can refer to the following cheat-sheet to remember what we do at Zoosk.

Final Cheat-sheet

Last but not least, keep in mind that a bug is not a problem; it’s an opportunity for you to improve your development process! Just like any other incident in life, dwelling on why it happened is not going to solve it. Thinking about the best next move, and then taking those moves one after another, is what ultimately helps you conquer the bug!

Credits

Thanks to the direct Product Manager of my team, Anton Chakhmatov, who inspired me to classify our bugs in this 3D format! Thanks also to Megan Murray, Jay Kremer, Brian Backhaus, and David Harnois for their feedback.