How to solve a hard bug

Practicing good bug solving habits

Disclaimer: I am a backend dev for the most part, and I know I’ve written this very much from a backend dev perspective. I hope the concepts are generally applicable.

I’ve noticed that I tend to solve bugs in an upside down manner or get endlessly lost in a bug fixing hell. In the end, I don’t solve them as fast as some of my code heroes. I want to be more like them. After working with them for a significant amount of time, I noticed the main thing I did differently to them: They had a disciplined strategy and I didn’t. I don’t have good bug solving habits and it’s time to build them in.

This is an attempt to burn good bug fixing habits into my brain, and perhaps to help those who also solve bugs upside down like I do. I’d also like to learn even better habits, so if you have more advice or criticism, please comment away. I’d appreciate it!

https://xkcd.com/1700/

My Old “Strategy”

Let’s not call it a strategy. When I encounter a bug, I usually go off on a wild goose chase through the data and logs, investigate a piece of code on a hunch, stare at some other pieces of code, try to fix something, try to replicate it in the app, write a (probably irrelevant) test, and then try to run it or push the “fix”. For the next bug, I do something similar but in an equally arbitrary order. I run around like a headless chicken until the bug is fixed, rarely with a plan of action and with loads of wasted effort.

It’s time to build new, good bug-fixing habits.

Good Bug Solving Habits

1. Love Your Bugs

When you are assigned or notice a bug (or are working on a very bug-fixy sort of sprint), it’s important to put your debugging hat on. Get excited about the bug and the wisdom it hides. As Allison Kaptur says, “Love your bugs” (I recommend the read). Don’t try to rush through it, don’t try to quick-fix it, don’t assume you know what the problem is. Slow down. Breathe. Smile. Start from the beginning. Revel in your discoveries.

I love bugs because they’re entertaining. They’re dramatic. The investigation of a great bug can be full of twists and turns. A great bug is like a good joke or a riddle — you’re expecting one outcome, but the result veers off in another direction.
Allison Kaptur ~ “Love your bugs”

2. Write a Failing Test

Write a test that tries to reproduce the state where the bug occurred. Even if you aren’t an avid TDDer, do this one thing. If you can’t reproduce it reliably, you can’t know if you’ve fixed it.

That said, writing a failing test isn’t always easy for a complex bug. Let’s say you write a test based on the following scenario from a user/tester: “When I edit my name, and save, I get a 500 error”. Perhaps you write a test that checks that the error does not occur and the name is saved appropriately. You run said test, and the test passes! Now what?

Well, that’s a sign that you aren’t replicating your state properly — it’s a sign more investigation needs to happen as you do not understand the extent of that bug. So, you put on your investigation hat and go look deeper: you read the error message in the logs, or you go to the user who caused this and ask them what they set their name to.

You discover the user tried to save a 100-character name containing Unicode characters. So you update your test to more accurately reflect what the user did.

Aha! Test is successfully failing! Perhaps you have succeeded. From there, a bug like that would be an easy thing to fix. But what I want to highlight is THAT process:

  1. Write/update the test to replicate the problem with the given information
  2. If you can’t replicate the problem, investigate further for key information about the state and actions, and add that to the test
  3. Repeat until the test successfully fails

What happens to me is I sometimes end up in a tizz, trying to figure out what I am missing in replicating my bugs. So I’ve come up with a sort of order that seems to work.

  1. Read the bug report (or talk to the tester/user). Do they mention details? Is the time logged? What details can they give you that will help you search the logs (user IDs, etc.)?
  2. Read the logs. If an exception was thrown, it’ll be there. If you have good logging, you may be able to look for very specific information about the scenario (perhaps the time, the specific actions, or the specific users).
  3. Read the data. Go hit up the database with some queries in order to understand the state that the system is in. The what, when, who and how.
  4. Deep dive into the architecture. Did a process stop? Did the system run out of memory?
  5. Ask someone. Explain to them what you’ve learnt. Often, via rubber-ducking, you’ll find the answer yourself.
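To make steps 1 and 2 a bit more concrete, here is a minimal sketch of narrowing a log down to one user and one time window, using the details from the bug report. The log format, the sample lines and the `find_log_lines` helper are all made up for illustration. Real logs will need their own parsing.

```python
from datetime import datetime


def find_log_lines(lines, user_id, start, end):
    """Return log lines for one user inside a time window.

    Assumes a made-up format: '<ISO timestamp> <LEVEL> user=<id> <message>'.
    """
    wanted = f"user={user_id}"
    matches = []
    for line in lines:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) < 4:
            continue  # not in the assumed format, skip it
        try:
            timestamp = datetime.fromisoformat(parts[0])
        except ValueError:
            continue  # first field is not a timestamp
        if parts[2] == wanted and start <= timestamp <= end:
            matches.append(line.rstrip("\n"))
    return matches


# Invented sample data, only to show the helper in use.
sample_log = [
    "2024-03-01T10:14:58 INFO user=42 GET /profile",
    "2024-03-01T10:15:02 ERROR user=42 500 while saving name",
    "2024-03-01T10:15:03 INFO user=7 GET /profile",
    "2024-03-01T11:30:00 INFO user=42 GET /profile",
]

hits = find_log_lines(
    sample_log,
    user_id=42,
    start=datetime(2024, 3, 1, 10, 10),
    end=datetime(2024, 3, 1, 10, 20),
)
for hit in hits:
    print(hit)
```

With the sample data above, only the two lines for user 42 between 10:10 and 10:20 are printed, including the 500 error.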

What if I can’t replicate? Then I must need some more information! Often, this is where I realise my logging is not useful enough and improve it. After improving my logging, I speak to the user or tester (or in dire straits whip out the app/website/system myself) and ask them to help me try to replicate it. If it still can’t be replicated, then the bug must, sadly, be parked until someone can.
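Here is a small sketch of what “more useful logging” could mean in practice, using Python’s standard `logging` module. The `update_name` handler, the logger name and the 100-byte limit are assumptions for the sake of the example. The idea is to record the inputs that matter (who, and how big) before the risky step, and the full traceback if it fails.

```python
import logging

logger = logging.getLogger("profile")  # hypothetical logger name

MAX_NAME_BYTES = 100  # assumed limit, for illustration only


def update_name(store, user_id, name):
    """Hypothetical handler that logs enough context to replay a failure later."""
    byte_length = len(name.encode("utf-8"))
    logger.info(
        "update_name start user_id=%s chars=%d bytes=%d",
        user_id, len(name), byte_length,
    )
    try:
        if byte_length > MAX_NAME_BYTES:
            raise ValueError(f"name is {byte_length} bytes, limit is {MAX_NAME_BYTES}")
        store[user_id] = name
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception(
            "update_name failed user_id=%s chars=%d bytes=%d",
            user_id, len(name), byte_length,
        )
        raise
    logger.info("update_name ok user_id=%s", user_id)
```

I log the lengths rather than the name itself here, since names are personal data. What is safe to log will depend on your system.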

https://za.pinterest.com/pin/697213586035387615/

3. Identify why it is failing

This is where your debugger comes in, whether it’s gdb or the Visual Studio .NET debugger (or, if you live in a primitive world where there is no viable debugger, print statements). My general strategy for debugging is as follows:

  1. Simply stepping over each line of code to understand the path
  2. Adding in break points at important state or flow changing points.
  3. Checking the state (in the variables, database and logs) before and after each key decision making point

Usually, this will help me isolate a line of code, or a single database query, and from there the fix is usually quite obvious.

Debuggers these days are super powerful, with conditional breakpoints and the like. I will do a future post on debuggers once I’ve levelled up my debugger wizardry to Dumbledore level.
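For a taste of what a conditional breakpoint buys you, here is a minimal Python sketch. The `clean_names` function and the 100-byte threshold are made up. The point is only that the debugger opens for the suspicious input rather than on every loop iteration. Most debuggers can also do this without touching the code: in pdb it is `break <lineno>, <condition>`, and in gdb `break <location> if <condition>`.

```python
def clean_names(names):
    """Strip whitespace from each name, pausing only on the suspicious case."""
    cleaned = []
    for name in names:
        # Poor man's conditional breakpoint: drop into the debugger only when
        # the name is longer than 100 bytes, instead of stopping on every item.
        if len(name.encode("utf-8")) > 100:
            breakpoint()  # opens pdb by default (Python 3.7+)
        cleaned.append(name.strip())
    return cleaned


# Short names only, so the debugger never opens in this call.
print(clean_names(["  Ada ", "Grace"]))
```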

Additionally, for more complex bugs, you may need to whip out the profilers to understand why something is slow or using too much memory. Getting better at using profilers is one of my goals this year, so hopefully I will write a post about that sometime this year.
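As a starting point for the “why is it slow” case, this is a minimal sketch using Python’s built-in `cProfile` and `pstats` modules. The `slow_sum` function is a made-up stand-in for whatever code you suspect, and `profile_call` is a small helper I wrote for the example, not part of any library.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Made-up stand-in for the code under suspicion."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func under the profiler and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time and show only the top five entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, out.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

The report lists each function with its call count and time spent. This only covers time: memory needs a different tool, such as the standard library’s `tracemalloc`.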

4. Fix the Bug

You know it’s fixed when the test is passing. (Well, most of the time anyway. You get those bugs which occur at random, or those badly written tests, so the test fails sometimes and passes sometimes, and you only realise after the 10th run that there is a problem. We call them “flippers”.)
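One cheap way to smoke out a flipper is to run the same test many times and count the failures, rather than trusting a single green run. A minimal sketch: `count_failures` is a helper I made up, and `flipper_test` is an artificial flaky test that fails on every fourth call because of leaked state.

```python
def count_failures(test_func, runs=20):
    """Run a test function repeatedly and count how often it fails."""
    failures = 0
    for _ in range(runs):
        try:
            test_func()
        except AssertionError:
            failures += 1
    return failures


_call_count = {"n": 0}  # shared state that leaks between runs


def flipper_test():
    """Made-up flaky test: the leaked state makes every fourth run fail."""
    _call_count["n"] += 1
    assert _call_count["n"] % 4 != 0, "intermittent failure"


failures = count_failures(flipper_test, runs=20)
print(f"{failures} of 20 runs failed")  # prints "5 of 20 runs failed"
```

Anything other than zero failures (or all failures) suggests the test, or the code under it, depends on something it shouldn’t.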

Now, I’m assuming you’re running your tests consistently as part of some CI process and that they’re all run and passing before your code goes live. If so, and with half-decent code coverage, you’ll hopefully know what other parts of the system your change may have touched. If it causes other tests to start failing, you’ll need to sort out those knock-on effects of the fix too.

5. Once your fix is in test and/or production environments, retest

Often, similar scenarios can cause what appears to be the same bug. It’s important to try to retest in production where possible.

Thank you for reading

So those are the good bug solving habits I have identified so far. Please share your own bug solving habits, and highlight any points or perspectives I may have missed. I want to be better and learn more. I’d also like to improve my writing, so general constructive criticism and grammar nazis are also appreciated.

About the author

Jade Abbott

I’m the ML lead at Retro Rabbit where I’ve worked at every end of putting an ML model into production. By night, I lead Masakhane, a grassroots open research movement for African language technologies.