How do you learn to replicate bugs when they happen inconsistently?
I don’t have definitive advice here; I think this is a hard problem no matter your skill level. You can do things in advance to make your program more debuggable, like adding better logging, and adding assertions so that you catch the bug closer to its root cause.
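As a small illustration of the logging and assertion point, here is a minimal sketch in Python. The function `apply_discount`, the logger name `orders`, and the specific checks are all made up for this example; the idea is only that a bad value fails at the boundary where it enters, instead of showing up later as a strange result somewhere else.

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("orders")  # hypothetical logger name


def apply_discount(price_cents, percent):
    """Return the discounted price in cents; fail loudly on impossible inputs."""
    # Check assumptions where the values come in, so a bad value is caught
    # here rather than much later, far away from whatever produced it.
    assert price_cents >= 0, f"negative price: {price_cents}"
    assert 0 <= percent <= 100, f"discount out of range: {percent}"

    discounted = price_cents * (100 - percent) // 100
    # Log inputs and output so an odd run can be reconstructed afterwards.
    log.debug("apply_discount price=%d percent=%d -> %d",
              price_cents, percent, discounted)
    return discounted
```

Note that Python skips `assert` statements when run with `-O`, so checks that must always hold in production are probably better written as explicit `if ...: raise` statements.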
A more general pattern to look for is a tool that can capture a particular run of the system in a reproducible, replayable form. For a single program running locally, a core dump is already quite good: you can look at the whole state of your program at the moment of the crash (for example the full stack trace and all variables), and this can already tell you a lot. I have also heard great things about rr; as I understand it, it records a whole execution and lets you replay it in a debugger, single-stepping both forwards and backwards.
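Python does not write a core dump for an ordinary exception, but a rough analogue of "the stack trace and all variables" can be sketched with the standard traceback objects. This is only an illustration under that assumption, not a replacement for a real core dump or for rr; the names `snapshot` and `dump_on_crash` are invented here.

```python
import sys
import traceback


def snapshot(exc):
    """Collect function name, line number and locals for each traceback frame."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append({
            "function": frame.f_code.co_name,
            "line": tb.tb_lineno,
            # repr() keeps the snapshot printable even for complex objects.
            "locals": {name: repr(value)
                       for name, value in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return frames


def dump_on_crash(exc_type, exc, tb):
    """Hook for uncaught exceptions: print the usual traceback, then the locals."""
    traceback.print_exception(exc_type, exc, tb)
    for entry in snapshot(exc):
        print(f"{entry['function']} (line {entry['line']}): {entry['locals']}",
              file=sys.stderr)


# Install the hook so every uncaught exception also dumps local variables.
sys.excepthook = dump_on_crash
```

The frames come out ordered from the outermost caller to the function that actually raised, so the last entry is usually the most interesting one.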
For distributed systems, like web applications, the problem is even harder. I think I have seen some projects aiming to do the whole “reproducible execution” thing for distributed systems, but I don’t know of any that I could recommend. In theory the problem should not be hard: capture all inputs to the system and, since computers are deterministic, replay those inputs. In practice, given the complexity of our software stacks (thread scheduling, clocks, network timing), determinism is often more of a pipe dream.
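To make the "capture all inputs, then replay them" idea concrete, here is a toy sketch in Python for a single process. Everything in it is hypothetical: `Recorder`, `Replayer`, and `handle_request` are invented names, and the only nondeterministic inputs considered are the clock and a random number. Real systems have many more such inputs, which is exactly why this gets hard.

```python
import json
import random
import time


class Recorder:
    """Asks the real world and logs every nondeterministic input it hands out."""

    def __init__(self):
        self.log = []

    def now(self):
        value = time.time()
        self.log.append(["now", value])
        return value

    def rand(self):
        value = random.random()
        self.log.append(["rand", value])
        return value


class Replayer:
    """Feeds back the logged inputs in the same order instead of asking the world."""

    def __init__(self, log):
        self._entries = iter(log)

    def _next(self, kind):
        recorded_kind, value = next(self._entries)
        # If the code asks for inputs in a different order, replay has diverged.
        assert recorded_kind == kind, (
            f"replay diverged: log has {recorded_kind}, code asked for {kind}")
        return value

    def now(self):
        return self._next("now")

    def rand(self):
        return self._next("rand")


def handle_request(env, user):
    """Hypothetical handler whose result depends on time and randomness."""
    token = int(env.rand() * 1_000_000)
    return f"{user}:{token}:{int(env.now())}"


recorder = Recorder()
original = handle_request(recorder, "alice")

# The log is plain data, so it can be saved and attached to a bug report.
saved = json.dumps(recorder.log)

replayed = handle_request(Replayer(json.loads(saved)), "alice")
assert replayed == original
```

The design choice that matters is that the handler never calls `time.time()` or `random.random()` directly; every nondeterministic value goes through `env`, so there is a single place to record it.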
How does one “read the docs?”
Something something “how to build up a model of the entire stack.”
I think these are closely related. I imagine my “model of the entire stack” as scaffolding with some knowledge holes that can be filled in quickly when needed. The goal is to have no unknown unknowns. When I notice that I need more fidelity in some area of my model, those are exactly the docs I read up on.
When reading docs, you can have different intentions. Maybe you are learning about something for the first time and just want to get an overall understanding. Or maybe you already have the overall understanding and are just looking for some very specific detail. Documentation is often written to target one of those use cases, and you should be aware that well-documented systems often have several kinds of documentation, one for each. This is one model I came across that tries to categorize documentation (though I am not sure I subscribe to these exact 4 categories):
Getting back to the “model of the entire stack” thing, I think it’s very important for how I (at least) approach computer systems. This article by Drew DeVault in particular was an important mindset shift for me back when I read it. Some quotes:
Some people will shut down when they’re faced with a problem that requires them to dig into territory that they’re unfamiliar with. [...] Getting around in an unfamiliar repository can be a little intimidating, but do it enough times and it’ll become second nature. [...] written in unfamiliar programming languages or utilize even more unfamiliar libraries, don’t despair. All programming languages have a lot in common and huge numbers of resources are available online. Learning just enough to understand (and fix!) a particular problem is very possible
I now believe that being able to quickly jump into unfamiliar codebases and unfamiliar languages is a very important skill to have developed. This is also important because documentation is often lacking or non-existent, and the code is the “documentation”.
Also, I feel like the “model of the entire stack” thing is a phase shift for debugging once you get there. Suddenly, you can be very confident about finding out the root cause of any (reproducible) bug in bounded time.
If at any point you notice that your unfamiliarity with some part of the system is getting in the way of solving a problem, that’s a sign to study that area in more detail. (I think this is easier to notice when debugging, but it can be equally important when building new features. Sometimes your unfamiliarity with a certain area leads you to build a more complex solution than necessary, since you are unable to search paths that route through that area. As a map analogy: you have a dark spot on your map, and the skill is noticing when there is likely to be a shorter path between two points that runs through the dark area.)