I don’t actually know if I count as a “senior developer”, but I’m pretty convinced I count as a “senior debugger” given the amount of time I’ve spent chasing problems in other people’s often-unfamiliar code.
These questions feel really hard to answer, maybe too general? When I try to answer them, I keep flying off into disconnected lists of heuristics and “things to look for”.
Also, you don’t give any examples of problems you’ve found hard, and I feel I may be answering at too simplistic a level. But...
How do you learn to replicate bugs, when they happen inconsistently in no discernable pattern?
The thing is that there is a discernable pattern. Once you’ve fixed the bug, you’ll probably be able to say exactly what triggers it.
If you can’t (yet) reproduce the bug, think about states of the program that could have led to whatever you’re seeing.
You know what routine detected a problem[1], and you usually know what routine produced bad output (it’s the one that should have produced good output at that point).
Your suspect routine usually uses a relatively small set of inputs to do whatever it does. It can only react to data that it at least examines. So what does it use to do what it does? Where do its inputs come from, not necessarily in the sense of what calling routine passes them in, but in the sense of how they enter the overall program? What values can they take? What values would they have to take to produce the behavior you’re seeing? How could they end up taking those particular values? What parts of the program and environment state are likely to vary, and how will they affect this routine?
Very often, you can answer those questions without getting bogged down in the call graph or having to try a huge number of cases. Sometimes you can not just reproduce the bug, but actually fix it.
If function X is complaining about a “font not found” error, then it’s presumably looking up some font name or other identifier in some font database. There probably aren’t that many places that the font identifier can be coming from, and there’s probably only one font database.
If you can say, “well, the font should either be the default system font, or a font specified in the user profile”, then you can make an intuitive leap. You know that everything uses the default system font all the time, so that path probably works… but maybe it’s possible for a user profile to end up referring to a font that doesn’t exist… but I know the code checks for that, and I can’t set a bogus font on my own profile… but wait, what if the font gets deleted after the user picks it?
Of course, there are lots of other possibilities[2]. Maybe there’s some weird corner case where the font database isn’t available, or you’re using the wrong one, or it’s corrupted, or whatever. But it’s unlikely to be something completely unrelated that’s happening in some function in the call stack that doesn’t use the font information at all.
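Here’s a rough sketch of the “font gets deleted after the user picks it” guess, just to show how small a reproduction can be once you’ve worked out what state would have to exist. Everything in it (FontDatabase, UserProfile, render_label, the font names) is made up for illustration and isn’t any real font API.

```python
# Hypothetical names throughout; this models the scenario, not a real library.

class FontNotFound(Exception):
    pass


class FontDatabase:
    def __init__(self, names):
        self._fonts = set(names)

    def exists(self, name):
        return name in self._fonts

    def delete(self, name):
        self._fonts.discard(name)

    def lookup(self, name):
        if name not in self._fonts:
            raise FontNotFound(name)
        return name


class UserProfile:
    def __init__(self, db):
        self._db = db
        self.font = None  # None means "use the system default"

    def set_font(self, name):
        # Validation happens only here, at the moment the user picks the font.
        if not self._db.exists(name):
            raise ValueError(f"unknown font: {name}")
        self.font = name


def render_label(db, profile, default="System Sans"):
    # The suspect routine: its only font inputs are the profile and the default.
    return db.lookup(profile.font or default)


def reproduce():
    db = FontDatabase({"System Sans", "Fancy Script"})
    profile = UserProfile(db)
    profile.set_font("Fancy Script")  # passes validation
    db.delete("Fancy Script")         # state changes after validation
    try:
        render_label(db, profile)
    except FontNotFound as exc:
        return f"font not found: {exc}"
    return "no error"
```

The point is that the check in set_font is fine and the default path is fine; the bug only exists in the gap between when the value was validated and when it was used.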
Or maybe function X is searching by font attributes instead of names. So where might it be getting extra constraints on the query?
Or maybe function Y is blowing up trying to use a null value. The root cause is almost certainly that you made a poor choice of programming language, but you can’t fix that now, so persevere. Usually you have the line of code where it choked, but say you don’t. What values are in scope in Y that could be null? Well, it gets called with a font descriptor. Could that be null? Well, wait, the font lookup routine returns a null value if it can’t find a font. So maybe we have a font not found error in disguise. So try thinking about it, for a limited time, as a font-not-found error.
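As a sketch of the “font not found in disguise” idea, assume a lookup routine that returns None on a miss and a downstream routine that assumes it got a real descriptor. All the names here (find_font, line_height, nullable_inputs, FONTS) are invented for the example.

```python
from typing import Optional

FONTS = {"System Sans": {"size": 12}}  # hypothetical stand-in font database


def find_font(name: str) -> Optional[dict]:
    # Returns None instead of raising when the font is missing.
    return FONTS.get(name)


def line_height(descriptor: dict) -> float:
    # "Function Y": assumes a real descriptor and blows up if handed None.
    return descriptor["size"] * 1.2


def nullable_inputs(**values):
    """List which of the named values in scope are None."""
    return [name for name, value in values.items() if value is None]


def diagnose(font_name: str) -> str:
    descriptor = find_font(font_name)
    try:
        line_height(descriptor)
    except TypeError:
        # Subscripting None raises TypeError, far from the failed lookup.
        culprits = nullable_inputs(font_name=font_name, descriptor=descriptor)
        return f"null inputs: {culprits}; treat as font-not-found for {font_name!r}"
    return "ok"
```

The crash is reported in line_height, but the only value in scope that can be None comes from find_font, which is what turns the null error back into a lookup problem.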
Or maybe “fnord!” is showing up randomly in the output. So where could it come from? First step: brute force:

    grep -ir 'fnord' src

Leave off the exclamation point, at least to start. Punctuation tends to be quoted weirdly or added by code. If it’s not there, is it in the binary? Is it in the database? Is it in the config file? Is it anywhere on the whole damned system? If not, that leaves what? Probably the network.
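The same brute-force idea can be sketched in Python, for when you want to search raw bytes in binaries and data files and post-process the hits in a script. find_string is a made-up helper name, and it reads each file whole, so it assumes the files fit in memory.

```python
import os


def find_string(root, needle):
    """Yield paths under root whose raw bytes contain needle, ignoring ASCII case."""
    target = needle.lower().encode("utf-8")
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort the search
            if target in data.lower():
                yield path
```

Something like list(find_string("src", "fnord")) gives you the candidate files; widening root is how you work outward from the source tree to the rest of the system.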
In the end, though, there’s also a certain amount of pattern matching against experience. “Code that does X is usually structured like Y”. “Does this thing have access to the files it needs?”. “Programmers always forget about cases like Z”. “I always forget about cases like W”. “Weird daemon behavior that you can’t reproduce interactively is always caused by SELinux”.
Once you do reproduce the bug, you can always just switch to a strategy of brute force tracing everything that goes on anywhere near it, with a debugger, with printing or logging, with a system call tracer, or whatever. But, yeah, you’ve got to get it to happen if you want to trace it.
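For the printing-or-logging flavor of brute-force tracing, a minimal sketch in Python is a decorator you slap on everything near the bug. The traced decorator and the lookup_font routine are hypothetical; only the standard logging and functools modules are real.

```python
import functools
import logging

log = logging.getLogger("trace")


def traced(func):
    """Log every call to func with its arguments, then its result or exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            log.exception("raise %s", func.__name__)
            raise
        log.debug("return %s -> %r", func.__name__, result)
        return result
    return wrapper


@traced
def lookup_font(name):  # hypothetical routine near the bug
    return {"System Sans": 12}.get(name)
```

Call logging.basicConfig(level=logging.DEBUG) once at startup to see the output. It’s crude, but once the bug reproduces, crude is fine.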
How does one “read the docs?”
It depends on what you’re trying to find out.
I find I mostly use two kinds of documentation:
Architectural stuff. General material that explains the concepts running around in the code, the terminology, what objects exist, what their life cycles look like, etc. This sort of documentation is often either nonexistent, or so bloated and disorganized as to be useless for quick debugging, and maybe useless period. But if it exists and is any good, it’s gold. If you’re trying to educate yourself about something that you’re going to use heavily, you may just have to slog through all of whatever’s available.
API documentation, ideally with links to source code. I usually don’t even skim this. I navigate it with keyword search. If you want to know how to frobnicate a blenk, then search for “frobnicate” and whatever synonyms you can come up with[3]. If you’re trying to debug a stack trace from a library routine, look up that routine and see what parameters it takes.
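If the API documentation lives in docstrings, the keyword-and-synonyms search can be done against the code itself. This is only a sketch for Python modules, and search_api is a name I made up; it leans on the standard inspect module.

```python
import inspect


def search_api(module, keywords):
    """Return names of public functions and classes in module whose name
    or docstring mentions any of the keywords (case-insensitive)."""
    wanted = [word.lower() for word in keywords]
    hits = []
    for name, member in inspect.getmembers(module):
        if name.startswith("_"):
            continue
        if not (inspect.isroutine(member) or inspect.isclass(member)):
            continue  # skip constants and submodules
        text = f"{name} {inspect.getdoc(member) or ''}".lower()
        if any(word in text for word in wanted):
            hits.append(name)
    return hits
```

For example, search_api(shutil, ["copy", "duplicate", "clone"]) after import shutil lists the candidates worth reading properly. Passing several synonyms at once matters, because you rarely guess the writer’s term on the first try.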
In my never especially humble opinion, “tutorials” are mostly wastes of time beyond the first couple of units, and it’s unfortunate that people concentrate so much on them instead of writing decent architecture and concept documentation.
[1] Blowing up with a stack trace counts as “detecting the problem”.
[2] One important part of the skill is allocating your time and attention, and not getting stuck on paths that aren’t bearing fruit. You can always come back to an idea if nothing else works either. The counter-consideration is that if you don’t think at least a little bit deeply about whatever avenue you’re exploring, you’re unlikely to have a good sense of how fruitful it looks. So you have to balance breadth against depth in your search.
[3] If the architecture documentation doesn’t suck, you can get likely terms by reading it. Or ask an LLM that’s probably internalized it. Otherwise you just have to read the writer’s mind. If you get good enough at reading their mind, you can use a certain amount of keyword search in the architecture documents, too.