Contains a bug that would take at least 6 hours for a skilled programmer to solve, and ideally >20hrs
This phrasing is odd to me, in two ways:
Contains a bug that would [be hard] to solve
So I think a lot of this depends on your definition of “solve”. I frequently run into bugs where I’d expect identifying and fixing the exact root cause to take upwards of 20 hours (e.g. it’s clearly some sort of race condition, but nobody has yet managed to reproduce it), but where sidestepping the bug is fast: slap a lock around both of the entire operations that were trying to update the same entity, as in the sketch below.
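To make the “sidestep” option concrete, here’s roughly what I mean as a minimal Python sketch, with a made-up shared entity and made-up operations (the names here are mine for illustration, not from any real codebase):

```python
import threading

# Hypothetical shared entity that two concurrent operations keep racing to update.
balance = 0
balance_lock = threading.Lock()

def deposit(amount):
    """One of the two racing operations, now wrapped entirely in the lock."""
    global balance
    with balance_lock:
        current = balance           # read
        balance = current + amount  # modify-write, no longer interleavable

def withdraw(amount):
    """The other racing operation, wrapped in the same lock."""
    global balance
    with balance_lock:
        current = balance
        balance = current - amount
```

The race is gone without anyone ever working out the exact interleaving that caused it, which is the sense in which I’d call the bug “sidestepped” rather than solved.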
For an example of what I mean, see the “Segmentation Fault when Using SentenceTransformer Inside Docker Container” question: the conclusion seems to be “there’s a known bug with using pytorch with Python 3.11 on an Apple M2 Mac within a docker container, you can fix it by using a different version of Python, a different version of pytorch, or a different physical machine.”
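(As an aside, the practical “fix” there isn’t really code at all; it’s checking which combination you’re on and then swapping one component out. Something like this, run inside the container, is roughly the first check I’d do. It’s purely illustrative, and I’m taking that thread’s diagnosis of the bad combination on faith rather than having reproduced it myself.)

```python
# Illustrative environment check inside the container, to see whether you match
# the combination the question blames (Python 3.11 + arm64 Linux under Docker
# on an M-series Mac + that torch build).
import platform
import sys

import torch  # assumes torch is already installed, since that's what crashes

print("python :", sys.version.split()[0])
print("machine:", platform.machine())  # typically 'aarch64' in Docker on an M2
print("torch  :", torch.__version__)

if sys.version_info[:2] == (3, 11) and platform.machine() == "aarch64":
    print("-> matches the reportedly-bad combination; "
          "per that thread, try a different Python or torch version")
```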
6 hours for a skilled programmer to solve
In my experience, a lot of bugs are almost impossible to diagnose if you’ve never seen a bug in this class before and don’t know how to use the debugging toolchain, and trivial to diagnose if you have seen and dealt with this kind of bug before.
Looking at the same example from before, I bet there’s at least one engineer at Apple and one pytorch dev who, if you got them together, could fire up Apple’s internal equivalent of gdb and figure out exactly what tensor operations SentenceTransformer.forward() is trying to do, which operation failed, why it failed, and what equivalent operation would work in place of the failing one. It’s likely something extremely dumb and shallow if you get the right people in a room together working on it.
Without the ability to debug what’s going on in the Apple-specific part of the stack, I bet this would take at least 20 hours to solve, probably much more (because the tooling sucks, not because the problem is inherently difficult).
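The cheapest thing you can do without any of that tooling is to at least get a usable stack out of the crash. Something like this is where I’d start (a sketch; the model name is just the standard example from the sentence-transformers docs, not necessarily what the asker was running):

```python
# Minimal repro with faulthandler enabled, so a segfault dumps the Python stack
# that was live when the native code died. That narrows "it crashes somewhere
# inside forward()" down to a specific call site, even without a debugger.
import faulthandler
faulthandler.enable()

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["does this segfault in your container?"])
print(embeddings.shape)
```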
So I guess my question is whether “hard” bugs of the form “this supposedly portable program crashes when you first try to use it on that platform, and neither the program nor the platform are well documented” count here. If so there are countless public examples that are not solved and likely never will be solved.
Ideally the task should work well with static resources—e.g. you can have a local copy of the documentation for all the relevant libraries, but don’t have general internet access. (This is because we want to make sure the difficulty doesn’t change over time, if e.g. someone posts a solution to stack overflow or whatever)
Great questions! We’re interested in tasks where we do actually have an example of the bug being solved, so that we can estimate the difficulty level. I think we’re interested both in tasks where you need to sidestep the bug somehow and make the program work, and in ones where you need to specifically explain what was going wrong.
This wasn’t explained that well above, but the intended difficulty level is more like “6-20 hours for a decent engineer who doesn’t have context on this particular codebase”. E.g. a randomly selected engineer who’s paid $100-$200 per hour and is familiar with the language and overall stack being used, but who isn’t the person who wrote the code and isn’t an expert in the particular component that’s causing the bug.
I’d be very interested if you have any ideas for a better way to get some kind of universal difficulty metric—it’s not great for our purposes if the “task difficulty” varies wildly between humans with the same on-paper qualifications.
Cool idea!