Right. Task Difficulty is a hard thing to get a handle on.
* You don’t know how hard a problem is until you’ve solved it, so in practice any difficulty metric has to be computed from some record of the problem actually being solved.
* Intuitively, we might want some metric that depends on the “stack trace” of the solution, i.e. what sorts of mental moves had to happen for the person to solve the problem. Incidentally, this means that all such metrics are sometimes over-estimates (maybe there’s an easier way to solve the problem that the person you watched solving it missed.) Human wall-clock time is, in some sense, the simplest question one could ask about the stack trace of a human solving the problem.
* The difficulty of a problem is often agent-relative. There are plenty of contest math problems that are rendered easy if you have the right tool in your toolkit and really hard if you don’t. Crystallized intelligence often passes for fluid intelligence, and the two blend into each other.
Some other potential metrics (a loose brainstorm):
* Hint length—In some of Eliezer’s earlier posts, intelligence got measured as optimization pressure in bits (intuitively: how many times you have to double the size of the target for it to fill the entire dart board. Of course, you need a measure space for your dart board for this to work.) Loosely inspired by this, we might pick some model that’s capable of following chains of logic but not very smart (whatever it knows how to do is a stand-in for what’s obvious.) Then ask how long a hint string you have to hand it before it can solve the problem. (Of course, finding the shortest hint string is hard; you’d need to poke around heuristically to find a relatively short one; a toy sketch appears after the caveats below.)
* Elo-score-type metrics—You treat each of your puzzles and agents (which can be either AIs or humans) as players of a two-player game. If the agent solves a puzzle, the agent wins; otherwise the puzzle wins. Then we calculate Elo scores (also sketched below). The nice thing about this is that we effectively get to punt on the problem of defining a difficulty metric, by saying that each agent has a latent intelligence variable and each problem has a latent difficulty variable, and we can estimate both together by looking at who was able to solve which problems.
Caveats: Of course, like human wall-clock time, this assumes intelligence and difficulty are one-dimensional, though if you can say what you’d like to assume instead, you can build a statistical model more sophisticated than the one Elo scoring implicitly assumes. Also, this still doesn’t help for measuring the difficulty of problems way outside the eval set (the “using Paleolithic canoeing records to forecast when humans will reach the moon” obstacle): if everybody loses against puzzle X, that doesn’t put much of an upper bound on how hard it is.
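To make the hint-length idea concrete, here is a minimal sketch. It assumes a hypothetical predicate `solves(problem, hint)` that runs the weak stand-in model with the hint prepended and checks its answer; given a full hint that is known to work, it binary-searches over prefixes for a shorter one. This is just one heuristic way to “poke around,” and it quietly assumes success is roughly monotone in prefix length.

```python
from typing import Callable

def shortest_hint_prefix(
    problem: str,
    full_hint: str,
    solves: Callable[[str, str], bool],
) -> str:
    """Heuristically shrink a known-good hint by binary-searching over its prefixes.

    `solves(problem, hint)` is an assumed predicate: run the weak stand-in model
    with the hint prepended and check whether it reaches a correct solution.
    Assumes success is roughly monotone in prefix length (a heuristic, not a guarantee).
    """
    if solves(problem, ""):
        return ""  # the stand-in model needs no hint at all
    assert solves(problem, full_hint), "the full hint is supposed to be sufficient"

    lo, hi = 0, len(full_hint)  # longest known-failing / shortest known-working prefix length
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if solves(problem, full_hint[:mid]):
            hi = mid
        else:
            lo = mid
    return full_hint[:hi]

# Toy usage: the "model" succeeds once the hint mentions the key idea.
toy_solves = lambda problem, hint: "induction" in hint
full = "Set up the base case first, then notice the induction on n, then bound the error term."
print(shortest_hint_prefix("toy problem", full, toy_solves))
```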
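And a minimal sketch of the Elo-style fit, assuming outcomes arrive as (agent, puzzle, solved) triples. It uses the ordinary online Elo update, with repeated passes over the data as a crude stand-in for a proper latent-variable (Bradley–Terry) fit; the agent and puzzle names are made up.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation: the probability that player A beats player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def fit_elo(results, k: float = 16.0, epochs: int = 50, base: float = 1000.0) -> dict:
    """Fit Elo-style ratings from (agent, puzzle, solved) records.

    Each record is treated as a two-player game: the agent "beats" the puzzle if
    solved is True; otherwise the puzzle wins. Repeated passes over the data are
    a crude substitute for a maximum-likelihood fit of the latent variables.
    """
    ratings = defaultdict(lambda: base)
    for _ in range(epochs):
        for agent, puzzle, solved in results:
            e = expected_score(ratings[agent], ratings[puzzle])
            update = k * ((1.0 if solved else 0.0) - e)
            ratings[agent] += update
            ratings[puzzle] -= update
    return dict(ratings)

# Hypothetical results: two agents attempting three puzzles.
results = [
    ("agent-A", "puzzle-1", True),
    ("agent-A", "puzzle-2", True),
    ("agent-A", "puzzle-3", False),
    ("agent-B", "puzzle-1", True),
    ("agent-B", "puzzle-2", False),
    ("agent-B", "puzzle-3", False),
]
print(fit_elo(results))  # higher rating = smarter agent / harder puzzle
```

Note that puzzle-3 beats both agents here, so its rating only tells you it is at least as hard as the strongest agent it has faced—exactly the upper-bound caveat above.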