I played the Perplexity game (the one where you select probabilities) and easily beat the 2-layer model and was close to the 12-layer model most of the time. Matching the 24-layer model feels achievable after some work on calibrating my probability intuitions.
Results after 40 questions: 747 (me) | −55 (2-layer) | 755 (12-layer) | 866 (24-layer)
Most of the difficulty came from deciding exactly how confident I was, rather than from not knowing which token was more likely. Humans are pretty bad at this, especially when probabilities are close to 0 or 1; I lost the most ground relative to the models in situations where I failed to distinguish between a 1%/99% probability and a 10%/90% probability. I suspect changing the game so that humans are not punished as severely for poor calibration would make for a more meaningful comparison. Even just changing the most extreme selection options from 1%/99% to 5%/95% would help level the playing field, since humans have poor intuition toward the probability extremes. As it is, a human superforecaster with poor language skills would likely crush someone with extraordinary language skills but average probability calibration.
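To illustrate why the extremes are so unforgiving, here's a minimal sketch, assuming the game scores with a logarithmic rule (which the cumulative scores above suggest, though I haven't verified it):

```python
import math

def log_score(p_assigned_to_truth: float) -> float:
    """Logarithmic score in bits: 0 is a perfect prediction,
    more negative is worse."""
    return math.log2(p_assigned_to_truth)

# Compare an extreme pick (99%) against a moderate one (90%)
# when the guessed token turns out right vs. wrong.
for p in (0.99, 0.90):
    right = log_score(p)      # guessed token was correct
    wrong = log_score(1 - p)  # guessed token was incorrect
    print(f"p={p:.2f}: right {right:+.2f} bits, wrong {wrong:+.2f} bits")

# p=0.99: right -0.01 bits, wrong -6.64 bits
# p=0.90: right -0.15 bits, wrong -3.32 bits
```

Under this rule, moving from 90% to 99% gains about 0.14 bits when you're right but costs over 3 bits when you're wrong, so a small calibration error at the extremes swamps many correct token guesses.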
I’d be interested in seeing how humans would compare on a similar test where the task is instead to choose the top 2 out of 5 or top 3 out of 10 options.