It’s based on those estimates and the systematic biases in such methods & literatures. Just as you know that psychology and medical effect sizes are routinely overestimated and can be rounded down by 50% to get a more plausible real-world estimate, such information-theoretic methods will always overestimate model performance and underestimate human performance, and are based on various idealizations: they use limited genres and writing styles (formal, omitting informal registers like slang), don’t involve extensive human calibration or training like the models get, don’t involve any adversarial examples, don’t try to test human reasoning by writing up texts made of logical riddles and puzzles or complicated cause-and-effect scenarios or even things like Winograd schemas, and are time-biased, etc. We’ve seen a lot of these issues come up in benchmarking, like ImageNet models degrading badly outside ImageNet despite hitting human parity or superiority on it. (If we are interested in truly testing ‘compression = intelligence’, we need texts which stress all capabilities and remove all of those issues.)
So given that the lower end of Shannon’s interval is 0.6 bpc, Grassberger’s asymptotic estimate is 0.8 (footnote 11), and there is a wide spread of upper bounds going down to 1.3 while extremely dumb fast algorithms hit ~2, I am comfortable with rounding them downish to an estimate of 0.7 bpc for human performance; and I expect that to, if anything, still underestimate true human peak performance, so I wouldn’t be shocked if it were actually more like 0.6 bpc.
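As a concrete illustration of the ‘extremely dumb fast algorithm’ end of the scale, a general-purpose compressor gives a crude upper bound on bits-per-character. This is only a sketch: zlib stands in for gzip-class compressors, and on short samples the header overhead and lack of a trained model inflate the estimate well above the ~2 bpc that good compressors manage on large corpora, let alone the ~0.6–1.3 bpc range under discussion.

```python
# Crude bpc upper bound from a general-purpose compressor.
# zlib (DEFLATE) is a stand-in for "dumb fast" compressors; serious
# estimates use far stronger models and much larger corpora.
import zlib

def bpc_estimate(text: str) -> float:
    """Upper-bound bits-per-character via zlib at max compression."""
    data = text.encode("utf-8")
    return len(zlib.compress(data, level=9)) * 8 / len(data)

sample = (
    "Information-theoretic methods tend to overestimate model "
    "performance and underestimate human performance: they use limited "
    "genres and writing styles, omit slang, involve no adversarial "
    "examples, and never stress reasoning with riddles, causal "
    "scenarios, or Winograd schemas. Benchmarks show the same pattern, "
    "with models degrading outside their training distribution despite "
    "apparent human parity inside it."
)
print(f"{bpc_estimate(sample):.2f} bpc")
```

Note that repetition drives the estimate down sharply (try `sample * 100`), which is one reason text selection matters so much for these comparisons.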