(The human baseline is a loss of 0.7 bits, with lots of uncertainty on that figure.)
I’d like to know what this figure is based on. In the linked post, Gwern writes:
The pretraining thesis argues that this can go even further: we can compare this performance directly with humans doing the same objective task, who can achieve closer to 0.7 bits per character.
But in that linked post, there’s no mention of “0.7” bits in particular, as far as I or cmd-f can see. The most relevant passage I’ve read is:
Claude Shannon found that each character was carrying more like 1 (0.6-1.3) bit of unguessable information (differing from genre to genre[8]); Hamid Moradi found 1.62-2.28 bits on various books[9]; Brown et al 1992 found <1.72 bits; Teahan & Cleary 1996 got 1.46; Cover & King 1978 came up with 1.3 bits[10]; and Behr et al 2002 found 1.6 bits for English and that compressibility was similar to this when using translations in Arabic/Chinese/French/Greek/Japanese/Korean/Russian/Spanish (with Japanese as an outlier). In practice, existing algorithms can make it down to just 2 bits to represent a character, and theory suggests the true entropy was around 0.8 bits per character.[11]
I’m not sure what the relationship is between supposedly unguessable information and human performance, but assuming that all these sources were actually just estimating human performance, and without looking into the sources more… this isn’t just lots of uncertainty, but vast amounts of uncertainty, where it’s very plausible that GPT-3 has already beaten humans. This wouldn’t be that surprising, given that GPT-3 must have memorised a lot of statistical information about how common various words are, which humans certainly don’t know by default.
I have a lot of respect for people looking into a literature like this and forming their own subjective guess, but it’d be good to know if that’s what happened here, or if there is some source that pinpoints 0.7 in particular as a good estimate.
It’s based on those estimates and the systematic biases in such methods & literatures. Just as you know that psychology and medical effects are always overestimated and can be rounded down by 50% to get a more plausible real-world estimate, such information-theoretic methods will always overestimate model performance and underestimate human performance, and are based on various idealizations: they use limited genres and writing styles (formal, omitting informal ones like slang), don’t involve extensive human calibration or training like the models get, don’t involve any adversarial examples, don’t try to test human reasoning by writing up texts made up of logical riddles and puzzles or complicated cause-and-effect scenarios or even things like Winograd Schemas, are time-biased, etc. We’ve seen a lot of these issues come up in benchmarking, like ImageNet models doing worse outside ImageNet despite hitting human parity or superiority on it. (If we are interested in truly testing ‘compression = intelligence’, we need texts which stress all capabilities and remove all of those issues.)
So given that the lower end of Shannon’s interval is 0.6, Grassberger’s asymptotic estimate is 0.8 (footnote 11), and there is a wide spread of upper bounds going down to 1.3, along with extremely dumb fast algorithms hitting 2, I am comfortable with rounding them downish to get an estimate of 0.7 bpc for human performance; and I expect that to, if anything, still underestimate true human peak performance, so I wouldn’t be shocked if it was actually more like 0.6 bpc.
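For anyone who wants to see how the compression-based upper bounds are computed, here is a minimal sketch using Python's standard-library compressors. The `SAMPLE` text, the `bits_per_character` helper, and the `compare` function are all made up for illustration and are not taken from any of the cited papers. The idea is just that compressed size in bits divided by the number of characters gives an upper bound on bits per character for that text.

```python
import bz2
import lzma
import zlib


def bits_per_character(text: str, compress) -> float:
    """Compressed size in bits divided by the number of characters.

    This is an upper bound on the per-character entropy of this text
    under the given compressor, not an estimate of human performance.
    """
    if not text:
        raise ValueError("text must be non-empty")
    raw = text.encode("utf-8")
    return 8 * len(compress(raw)) / len(text)


# Hypothetical sample: highly repetitive, so it compresses far better
# than real prose would. A meaningful measurement needs a large,
# varied corpus.
SAMPLE = "the quick brown fox jumps over the lazy dog and then " * 200

# General-purpose compressors, standing in for the "dumb fast algorithms".
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lzma.compress,
}


def compare(text: str) -> dict:
    """Return bits per character for each compressor in COMPRESSORS."""
    return {name: bits_per_character(text, fn) for name, fn in COMPRESSORS.items()}


if __name__ == "__main__":
    for name, bpc in compare(SAMPLE).items():
        print(f"{name}: {bpc:.3f} bits per character")
```

The numbers from a toy sample like this say nothing about English in general. The sketch only shows the arithmetic behind a figure like "2 bits per character", and the result depends heavily on which text is fed in, which is the genre and style problem mentioned above.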