Sources:
https://web.stanford.edu/~jurafsky/slp3/
https://www.isca-speech.org/archive/Interspeech_2017/abstracts/0729.html
The latter is the source for the human perplexity of ~12. Note, though, that it was measured on the 1 Billion Word benchmark, where GPT-2 scored 42.2 (the 35.8 figure was for Penn Treebank), so the comparison is not exactly 1:1.
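For context on what these numbers mean: perplexity is the exponential of the average negative log-likelihood per token, so a perplexity of 12 means the model is, on average, as uncertain as a uniform choice among 12 options. A minimal sketch (the helper name and the uniform-model example are illustrative, not from either source):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that spreads probability uniformly over 12 choices assigns
# log(1/12) to every token, giving a perplexity of exactly 12 -- the
# same "effective branching factor" as the human estimate cited above.
uniform = [math.log(1 / 12)] * 100
print(round(perplexity(uniform), 2))  # 12.0
```

By the same measure, GPT-2's 42.2 on 1 Billion Words corresponds to a 42-way average uncertainty per token, which is why the datasets must match for the comparison to be fair.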