I’d be interested to see a test more favorable to humans. Maybe humans are better at judging longer completions because of some kind of coherence between tokens, so a test could be:

- a human attempts to distinguish a 5-token GPT-3 continuation from the truth;
- GPT-3 attempts to distinguish a 5-token human continuation from the truth;

and whichever does better is the better language model. It still seems like GPT-3 would win this one, but maybe there are other test designs that measure more human abilities.
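A minimal sketch of what scoring that symmetric discrimination test might look like. The `judge` callable here is a hypothetical stand-in for either a human annotator or a model-based classifier; presentation order is shuffled so a positional bias can't help:

```python
import random

def run_discrimination_test(items, judge, rng):
    """Score a judge on 2-way forced choice: real continuation vs. fake.

    `items` is a list of (prompt, real_continuation, fake_continuation)
    tuples; `judge` maps (prompt, option_a, option_b) -> 0 or 1, the
    index of the option it believes is the true continuation.
    Returns accuracy in [0, 1]; chance performance is 0.5.
    """
    correct = 0
    for prompt, real, fake in items:
        options = [real, fake]
        rng.shuffle(options)  # randomize presentation order
        guess = judge(prompt, options[0], options[1])
        if options[guess] == real:
            correct += 1
    return correct / len(items)
```

Running the same harness once with human judges on GPT-3 fakes and once with GPT-3 as judge on human fakes gives the two accuracies to compare.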
I suspect humans would show better zero-shot performance on a more GAN-like objective over a more interesting dataset.
Take a new, unpublished (human-written) story; for every sentence or paragraph, the test-taker must pick the correct completion, multiple-choice style, from a single correct answer and 3–4 incorrect but plausible completions generated by an LLM.
I expect humans to do better at discernment over longer time windows. The advantage of this test setup is that it’s much closer to tasks humans already have some training on.
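The quiz construction above can be sketched as follows. `distractor_fn` is a hypothetical stand-in for an LLM sampler that proposes plausible fake continuations given the story so far:

```python
import random

def build_multiple_choice(segments, distractor_fn, n_distractors=3, seed=0):
    """Turn a story (a list of sentences or paragraphs) into a quiz.

    For each position, the true next segment is mixed with
    `n_distractors` plausible fakes from `distractor_fn(context)`.
    Returns a list of (context, options, answer_index) triples.
    """
    rng = random.Random(seed)
    quiz = []
    for i in range(len(segments) - 1):
        context = " ".join(segments[: i + 1])
        truth = segments[i + 1]
        options = [truth] + distractor_fn(context)[:n_distractors]
        rng.shuffle(options)  # hide the true answer's position
        quiz.append((context, options, options.index(truth)))
    return quiz
```

Varying the segment length (sentence vs. paragraph) is then a single knob for testing the longer-time-window hypothesis.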