I’d be interested to see a test more favorable to humans. Maybe humans are better at judging longer completions because of some kind of coherence between tokens, so a test could be:

- a human attempts to distinguish a 5-token GPT-3 continuation from the truth;
- GPT-3 attempts to distinguish a 5-token human continuation from the truth;

and whichever does better is the better language model. It still seems like GPT-3 would win this one, but maybe there are other test designs that measure more human abilities.
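A minimal sketch of what scoring that symmetric discrimination test might look like. The `judge` callable here is a hypothetical stand-in for either a human annotator or a model-based classifier; presentation order is shuffled so a positional bias can't help:

```python
import random

def run_discrimination_test(items, judge, rng):
    """Score a judge on 2-way forced choice: real continuation vs. fake.

    `items` is a list of (prompt, real_continuation, fake_continuation)
    tuples; `judge` maps (prompt, option_a, option_b) -> 0 or 1, the
    index of the option it believes is the true continuation.
    Returns accuracy in [0, 1]; chance performance is 0.5.
    """
    correct = 0
    for prompt, real, fake in items:
        options = [real, fake]
        rng.shuffle(options)  # randomize presentation order
        guess = judge(prompt, options[0], options[1])
        if options[guess] == real:
            correct += 1
    return correct / len(items)
```

Running the same harness once with human judges on GPT-3 fakes and once with GPT-3 as judge on human fakes gives the two accuracies to compare.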
I suspect humans would show better zero-shot performance on a more GAN-like objective over a more interesting dataset.
Take a new, unpublished (human-written) story; for every sentence or paragraph, the test-taker must pick the correct completion, multiple-choice style, from a single correct answer and 3–4 incorrect but plausible completions generated by an LLM.
I expect humans to do better at discernment over longer time windows. The advantage of this test setup is that it’s much closer to tasks humans already have some training on.
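The quiz construction above can be sketched as follows. `distractor_fn` is a hypothetical stand-in for an LLM sampler that proposes plausible fake continuations given the story so far:

```python
import random

def build_multiple_choice(segments, distractor_fn, n_distractors=3, seed=0):
    """Turn a story (a list of sentences or paragraphs) into a quiz.

    For each position, the true next segment is mixed with
    `n_distractors` plausible fakes from `distractor_fn(context)`.
    Returns a list of (context, options, answer_index) triples.
    """
    rng = random.Random(seed)
    quiz = []
    for i in range(len(segments) - 1):
        context = " ".join(segments[: i + 1])
        truth = segments[i + 1]
        options = [truth] + distractor_fn(context)[:n_distractors]
        rng.shuffle(options)  # hide the true answer's position
        quiz.append((context, options, options.index(truth)))
    return quiz
```

Varying the segment length (sentence vs. paragraph) is then a single knob for testing the longer-time-window hypothesis.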