I suspect humans would show better zero-shot performance on a more GAN-like (discriminator-style) objective, on a more interesting dataset.
Take a new, unpublished, human-written story, and for every sentence or paragraph have the testee pick the correct completion, multiple-choice style: a single correct answer alongside 3-4 incorrect but plausible completions generated by an LLM.
I expect humans to do better at discernment over longer time windows, i.e. longer stretches of text. The advantage of this test setup is that it's much closer to tasks humans already have some training on.
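
To make the setup above concrete, here is a rough sketch of how such a quiz could be assembled and scored. Everything in it is an assumption for illustration: `QuizItem`, `build_item`, `build_quiz`, `accuracy`, and the `make_distractors` callback (a stand-in for whatever LLM call produces the plausible-but-wrong completions) are hypothetical names, not part of any existing benchmark.

```python
import random
from dataclasses import dataclass


@dataclass
class QuizItem:
    context: str        # story text shown so far
    choices: list[str]  # shuffled completions, exactly one of them real
    answer_index: int   # position of the real completion in `choices`


def build_item(context, true_completion, distractors, rng):
    """Mix the real completion in with the LLM-written distractors."""
    options = [(true_completion, True)] + [(d, False) for d in distractors]
    rng.shuffle(options)
    choices = [text for text, _ in options]
    answer_index = next(i for i, (_, is_real) in enumerate(options) if is_real)
    return QuizItem(context, choices, answer_index)


def build_quiz(segments, make_distractors, rng, n_distractors=3):
    """One item per sentence/paragraph after the first.

    `make_distractors(context, n)` is a hypothetical hook: it should return
    n plausible but incorrect completions for the given context.
    """
    items = []
    for i in range(1, len(segments)):
        context = " ".join(segments[:i])
        distractors = make_distractors(context, n_distractors)
        items.append(build_item(context, segments[i], distractors, rng))
    return items


def accuracy(items, picks):
    """Fraction of items where the testee picked the real completion."""
    correct = sum(pick == item.answer_index for item, pick in zip(items, picks))
    return correct / len(items)
```

With 3 distractors per item, random guessing lands at about 25% accuracy, so that is the baseline any human or model score should be compared against. Passing the whole preceding story as `context` is one choice among several; truncating it would let you vary the time window directly.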