Some colleagues and I did some follow-up on the paper in question and I would highly endorse “probably it worked because humans and AIs have very complementary skills”. Regarding their MMLU findings, appendix E of our preprint points out that participants were significantly less likely to answer correctly when engaging in more than one turn of conversation. Engaging in very short (or even zero-turn) conversations happened often enough to provide reasonable error bars on the plot below (data below from MMLU, combined from the paper study & the replication mentioned in appendix E):
I think this suggests that there were some questions humans knew the answer to and the models didn’t, and vice versa, and that some participants employed a strategy of deferring to the model primarily when uncertain.
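That deferral strategy can be sketched as a simple decision rule. This is a hypothetical formalization for illustration only; the study did not model participants' behavior this way, and the function and threshold here are my own invention:

```python
# Hypothetical sketch of the defer-when-uncertain strategy some participants
# seemed to use: answer on your own when confident, otherwise adopt the
# model's answer.

def answer_question(own_answer, own_confidence, model_answer, threshold=0.7):
    """Return the final answer under a defer-when-uncertain policy."""
    if own_confidence >= threshold:
        return own_answer   # trust your own knowledge when confident
    return model_answer     # defer to the model when uncertain
```

If the human's errors and the model's errors fall on different questions, a policy like this can outperform either agent alone, which is consistent with the complementary-skills reading above.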
On the QuALITY findings, the original paper noted that
QuALITY questions are meant to be answerable by English-fluent college-educated adults, but they require readers to thoroughly understand a short story of about 5,000 words, which would ordinarily take 15–30 minutes to read. To create a challenging task that requires model assistance, we ask human participants to answer QuALITY questions under a 5-minute time limit (roughly paralleling Pang et al., 2022; Parrish et al., 2022b,a). This prevents them from reading the story in full and forces them to rely on the model to gather relevant information
so it’s not surprising that an LLM that does get to read the full story outperforms humans here. Based on looking at some of the QuALITY transcripts, I think the uplift for humans + LLMs came from the fact that humans were better at reading comprehension than 2022-era LLMs. For instance, in the first transcript I looked at, the LLM suggested one answer, the human asked for the excerpt of the story that supported that claim, and when the LLM provided it the human noticed that the excerpt contained the relevant information but supported a different answer than the one the LLM had given.