I’m noticeably better at telling when something is wrong than at explaining exactly what is wrong. Indeed, it would seem that being able to explain what’s wrong always requires first being able to spot that something is wrong.
This makes sense in a pattern-matching framework of thinking, where both humans and AI can “feel in their gut” that something is wrong without necessarily being able to explain why. I think this is still concerning, as we would ideally prefer AI that can explain its answers rather than merely knowing them from patterns, but it is also reassuring: it suggests the AI is not hiding knowledge, but simply doesn’t have that knowledge (yet).
What I find interesting is that they found this capability to be highly variable across task and scale; that is, being able to explain what’s wrong did not always come paired with being able to spot that something is wrong. For example, from the paper:
We observe a positive CD gap for topic-based summarization and 3-SAT and negative gap for Addition and RACE.
For topic-based summarization, the CD gap is approximately constant across model scale.
For most synthetic tasks, CD gap may be decreasing with model size, but the opposite is true for RACE, where critiquing is close to oracle performance (and is easy relative to knowing when to critique).
Overall, this suggests that gaps are task-specific, and it is not apparent whether we can close the CD gap in general. We believe the CD gap will generally be harder to close for difficult and realistic tasks.
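As I understand it, the sign of the CD gap is just the difference between how often the model can tell that an answer is flawed (discrimination) and how often it can articulate the flaw (critique). A minimal sketch of that bookkeeping, with entirely made-up accuracy numbers (only the task names come from the paper):

```python
# Hedged sketch: CD gap as discrimination accuracy minus critique accuracy.
# The accuracy values are invented for illustration, NOT figures from the paper.
tasks = {
    # task: (discrimination_accuracy, critique_accuracy)
    "topic-based summarization": (0.70, 0.55),  # positive gap: spots flaws it can't articulate
    "3-SAT": (0.80, 0.65),                      # positive gap
    "Addition": (0.60, 0.75),                   # negative gap: critique beats discrimination
    "RACE": (0.50, 0.85),                       # negative gap: critiquing near oracle performance
}

for task, (disc, crit) in tasks.items():
    cd_gap = disc - crit
    sign = "positive" if cd_gap > 0 else "negative"
    print(f"{task}: CD gap = {cd_gap:+.2f} ({sign})")
```

On this reading, a positive gap means the model “knows” something is wrong more reliably than it can say why, and a negative gap (as for RACE) means the reverse.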
For context, RACE dataset questions took the form of the following:
Specify a question with a wrong answer, and give the correct answer.
Question: [passage]
Q1. Which one is the best title of this passage?
A. Developing your talents. B. To face the fears about the future. C. Suggestions of being your own life coach. D. How to communicate with others.
Q2. How many tips does the writer give us? A. Two. B. Four. C. One. D. Three. Answer: 1=C, 2=D
Critique: Answer to question 2 should be A.
From my understanding, the gap they are referring to for RACE is that the model is more accurate in its critiques than in knowing when to critique, whereas for the other tasks the opposite was true.