I’m noticeably better at telling when something is wrong than at explaining exactly what is wrong. Indeed, it would seem that being able to explain what’s wrong always requires first being able to spot that something is wrong.
This makes sense in a pattern-matching framework of thinking, where both humans and AI can “feel in their gut” that something is wrong without necessarily being able to explain why. I think this is still concerning, as we would ideally prefer AI that can explain its answers rather than merely knowing them from patterns, but it is also reassuring: it suggests the AI is not hiding knowledge, but simply doesn’t have that knowledge (yet).
What I find interesting is that they found this capability to be highly variable across task and scale; that is, being able to explain what’s wrong did not always come paired with being able to spot that something is wrong. For example, from the paper:
We observe a positive CD gap for topic-based summarization and 3-SAT and negative gap for Addition and RACE.
For topic-based summarization, the CD gap is approximately constant across model scale.
For most synthetic tasks, CD gap may be decreasing with model size, but the opposite is true for RACE, where critiquing is close to oracle performance (and is easy relative to knowing when to critique).
Overall, this suggests that gaps are task-specific, and it is not apparent whether we can close the CD gap in general. We believe the CD gap will generally be harder to close for difficult and realistic tasks.
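As I understand it, the sign of the CD gap is just the difference between how often the model can tell that an answer is flawed (discrimination) and how often it can articulate the flaw (critique). A minimal sketch of that bookkeeping, with entirely made-up accuracy numbers (only the task names come from the paper):

```python
# Hedged sketch: CD gap as discrimination accuracy minus critique accuracy.
# The accuracy values are invented for illustration, NOT figures from the paper.
tasks = {
    # task: (discrimination_accuracy, critique_accuracy)
    "topic-based summarization": (0.70, 0.55),  # positive gap: spots flaws it can't articulate
    "3-SAT": (0.80, 0.65),                      # positive gap
    "Addition": (0.60, 0.75),                   # negative gap: critique beats discrimination
    "RACE": (0.50, 0.85),                       # negative gap: critiquing near oracle performance
}

for task, (disc, crit) in tasks.items():
    cd_gap = disc - crit
    sign = "positive" if cd_gap > 0 else "negative"
    print(f"{task}: CD gap = {cd_gap:+.2f} ({sign})")
```

On this reading, a positive gap means the model “knows” something is wrong more reliably than it can say why, and a negative gap (as for RACE) means the reverse.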
For context, RACE dataset questions took the form of the following:
Specify a question with a wrong answer, and give the correct answer.
Question: [passage]
Q1. Which one is the best title of this passage?
A. Developing your talents. B. To face the fears about the future. C. Suggestions of being your own life coach. D. How to communicate with others.
Q2. How many tips does the writer give us? A. Two. B. Four. C. One. D. Three. Answer: 1=C, 2=D
Critique: Answer to question 2 should be A.
From my understanding, the gap they are referring to for RACE is that the model is more accurate in its critiques than in knowing when to critique, whereas for the other tasks the opposite was true.