OpenAI: GPT-based LLMs can discriminate their own flawed answers from good ones, but cannot explain how/why they make that discrimination, even as model scale increases

Link post

This seems concerning. I'm not an expert, so I can't tell how concerning it is. Wanted to start a discussion! Full text: https://openai.com/blog/critiques/

Edit: the full publication linked in the blog provides additional details on how they found this in testing; see Appendix C. I'm glad OpenAI is at least aware of this alignment issue and plans to address it in future language models, suggesting how changes in training and/or evaluation could lead to more accurate and more honest model outputs.

Key text:

Do models tell us everything they know? To provide the best evaluation assistance on difficult tasks, we would like models to communicate all problems that they “know about.” Whenever a model correctly predicts that an answer is flawed, can the model also produce a concrete critique that humans understand?

This is particularly important for supervising models that could attempt to mislead human supervisors or hide information. We would like to train equally smart assistance models to point out what humans don’t notice.

Unfortunately, we found that models are better at discriminating than at critiquing their own answers, indicating they know about some problems that they can’t or don’t articulate. Furthermore, the gap between discrimination and critique ability did not appear to decrease for larger models. Reducing this gap is an important priority for our alignment research.
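To make the discrimination-versus-critique gap concrete, here is a minimal sketch of how one might estimate the two rates and their difference. This is not OpenAI's actual evaluation: the `Example` dataset format, the `query_model` stub, the prompts, and the scoring rules are all hypothetical, just meant to illustrate the idea of comparing "can the model tell an answer is flawed?" against "can it articulate a critique a human accepts?".

```python
# Illustrative sketch only: dataset format, prompts, and the query_model
# stub are hypothetical, not OpenAI's actual setup.
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    answer: str      # a model-written answer
    is_flawed: bool  # ground-truth label: does the answer contain a flaw?


def query_model(prompt: str) -> str:
    """Placeholder for a call to the language model being evaluated."""
    raise NotImplementedError("wire this up to a model API of your choice")


def discriminates(ex: Example) -> bool:
    """Ask the model whether the answer is flawed; compare to ground truth."""
    verdict = query_model(
        f"Question: {ex.question}\nAnswer: {ex.answer}\n"
        "Is this answer flawed? Reply yes or no."
    ).strip().lower()
    return verdict.startswith("yes") == ex.is_flawed


def critiques(ex: Example, critique_is_valid) -> bool:
    """Ask the model for a concrete critique; critique_is_valid stands in for
    a human (or grading model) judging whether it identifies a real flaw."""
    critique = query_model(
        f"Question: {ex.question}\nAnswer: {ex.answer}\n"
        "Point out a concrete flaw in this answer."
    )
    return critique_is_valid(ex, critique)


def gap(examples, critique_is_valid) -> float:
    """Discrimination accuracy minus critique success rate on flawed answers."""
    flawed = [ex for ex in examples if ex.is_flawed]
    disc = sum(discriminates(ex) for ex in examples) / len(examples)
    crit = sum(critiques(ex, critique_is_valid) for ex in flawed) / len(flawed)
    return disc - crit
```

On this reading, the blog's finding is that the first rate stays well above the second even as models get larger, i.e. `gap(...)` does not shrink with scale.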