As davidad suggests in that tweet, one way you might end up running into this is with RL that reinforces successful trajectories without great credit assignment, which could result in a model having very high confidence that its actions are always right. In practice that overconfidence wasn't obvious enough to be caught by various evals, and IMO it could easily carry over into settings like high-stakes alignment research.
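To make the mechanism concrete, here's a minimal sketch of trajectory-level REINFORCE with no per-step credit assignment. The tiny policy, toy episode, and all names in it are purely hypothetical illustration, not anyone's actual training setup; the point is just that a single episode-level reward multiplies every step's log-probability identically, so a mistake the agent later recovered from gets reinforced exactly as strongly as the recovery itself.

```python
import torch
import torch.nn as nn

# Hypothetical toy policy, purely for illustration.
class TinyPolicy(nn.Module):
    def __init__(self, n_states=4, n_actions=3):
        super().__init__()
        self.logits = nn.Linear(n_states, n_actions)

    def log_prob(self, state, action):
        dist = torch.distributions.Categorical(logits=self.logits(state))
        return dist.log_prob(action)

policy = TinyPolicy()
optimizer = torch.optim.SGD(policy.parameters(), lr=0.1)

# One "successful" episode: suppose the action at step 1 was actually a
# mistake the agent later recovered from.
states = torch.eye(4)                 # four one-hot toy states
actions = torch.tensor([0, 2, 1, 0])  # action taken at each step
episode_reward = 1.0                  # single scalar for the whole episode

log_probs = torch.stack(
    [policy.log_prob(s, a) for s, a in zip(states, actions)]
)
# The episode-level scalar multiplies every step identically: the
# mistaken action at step 1 gets the same positive credit as the
# action that fixed it.
loss = -(episode_reward * log_probs).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With a per-step value baseline or advantage estimate, the mistaken step would receive less (or negative) credit; with pure trajectory-level reward, enough successful episodes can teach the model that everything it did along the way was right.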
Related: Gemini being delusionally confident that its misclicks are always due to system or human error rather than its own mistakes.