I assume there will be a nontrivial period when the AI behaves corrigibly in situations that closely resemble its training environment, but would behave incorrigibly in some unusual situations.
As a rule of thumb, anything smart enough to be dangerous is dangerous because it can do scientific research and self-improve. If it can’t tell when it’s out of distribution and might need to generate some new hypotheses, it can’t do scientific research, so it’s not that dangerous. So yes, there might be some very unusual situation that would decrease corrigibility; but for an ASI, just taking it out of its training distribution should pretty reliably cause it to say “I know that I don’t know what I’m doing, so I should be extra cautious/pessimistic, and that includes being extra corrigible.”
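To make the “detect out-of-distribution, then get more cautious” mechanism concrete, here is a minimal toy sketch. It is not from the discussion above and not a claim about how an actual ASI would work: it just fits a simple distance-based novelty detector to training inputs and falls back to deferring to the operator when an input looks too unfamiliar. All names (`CautiousAgent`, `defer_to_operator`, the threshold value) are illustrative assumptions.

```python
import numpy as np

class CautiousAgent:
    """Toy agent that defers to its operator on inputs far from its training data."""

    def __init__(self, train_inputs: np.ndarray, ood_threshold: float = 3.0):
        # Summarize the training distribution with per-feature mean and std.
        self.mean = train_inputs.mean(axis=0)
        self.std = train_inputs.std(axis=0) + 1e-8
        self.ood_threshold = ood_threshold

    def ood_score(self, x: np.ndarray) -> float:
        # Root-mean-square z-score: a crude distance from the training distribution.
        z = (x - self.mean) / self.std
        return float(np.sqrt(np.mean(z ** 2)))

    def act(self, x: np.ndarray) -> str:
        if self.ood_score(x) > self.ood_threshold:
            # "I know that I don't know what I'm doing" -> be extra corrigible.
            return "defer_to_operator"
        return "act_normally"

# Usage: fit on inputs near the origin, then probe with a far-off input.
rng = np.random.default_rng(0)
agent = CautiousAgent(rng.normal(0.0, 1.0, size=(1000, 4)))
print(agent.act(np.zeros(4)))        # in-distribution -> "act_normally"
print(agent.act(np.full(4, 10.0)))   # far out-of-distribution -> "defer_to_operator"
```

The sketch only illustrates the shape of the argument: the hard part being debated is whether the unusual situations that erode corrigibility are exactly the ones such a detector would flag.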