Mikhail Samin comments on Why Corrigibility is Hard and Important (i.e. “Whence the high MIRI confidence in alignment difficulty?”)

Mikhail Samin 2 Oct 2025 15:44 UTC
2 points
0
Giving the AI only corrigibility as a terminal goal is not impossible; it is merely anti-natural for many reasons including because the goal-achieving machinery still there will, with a terminal goal other than corrigibility, output the same seemingly corrigible behavior while tested, for instrumental reasons; and our training setups do not know how to distinguish between the two; and growing the goal-achieving machinery to be good at pursuing particular goals will make it attempt to have a goal other than corrigibility crystallize. Gradient descent will attempt to go to other places.
But sure, if you’ve successfully given your ASI corrigibility as the only terminal goal, congrats, you’ve gone much further than MIRI expected humanity to go with anything like the current tech. The hardest bit was getting there.
I would be surprised if Max considers corrigibility to have been reduced to an engineering problem.