I think this is one of the major remaining open questions with respect to inner alignment. Personally, I think there is a meaningful sense in which all the models I’m most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I’m definitely uncertain about that. If true, though, it could be quite helpful for solving inner alignment, since it could enable us to factor models into pieces (either through architecture or through transparency tools). Also:
As far as I can tell, Hjalmar Wijk introduced the term “malign generalization” to describe the failure mode that I think is most worth worrying about here.
Hjalmar actually cites this post by Paul Christiano as the source of that term—though Hjalmar’s usage is slightly different.