I think this is one of the major remaining open questions with respect to inner alignment. Personally, I think there is a meaningful sense in which all the models I’m most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I’m definitely uncertain about that. If true, though, it could be quite helpful for solving inner alignment, since it could enable us to factor models into pieces (either through architecture or through transparency tools). Also:
As far as I can tell, Hjalmar Wijk introduced the term “malign generalization” to describe the failure mode that I think is most worth worrying about here.
Hjalmar actually cites this post by Paul Christiano as the source of that term—though Hjalmar’s usage is slightly different.