Planned summary for the Alignment newsletter:
The <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) paper defined an optimizer as a system that internally searches through a search space for elements that score highly according to some explicit objective function. However, humans would not qualify as mesa optimizers under this definition, since there is (presumably) no part of the brain that explicitly encodes some objective function that we then try to maximize. In addition, there are inner alignment failures that don’t involve mesa optimization: a small feedforward neural net performs no explicit search, yet when trained in the <@chest and keys environment@>(@A simple environment for showing mesa misalignment@) it learns a policy that always goes to the nearest key, making it behaviorally equivalent to a key maximizer. Rather than talking about “mesa optimizers”, the post recommends that we instead talk about “malign generalization”: the problem that arises when <@capabilities generalize but the objective doesn’t@>(@2-D Robustness@).
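The keys-and-chests failure mode can be made concrete with a toy simulation (all numbers and the `episode` function here are hypothetical illustrations, not the actual setup from the linked post):

```python
# Toy sketch of malign generalization. The learned heuristic
# "always go to the nearest key" is optimal during training, when keys
# are scarce, but at test time, when keys are abundant, it spends every
# step hoarding keys and never opens a chest.

def episode(n_keys, n_chests, steps=10):
    """Score the 'nearest key first' policy: +1 reward per chest opened,
    where opening a chest consumes one held key."""
    keys_left, chests_left = n_keys, n_chests
    keys_held = 0
    reward = 0
    for _ in range(steps):
        if keys_left > 0:                  # heuristic: grab a key if any remain
            keys_left -= 1
            keys_held += 1
        elif keys_held > 0 and chests_left > 0:
            keys_held -= 1                 # spend a key to open a chest
            chests_left -= 1
            reward += 1
    return reward

# Training regime (keys scarce): the heuristic matches the true objective.
print(episode(n_keys=2, n_chests=10))   # → 2, the optimum in this regime
# Test regime (keys abundant): capabilities transfer, the objective doesn't.
print(episode(n_keys=10, n_chests=2))   # → 0, while optimal play would score 2
```

The point is that no explicit search or encoded objective is needed for this failure: a simple reactive policy suffices.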
Planned opinion:
I strongly agree with this post (though note that the post was written right after a conversation with me on the topic, so this isn’t independent evidence). I find it very unlikely that most powerful AI systems will be optimizers as defined in the original paper, but I do think that the malign generalization problem will apply to our AI systems. For this reason, I hope that future research doesn’t specialize to the case of explicit-search-based agents.