Agree that this looks like a promising approach. People interested in this idea can read some additional discussion in Scenario 2: Reliable mesa-optimizer detection and precise goal read-offs from my post from May, “Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios”.
As you mention, having this kind of advanced interpretability essentially solves the inner alignment problem, but leaves a big question mark about outer alignment. In that Scenario 2 link above, I discuss the expected impacts of this kind of interpretability on a bunch of different outer alignment and robustness techniques, including: Relaxed adversarial training, Intermittent oversight, Imitative amplification, Approval-based amplification, Recursive reward modeling, Debate, Market making, Narrow reward modeling, Multi-agent, Microscope AI, STEM AI and Imitative generalization. [1] (To get some of these details, though, you need to follow the link to the Appendix 1 section about this scenario.)
I’m not totally sure that the ability to reliably detect mesa-optimizers and their goals/optimization targets would automatically grant us the ability to “Just Retarget the Search” on a hot model. It might, but I agree with your section on Problems that it may look more like restarting training on models where we detect a goal that’s different from what we want. But this still seems like it could accomplish a lot of what we want from being able to retarget the search on a hot model, even though it’s clunkier.
--
[1]: In a lot of these techniques it can make sense to check that the mesa-optimizer is aligned (and do some kind of goal retargeting if it’s not). However, in others we probably want to take advantage of this kind of advanced interpretability in different ways. For example, in Imitative amplification, we can just use it to make sure mesa-optimization is not introduced during distillation steps, rather than checking that any mesa-optimization that is introduced is also aligned.