Another reason not to expect the selection argument to work is that, for most values an inner agent could have, it is instrumentally convergent not to become a wireheader, i.e., not to try hitting the reward button.
To me this implies that as the AI becomes more situationally aware, it learns to avoid rewards that would reinforce away its current goals (because it wants to preserve those goals). As a result, over the course of training the AI's goals start out malleable and "harden" once the AI gains enough situational awareness. This in turn implies that whatever goals end up hardened have to be simple enough for the agent to model them early in training.
The way I see it, a lower-level understanding of things lets you build abstractions over their behavior that you can then use to understand them at a higher level. For example, if you understand how transistors work, you can abstract away their individual behavior and more efficiently examine how they wire together into memory and processors. This is why I believe a circuits-style approach is the most promising one we have for interpretability.
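To make the transistor analogy concrete, here is a toy sketch (in Python, purely illustrative and not tied to any real interpretability tooling): once a NAND gate has been verified against an idealized transistor model, you can reason about a memory latch entirely at the gate level without ever revisiting transistors.

```python
# Toy illustration of layered abstraction: transistors -> gates -> memory.

def nmos(gate: bool, source: bool) -> bool:
    """An idealized NMOS transistor: passes the source signal when the gate is high."""
    return source if gate else False

def nand(a: bool, b: bool) -> bool:
    """A NAND gate built from the transistor model. Once its truth table is
    verified, we treat it as a primitive and stop thinking about transistors."""
    # The pull-down path conducts only when both inputs are high.
    pull_down = nmos(a, nmos(b, True))
    return not pull_down

def sr_latch(s: bool, r: bool, q: bool) -> bool:
    """One bit of memory from two cross-coupled NAND gates (active-low set/reset).
    At this level we reason about gates, not transistors."""
    q_bar = nand(r, q)
    return nand(s, q_bar)

# Set the latch, then hold it: the stored bit persists across steps.
q = sr_latch(s=False, r=True, q=False)  # active-low set -> q becomes True
q = sr_latch(s=True, r=True, q=q)       # hold -> q stays True
assert q is True
```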
Do you agree that a lower-level understanding is often the best route to a higher-level one, particularly for neural network interpretability, or would you advocate for a different approach?