I think the failure case identified in this post is plausible (and likely), and it is very clearly explained, so props for that!
However, I agree with Jacob’s criticism here. Any AGI success story basically has to have “the safest model” also be “the most powerful” model, because of incentives and coordination problems.
Models that are themselves optimizers are going to be significantly more powerful and useful than “optimizer-free” models. So the suggestion of trying to avoid mesa-optimization altogether is a bit of a fabricated option. There is an interesting parallel here with the suggestion of just “not building agents” (https://www.gwern.net/Tool-AI).
So from where I am sitting, we have no option but to tackle aligning the mesa-optimizer cascade head-on.
This post seems to be using a different meaning of “consequentialism” from the one I am familiar with (that of moral philosophy). Consequently, I’m struggling to follow the narrative from “consequentialism is convergently instrumental” onwards.
Can someone give me some pointers on how I should be interpreting the definition of consequentialism here? If it is just the moral philosophy definition, then I’m getting very confused as to why “judge the morality of actions by their consequences” is a useful subgoal for agents to optimize for...