Max Harms comments on 0. CAST: Corrigibility as Singular Target

Max Harms 28 Oct 2025 19:40 UTC
LW: 8 AF: 5
Strong upvote! This strikes me as identifying the most philosophically murky part of the CAST plan. In the back half of this sequence I spend some time staring into the maw of manipulation, which I think is the thorniest issue for understanding corrigibility. There’s a hopeful thought that empowerment is a natural opposite of manipulation, but this is likely incomplete because there are issues about which entity you’re empowering, including counterfactual entities whose existence depends on the agent’s actions. Very thorny. I take a swing at addressing this in my formalism, by penalizing the agent for taking actions that cause value drift from the counterfactual where the agent doesn’t exist, but this is half-baked and I discuss some of the issues.