The idea is that manipulation “overrides″ the human policy regardless of whether that’s good for the goal the human is pursuing (where the human goal presumably affects what πH is selected). While here the override is baked into the dynamics, in realistic settings it occurs because the AI exploits the human decision-making process: by feeding them biased information, through emotional manipulation, etc.
I think this skips over the problems with baking it into the dynamics. Baking manipulation into the dynamics requires us to define manipulation; easy for toy examples, but in real-world applications it runs head-first into nearest unblocked strategy concerns; anything that you forget to define as manipulation is fully up for grabs.
This is why I prefer directly applying a counterfactual to the human policy in my proposal, to entirely override the possibility of manipulation. But that introduces its own difficulties, and is not easy to scale up beyond the stop button. I’ve had a post in the works for a while about the main difficulty I see with my approach here.
I think this skips over the problems with baking it into the dynamics. Baking manipulation into the dynamics requires us to define manipulation; easy for toy examples, but in real-world applications it runs head-first into nearest unblocked strategy concerns; anything that you forget to define as manipulation is fully up for grabs.
This is why I prefer directly applying a counterfactual to the human policy in my proposal, to entirely override the possibility of manipulation. But that introduces its own difficulties, and is not easy to scale up beyond the stop button. I’ve had a post in the works for a while about the main difficulty I see with my approach here.
I skip over those pragmatic problems because this post is not proposing a solution, but rather a measurement I find interesting.