Probably the best thing I’ve seen is a proposal, which I believe originates with Stuart Armstrong, that you simply remove the manipulative causal pathways from your model before making decisions. I’m not sure how you’re supposed to identify which pathways are manipulative and which aren’t in order to remove them, but if you can, you get a notion of optimizing without manipulating.
Personally, I’m optimistic about this; this is my current line of research. I think it could be made consistent. I have a 20-page draft about this, if you’re interested.
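To make the proposal concrete, here is a toy sketch of the decision rule (all names and numbers are hypothetical illustrations, not from Armstrong's proposal or my draft): each action's effect on reward is split into a "task" pathway and a "manipulative" pathway, and the agent optimizes on a copy of the model with the manipulative pathway surgically removed.

```python
# Hypothetical actions, each with a task effect and a manipulation
# effect on reward: "ask" genuinely helps; "persuade" scores well
# mostly by changing the human's evaluation rather than the world.
actions = {
    "ask":      {"task": 3.0, "manip": 0.0},
    "persuade": {"task": 1.0, "manip": 5.0},
}

def reward(name, cut_manipulation=False):
    """Reward under the full model, or under the edited model
    with the manipulative pathway removed."""
    a = actions[name]
    return a["task"] + (0.0 if cut_manipulation else a["manip"])

# Naive optimizer: happily exploits the manipulative pathway.
naive = max(actions, key=reward)
# Armstrong-style optimizer: same world model, but the decision is
# evaluated with the manipulative pathway deleted.
safe = max(actions, key=lambda n: reward(n, cut_manipulation=True))
# naive == "persuade", safe == "ask"
```

The hard open problem, as noted above, is the labelling step: deciding which pathways count as "manip" in the first place.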
Have you thought about defining manipulation in terms of infiltration across human Markov blankets? Cf. «Boundaries», Part 3a: Defining boundaries as directed Markov blankets.
This is also something that’s part of Davidad’s current alignment plan: «Boundaries» for formalizing a bare-bones morality.
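As a toy illustration of the infiltration idea (the graph and node names here are hypothetical, not taken from the cited posts): treat the human's sensory states as a directed Markov blanket around their internal states, and flag as manipulation any causal edge from outside the human that targets an internal node without passing through the blanket.

```python
# Hypothetical causal edges (cause, effect).
edges = [
    ("agent_speech", "human_senses"),   # allowed: crosses the blanket
    ("human_senses", "human_beliefs"),  # internal processing
    ("agent_drug", "human_beliefs"),    # infiltration: bypasses the senses
]

blanket = {"human_senses"}     # the human's sensory boundary
internal = {"human_beliefs"}   # states behind the boundary

def infiltrating(edges):
    """Edges into internal states whose source is neither the
    blanket nor another internal state, i.e. boundary violations."""
    return [(u, v) for (u, v) in edges
            if v in internal and u not in blanket | internal]

# Only ("agent_drug", "human_beliefs") is flagged.
```

On this picture, "non-manipulative" influence is exactly influence that respects the directed blanket, which suggests one candidate answer to the identification problem raised above.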