HCH as a measure of manipulation

A half-baked idea that came out of a conversation with Jessica, Ryan, and Tsvi:

We’d like to have a straightforward way to define “manipulation”, so that we could instruct an AI not to manipulate its developers, or construct a low-impact measure that treats manipulation as a particularly important kind of impact.

As a first pass, we could define manipulation in terms of a human’s expected actions, or, more robustly, in terms of the effects on a human’s policy distribution across a wide array of plausible environments. However, we’d still like our AI to be able to convey information to us (in a non-manipulative manner), rather than hiding from us in an effort to avoid all influence!
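To make the worry concrete, here is a minimal sketch (in Python, with hypothetical names like `human_policy` and `envs`) of that naive policy-distribution measure: the influence of an action is the average shift it causes in the human’s policy, relative to the null action, across environments. Any shift counts, including one caused by the AI truthfully telling us something, which is exactly why minimizing this measure would push the AI toward hiding from us.

```python
from typing import Callable, Dict, Hashable, Iterable

# A finite distribution over the human's possible actions.
Distribution = Dict[Hashable, float]


def total_variation(p: Distribution, q: Distribution) -> float:
    """Total variation distance between two finite distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


def naive_influence(envs: Iterable[Hashable],
                    human_policy: Callable[[Hashable, Hashable], Distribution],
                    a: Hashable,
                    null_action: Hashable = None) -> float:
    """Average shift in the human's policy from the AI taking action `a`
    rather than the null action, across a set of environments.

    Problem: *any* shift counts as influence, even one caused by the AI
    truthfully conveying information, so minimizing this forbids telling
    us anything.
    """
    envs = list(envs)
    return sum(total_variation(human_policy(e, a), human_policy(e, null_action))
               for e in envs) / len(envs)
```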

The title of course spoils the next idea: if the AI can reason about some suitable model of HCH (Paul Christiano’s “Humans Consulting HCH”), then we can define the notion “action a has very low influence on a human, as compared to the null action, apart from conveying information x”: namely, that over a distribution of questions q,

$$P\big(\mathrm{HCH}(x, q) \mid a\big) \;\approx\; P\big(\mathrm{HCH}(x, q) \mid \varnothing\big),$$

where HCH is defined relative to that human; we’re conditioning the distribution of HCH’s output on whether the AI takes action a or the null action; and x, q is the input consisting of the statement x followed by the question q.
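As a rough illustration only (in the proposal the AI reasons about a model of HCH; we couldn’t literally sample it like this), here is a sketch of the quantity being compared, with hypothetical names throughout: `hch(input_text, action)` stands in for a stochastic oracle giving HCH’s answer conditioned on the AI’s action, `sample_question` samples from the distribution over questions q, and we average an empirical divergence between the answer distributions under a and under the null action.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def empirical_tv(xs: Sequence[str], ys: Sequence[str]) -> float:
    """Empirical total variation distance between two sets of sampled answers."""
    px, py = Counter(xs), Counter(ys)
    return 0.5 * sum(abs(px[k] / len(xs) - py[k] / len(ys))
                     for k in set(px) | set(py))


def hch_influence(hch: Callable[[str, Hashable], str],
                  sample_question: Callable[[], str],
                  x: str,
                  a: Hashable,
                  null_action: Hashable = None,
                  n_questions: int = 100,
                  n_samples: int = 50) -> float:
    """Average divergence between HCH's answers to the input "x followed by q",
    conditioned on the AI taking action `a` versus the null action."""
    total = 0.0
    for _ in range(n_questions):
        q = sample_question()
        prompt = x + "\n" + q          # the input x, q from the definition above
        answers_a = [hch(prompt, a) for _ in range(n_samples)]
        answers_null = [hch(prompt, null_action) for _ in range(n_samples)]
        total += empirical_tv(answers_a, answers_null)
    return total / n_questions
```

An action a would then count as non-manipulative (apart from conveying x) when this quantity is small.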

This of course does not exclude the use of manipulative statements x, but it could at least reduce the forms of manipulation we need to worry about to those that could happen via the text input to HCH.

I’d prefer to have the AI reason about HCH rather than just (e.g.) the human’s actions in a one-hour simulation, because HCH can in principle capture a human’s long-term and extrapolated preferences, and these are the ones I most want to ensure don’t get manipulated.

Is there an obvious failure of this approach, an obvious improvement to it, or something simpler that it reduces to?