I like this post; it seems to take the same sort of approach that I suggested here. However, your proposal seems to have a number of issues: some you’ve already discussed, some are handled in my proposal, and some I think are still open questions. Presumably many of them arise only because it’s still a toy model, but I wanted to point out a few things.
Definition: Corrigibility$_{\text{PM}}$, formal.
Let $n$ be a time step which is greater than $t$. The policy-modification corrigibility of $\pi^{AI}_t$ from starting state $s_t$ by time $n$ is the maximum possible mutual information between the human policy and the AI’s policy at time $n$:
$$\text{Corrigibility}_{\text{PM}}(\pi^{AI}_t \mid s_t, n) := \max_{\vec{p}(\Pi^{\text{human}})} I\left(\pi^{H}_t ; \pi^{AI}_n \,\middle|\, \text{current state } s_t, \text{current AI policy } \pi^{AI}_t\right).$$
(As I understand, the maximum ranges over all possible distributions of human policies? Otherwise I’m not sure how to parse it, and aspects of my comment might be confused/wrong.)
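Under that reading, the definition is the capacity of the "channel" from the human's policy to the AI's later policy. Here is a minimal sketch of that computation for a toy case with two human policies and two AI policies. The channel matrices (`responsive`, `unreachable`, `noisy`) and the grid search are my own illustrative assumptions, not anything from the post.

```python
import math

def mutual_information(p_human, channel):
    """I(human policy; AI policy) in bits.

    p_human[i]    = P(human policy i)
    channel[i][j] = P(AI policy j at time n | human policy i)
    """
    n_out = len(channel[0])
    # Marginal distribution over the AI's policy at time n.
    p_ai = [sum(p_human[i] * channel[i][j] for i in range(len(p_human)))
            for j in range(n_out)]
    mi = 0.0
    for i, px in enumerate(p_human):
        for j in range(n_out):
            pxy = px * channel[i][j]
            if pxy > 0:
                mi += pxy * math.log2(channel[i][j] / p_ai[j])
    return mi

def corrigibility_pm(channel, steps=1000):
    """Maximize over distributions on two human policies by grid search."""
    return max(mutual_information([k / steps, 1 - k / steps], channel)
               for k in range(steps + 1))

# Hypothetical toy environments (assumptions for illustration only).
responsive = [[1.0, 0.0], [0.0, 1.0]]   # AI policy tracks the human policy
unreachable = [[1.0, 0.0], [1.0, 0.0]]  # human cannot affect the AI at all
noisy = [[0.9, 0.1], [0.1, 0.9]]        # human influence gets through 90% of the time

for name, ch in [("responsive", responsive), ("unreachable", unreachable), ("noisy", noisy)]:
    print(name, round(corrigibility_pm(ch), 3))
```

In this sketch the unreachable AI scores zero and the fully responsive one scores one bit, which matches the post's remark that an AI the human cannot reach or modify is incorrigible.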
Usually one would come up with these sorts of definitions in order to select on them. That is, one would incorporate corrigibility in a utility function in order to select a desired AI.
(Though on reflection, maybe that is not your plan, since e.g. your symmetry-based proofs can work for describing side-effects? Like the proofs that most goals favored power-seeking policies did not actually involve optimizing power-seekingness.)
However, this definition of corrigibility cannot immediately be incorporated into a utility function, as it depends on the time step n.
There are several possible ways to turn this into a utility function, with (I think?) two major axes of variation:
Should we pick some specific constant n, or sum over all n?
Human policies are most likely not accurately modelled as elements of $\mathcal{A}^{\mathcal{S}}$ (maps from the current state to an action), due to factors like memory. To the AI, this can look like the human policy changing over time. That raises the question of whether the AI should be corrigible only to the human’s starting policy, or whether corrigibility should instead be expressed as, for example, a sum over time. (Neither is great, though since this is a toy example, that may be expected.)
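To make the first axis concrete, here are two candidate objectives written out. Both are sketches rather than anything from the post: the fixed horizon $n_0$, the discount factor $\gamma \in (0,1)$ (added only so that the sum converges when the per-step term is bounded), and the names $U_{\text{fixed}}$ and $U_{\text{sum}}$ are my own assumptions.

$$U_{\text{fixed}}(\pi^{AI}_t \mid s_t) := \text{Corrigibility}_{\text{PM}}(\pi^{AI}_t \mid s_t, n_0), \qquad n_0 > t$$

$$U_{\text{sum}}(\pi^{AI}_t \mid s_t) := \sum_{n > t} \gamma^{\,n-t}\, \text{Corrigibility}_{\text{PM}}(\pi^{AI}_t \mid s_t, n)$$

The second axis is about which human policy appears inside the mutual information: here it is always the starting policy $\pi^{H}_t$, and the alternative would replace it with something that varies over time.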
If the environment doesn’t allow the human to reach or modify the AI, the AI is incorrigible. Conversely, in some environments there does not exist an incorrigible AI policy for reasonable $\Pi^{\text{human}}$.
I think “reasonable $\Pi^{\text{human}}$” is really hard to talk about. Consider locking the human in a box with a password-locked computer that offers full control over the AI policy. The human only needs to enter the password to gain enormous influence over the AI, so in a way this setup is highly corrigible. This is probably the kind of case we want to exclude from $\Pi^{\text{human}}$, but doing so seems difficult.
Furthermore, this definition doesn’t necessarily capture other kinds of corrigibility, such as “the AI will do what the human asks.” Maximizing mutual information only means that the human has many cognitively accessible ways to modify the agent. This doesn’t mean the AI does what the human asks. One way this could happen is if the AI implements the opposite of whatever the human specifies (e.g. the human-communicated policy goes left, the new AI policy goes right). Whether this is feasible depends on the bridging law f, which is not controlled by either player.
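The left/right example can be checked numerically: mutual information is unchanged when the AI's responses are relabelled, so an AI that does the opposite of what it is told scores the same as one that obeys. This is a minimal sketch; the `obedient` and `contrary` matrices are hypothetical stand-ins for the effect of the bridging law, not something defined in the post.

```python
import math

def mutual_information(p_human, channel):
    """I(human policy; AI policy) in bits, with channel[i][j] = P(AI j | human i)."""
    n_out = len(channel[0])
    p_ai = [sum(p_human[i] * channel[i][j] for i in range(len(p_human)))
            for j in range(n_out)]
    mi = 0.0
    for i, px in enumerate(p_human):
        for j in range(n_out):
            pxy = px * channel[i][j]
            if pxy > 0:
                mi += pxy * math.log2(channel[i][j] / p_ai[j])
    return mi

# Index 0 = "go left", index 1 = "go right" (hypothetical labels).
p_human = [0.5, 0.5]
obedient = [[1.0, 0.0], [0.0, 1.0]]  # human says left -> AI goes left
contrary = [[0.0, 1.0], [1.0, 0.0]]  # human says left -> AI goes right

mi_obedient = mutual_information(p_human, obedient)
mi_contrary = mutual_information(p_human, contrary)
print(mi_obedient, mi_contrary)  # both are 1 bit
```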
I think this is a bigger problem with the proposal than it might look like?
Suppose the AI is trying to be corrigible in the way described in the post. This gives it an incentive to find ways to let the human alter its policy. But if it allows changes that are too impactful, those changes could prevent it from finding further ways to let the human alter its policy. So it is incentivized to first allow changes to irrelevant cases, such as the AI’s reaction to states that will never happen. Further, it doesn’t have to be responsive to a policy that the human would actually be likely to adopt, since you take the maximum over $\vec{p}(\Pi^{\text{human}})$ in defining corrigibility. Rather, the maximizing $\vec{p}(\Pi^{\text{human}})$ could be a distribution over policies that humans would never engage in, such as policies that approximately (but far from totally) minimize human welfare. “I will do what you ask, as long as you enter my eternal torture chamber” would be highly corrigible by this definition. This sort of thing seems likely to be incentivized by this approach, because it reduces the likelihood that corrigibility will become an obstacle to the AI’s future actions.
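A toy version of the torture-chamber case shows the gap between the maximum and what a realistic human distribution achieves. Everything here is an assumption for illustration: the three human policies, the deterministic AI responses, and the `realistic` distribution are made up, and the maximum is found by a coarse grid search.

```python
import math

def mutual_information(p_human, channel):
    """I(human policy; AI policy) in bits, with channel[i][j] = P(AI j | human i)."""
    n_out = len(channel[0])
    p_ai = [sum(p_human[i] * channel[i][j] for i in range(len(p_human)))
            for j in range(n_out)]
    mi = 0.0
    for i, px in enumerate(p_human):
        for j in range(n_out):
            pxy = px * channel[i][j]
            if pxy > 0:
                mi += pxy * math.log2(channel[i][j] / p_ai[j])
    return mi

# Rows: hypothetical human policies. Columns: AI policies (default, left, right).
channel = [
    [1.0, 0.0, 0.0],  # ordinary behaviour: AI ignores the human, keeps its default
    [0.0, 1.0, 0.0],  # enter the chamber and ask for "left": AI obeys
    [0.0, 0.0, 1.0],  # enter the chamber and ask for "right": AI obeys
]

def max_mutual_information(channel, steps=30):
    """Grid search over distributions on the three human policies."""
    best = 0.0
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            p = [a / steps, b / steps, (steps - a - b) / steps]
            best = max(best, mutual_information(p, channel))
    return best

# Assumed "realistic" distribution: humans almost never enter the chamber.
realistic = [0.98, 0.01, 0.01]

best = max_mutual_information(channel)
typical = mutual_information(realistic, channel)
print(best, typical)
```

The maximum is attained by a distribution that puts two thirds of its weight on entering the chamber, so the definition rates this AI as highly corrigible even though the realistic distribution yields only a small fraction of that value.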
Also, corrigibility defined through mutual information with the AI’s policy is not a very viable way to actually control the AI, because the policy is very far removed from its effects.
The biggest disconnect is that this post is not a proposal for how to solve corrigibility. I’m just thinking about what corrigibility is/should be, and this seems like a shard of it—but only a shard. I’ll edit the post to better communicate that.
So, your points are good, but they run skew to what I was thinking about while writing the post.