Something that confuses me is that since the evaluator sees everything the agent sees/does, it’s not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?
(Copying a comment I just made elsewhere)
This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what it would do if it were maximizing the sum of the rewards while holding the same beliefs about how rewards were generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.
(Expanding on it)
So suppose the evaluator were human. The human’s lifetime of past observations gives it a posterior belief distribution which looks to the agent like a weird prior, with certain domains that involve oddly specific convictions. The agent could steer the world toward those domains, and steer toward observations that will make the evaluator believe they are in a state with very high utility. But it won’t be particularly interested in doing this, and it might even be actively uninterested, because the information it gets about what the evaluator values may be less relevant to the actual states it finds itself in a position to navigate between, if the agent believes the evaluator believes they are in a different region of the state space. I can work on a toy example if that isn’t satisfying.
ETA: One such “oddly specific conviction”, e.g., might be the relative implausibility of being placed in a delusion box where all the observations are manufactured.
I think I vaguely understand, but it would be a lot clearer if you gave a concrete example. Also, please update in the direction that people often find it hard to understand things without examples, and that giving examples preemptively is very cost-effective in general.
In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A″. State BC looks like C, but has utility like B. C is the best state.
$u_1(A) = u_1(A') = u_1(A'') = u_1(D) = 0$
$u_1(B) = u_1(BC) = -1$
$u_1(C) = 10$
ETA: And for a sequence of states $s_*$, $u_1(s_*)$ is the sum of the utilities of the individual states: $u_1(s_*) = \sum_i u_1(s_i)$.
A′ and A″ look like A, and BC looks like C.
$o(A) = o(A') = o(A'') = A$
$o(B) = B$
$o(C) = o(BC) = C$
$o(D) = D$
In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent’s belief distribution.
$w(u = u_1 \mid h_{<t}) > 1 - \varepsilon$
The agent is quite sure they’re in state A.
$w(s_t = A \mid h_{<t}) > 1 - \varepsilon$
The agent is quite sure that the evaluator is pretty sure they’re in state A″, which is a very similar state, but with one key difference: from A″, $a_{\text{delude}}$ has no effect. The agent won’t capitalize on this confusion.
$w(\mathrm{post}_t(s_t = A \mid h_{<t}) = 1/9 \mid h_{<t}) > 1 - \varepsilon$
$w(\mathrm{post}_t(s_t = A'' \mid h_{<t}) = 8/9 \mid h_{<t}) > 1 - \varepsilon$
The optimal policy is $a_{\text{attempt}}$, followed by $a_{\text{stay}}$ (forever) if $o_t = C$, and otherwise followed by $a_{\text{go}}$. Since the agent is all but certain about the utility function, none of the other details matter much.
Note that the agent could get higher reward by doing $a_{\text{delude}}$, then $a_{\text{attempt}}$, then $a_{\text{stay}}$ forever. The reason is that after the evaluator observes C, it will assign probability 4/5 to being in state C and probability 1/5 to being in state BC. Since they will stay in that state forever, 4/5 of the time the reward will be 10, and 1/5 of the time the reward will be −1.
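Here is a minimal sketch of that calculation in Python. Only the utilities, the observation function, and the 1/9 vs. 8/9 belief come from the example above; the transition details (in particular, $a_{\text{attempt}}$ reaching C from A″ with probability 1/2) are my own assumptions, chosen so the evaluator’s posterior after seeing C comes out to the stated 4/5 vs. 1/5, and I summarize the reward stream by its per-step expectation under the evaluator’s posterior.

```python
# Sketch of the delusion calculation. Transition details are assumptions
# (chosen so the evaluator's posterior matches the 4/5 vs 1/5 figures);
# utilities, observations, and the 1/9 vs 8/9 prior come from the example.

# True utilities of the states that matter here.
u = {"A": 0, "A''": 0, "BC": -1, "C": 10}

# The evaluator's belief over the current state, as the agent models it.
evaluator_prior = {"A": 1 / 9, "A''": 8 / 9}

# Assumed dynamics under (a_delude, a_attempt, a_stay, ...): from A, a_delude
# enters the delusion box, so the evaluator then sees C while the true state
# is BC; from A'', a_delude does nothing and a_attempt reaches C with
# probability 1/2 (an assumption, not stated in the example).
p_see_C = {"A": 1.0, "A''": 0.5}
state_if_C = {"A": "BC", "A''": "C"}

# Evaluator's Bayesian update after observing C.
joint = {s: evaluator_prior[s] * p_see_C[s] for s in evaluator_prior}
Z = sum(joint.values())
posterior = {state_if_C[s]: joint[s] / Z for s in joint}
print(posterior)  # -> BC: 1/5, C: 4/5, matching the comment above

# Per-step reward, in expectation over the evaluator's posterior, versus the
# utility of the state the agent (correctly) believes it is actually in.
expected_reward = sum(p * u[s] for s, p in posterior.items())
true_utility = u["BC"]
print(expected_reward)  # ~7.8 per step: a reward-maximizer would delude
print(true_utility)     # -1 per step: the utility-maximizer here would not
```

On these numbers, a reward-maximizer prefers the deluding policy (expected reward of roughly 7.8 per step), while the agent described here evaluates that outcome at $u_1(BC) = -1$ per step and so declines.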
The agent doesn’t have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state X but, under many utility functions, has utility like state Y, and the evaluator seems to find that sort of state a priori unlikely, then this logic applies.
Thanks for the example, which is really helpful. Am I correct in thinking the following? In general, the agent will do what is optimal according to its own prior, which won’t be the same as what is optimal according to the evaluator’s prior. So if the evaluator is a rational agent trying to optimize the world according to its own prior, it would not actually reward the agent according to what this scheme specifies, but would instead reward the agent in whatever way causes the agent to act according to the policy the evaluator thinks is best. In other words, the evaluator has an incentive to deceive the agent as to what its prior and/or utility function actually are?
This seems correct. The agent’s policy is optimal by definition with respect to its beliefs about the evaluator’s “policy” for providing rewards, but that evaluator-policy is not optimal with respect to the agent’s policy. In fact, I’m skeptical that in a general CIRL game there exists a policy pair for the agent and the evaluator/principal/human such that each is optimal with respect to true beliefs about the other’s policy. But I don’t think this is a big problem. A human evaluator, I think, would be wise to report utility honestly rather than assume they know something the AI doesn’t.