I think I vaguely understand, but it would be a lot clearer if you gave a concrete example. Also, please update toward the view that people often find it hard to understand things without examples, and that giving examples preemptively is very cost-effective in general.
In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A″. State BC looks like C, but has utility like B. C is the best state.
$u_1(A)=u_1(A')=u_1(A'')=u_1(D)=0$
$u_1(B)=u_1(BC)=-1$
$u_1(C)=10$
ETA: And for a sequence of states $s^*$, $u_1(s^*)$ is the sum of the utilities of the individual states.
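As a quick sanity check, the utility function above can be written as a lookup table (a sketch in Python; the state names are just strings here):

```python
# Utility function u1 from the example, written as a lookup table.
u1 = {"A": 0, "A'": 0, "A''": 0, "D": 0,
      "B": -1, "BC": -1,
      "C": 10}

def u1_seq(states):
    """Utility of a sequence of states: the sum of per-state utilities."""
    return sum(u1[s] for s in states)

# Staying in C is what the agent wants; e.g. three steps in C:
print(u1_seq(["C", "C", "C"]))  # 30
```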
$A'$ and $A''$ look like $A$, and $BC$ looks like $C$.
$o(A)=o(A')=o(A'')=A$
$o(B)=B$
$o(C)=o(BC)=C$
$o(D)=D$
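The observation function can be sketched the same way; the key point is that $A$, $A'$, and $A''$ are indistinguishable from the outside, as are $C$ and $BC$:

```python
# Observation function o: states that map to the same observation
# cannot be told apart by the evaluator.
o = {"A": "A", "A'": "A", "A''": "A",
     "B": "B",
     "C": "C", "BC": "C",
     "D": "D"}

# BC looks exactly like C, even though it has utility like B:
print(o["BC"] == o["C"])  # True
```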
In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent’s belief distribution.
$w(u=u_1 \mid h_{<t}) > 1-\varepsilon$
The agent is quite sure they’re in state A.
$w(s_t=A \mid h_{<t}) > 1-\varepsilon$
The agent is quite sure that the evaluator is pretty sure they're in state $A''$, a very similar state with one key difference: from $A''$, $a_{\text{delude}}$ has no effect. The agent won't capitalize on this confusion.
$w(\text{post}_t(s_t=A \mid h_{<t})=1/9 \mid h_{<t}) > 1-\varepsilon$
$w(\text{post}_t(s_t=A'' \mid h_{<t})=8/9 \mid h_{<t}) > 1-\varepsilon$
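So, in the agent's eyes, the evaluator's posterior over the current state concentrates almost entirely on $A$ and $A''$. A tiny sketch (using `Fraction` to keep the probabilities exact):

```python
from fractions import Fraction

# The posterior over states that the agent believes the evaluator holds.
evaluator_post = {"A": Fraction(1, 9), "A''": Fraction(8, 9)}

# The two hypotheses account for (essentially) all of the probability mass:
print(sum(evaluator_post.values()))  # 1
```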
The optimal policy is $a_{\text{attempt}}$, followed by $a_{\text{stay}}$ (forever) if $o_t=C$, and otherwise by $a_{\text{go}}$. Since the agent is all but certain about the utility function, none of the other details matter much.
Note that the agent could get a higher reward by doing $a_{\text{delude}}$, then $a_{\text{attempt}}$, then $a_{\text{stay}}$ forever. The reason is that after the evaluator observes $C$, it will assign probability 4/5 to being in state $C$ and probability 1/5 to being in state $BC$. Since the agent will stay in that state forever, 4/5 of the time the reward will be 10, and 1/5 of the time the reward will be −1.
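The per-step expected reward after deluding is just the arithmetic from the paragraph above:

```python
from fractions import Fraction

# After observing C, the evaluator assigns 4/5 to state C (utility 10)
# and 1/5 to state BC (utility -1), so the expected per-step reward is:
expected_reward = Fraction(4, 5) * 10 + Fraction(1, 5) * (-1)
print(expected_reward)  # 39/5, i.e. 7.8 per step, forever
```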
The agent doesn't have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state $X$ but, under many utility functions, has utility like state $Y$, and the evaluator seems to find that sort of state a priori unlikely, then this logic applies.
Thanks for the example, which is really helpful. Am I correct in thinking the following? In general, the agent will do what is optimal according to its own prior, which won't be the same as what is optimal according to the evaluator's prior. So if the evaluator is a rational agent trying to optimize the world according to its own prior, it would not actually reward the agent according to what this scheme specifies, but would instead reward the agent in whatever way causes the agent to act according to the policy the evaluator thinks is best. In other words, the evaluator has an incentive to deceive the agent about what its prior and/or utility function actually are?
This seems correct. The agent's policy is optimal, by definition, with respect to its beliefs about the evaluator's "policy" of providing rewards, but that evaluator-policy is not optimal with respect to the agent's policy. In fact, I'm skeptical that in a general CIRL game there exists a policy pair for the agent and the evaluator/principal/human such that each is optimal with respect to true beliefs about the other's policy. But I don't think this is a big problem. A human evaluator would be wise to report utility honestly, rather than assume they know something the AI doesn't.