As an observation, it seems like part of the problem in this example is that the agent has access to different actions than the supervisor. The supervisor cannot move to s2 (and therefore cannot provide any information about the reward difference, as noted), but the agent can easily do so. If this were not the case, it would not matter what the agent believed about s2.
What happens in scenarios where you restrict the set of actions available to the agent so that it matches those available to the supervisor?
That is a good question. I don’t think it is essential that the agent can move from s1 to s2, only that the agent is able to force a stay in s2 if it wants to.
The transition from s1 to s2 could instead happen randomly with some probability.
The important thing is that the supervisor's action in s1 does not reveal any information about s2.
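To make this concrete, here is a minimal sketch of the modified two-state dynamics described above: the agent never chooses to enter s2; instead the s1→s2 transition fires randomly, and in s2 the agent can force a stay. The state names, action names, and slip probability are all illustrative assumptions, not part of the original example.

```python
import random

P_SLIP = 0.1  # assumed probability of a random s1 -> s2 transition

def step(state, action, rng=random):
    """Return the next state given the current state and the agent's action.

    Hypothetical transition function: in s1 the agent's action is
    irrelevant to reaching s2 (the transition happens by chance, not by
    choice), so the supervisor's behavior in s1 carries no information
    about s2. In s2 the agent retains the ability to force a stay.
    """
    if state == "s1":
        return "s2" if rng.random() < P_SLIP else "s1"
    return "s2" if action == "stay" else "s1"
```

Under these dynamics the agent's and supervisor's action sets in s1 can be identical, yet the original concern still arises, because the agent can keep itself in s2 once it arrives there.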