As others have commented, it’s difficult to understand what this math is supposed to say.
My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator’s distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.
But this may be inaccurate, or there may be other material ideas here that I’ve missed.
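To spell that reading out in symbols (my own notation, not the post’s, so this is a sketch of my interpretation rather than a restatement of the math): the evaluator holds a belief $b^E$ over the state, the reward the agent actually receives is the utility the evaluator expects under that belief, and the agent’s target is the utility the evaluator would assign if it knew the true state:

$$r \;=\; \mathbb{E}_{s \sim b^E}\big[U(s)\big], \qquad \pi^{*} \;=\; \operatorname*{arg\,max}_{\pi}\ \mathbb{E}\big[\,U(s_{\text{true}}) \mid \pi\,\big],$$

so the agent treats the observed reward $r$ only as evidence about $U$ (and about $b^E$), not as the thing to maximize directly.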
Yep.
Ok! That’s very useful to know.
It seems pretty related to the Inverse Reward Design paper. I guess it’s a variation. Your setup seems to be more specific about how the evaluator acts, but more general about the environment.
A bit of a nitpick: both IRD and this proposal specify how the agent believes the evaluator acts, while remaining technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; the experiments and theory might rest on additional assumptions about the evaluator).
I believe this agent’s beliefs about how the evaluator acts are much more general than IRD’s. If the agent believed the evaluator was certain about which environment it was in, and that environment was the “training environment” from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator’s beliefs might be.
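A toy sketch of what I mean, in Python I made up for this comment (the representation, the names, and the “evaluator reports expected utility under their belief” model are my assumptions, not the paper’s or the post’s): the agent scores candidate utility functions by whether they explain the observed reward under some hypothesis about the evaluator’s belief. Restricting the belief hypotheses to a point mass on the IRD training environment recovers something IRD-like; allowing broader hypotheses is the more general behavior I’m gesturing at.

```python
# Toy model, not from the paper: the agent assumes the evaluator reports the
# utility it expects under its own belief about the state, then asks which
# candidate utility functions are consistent with the observed reward.

def score_utilities(observed_reward, candidate_utilities,
                    candidate_evaluator_beliefs, prior):
    """Unnormalized posterior weight for each candidate utility function,
    marginalizing over hypotheses about the evaluator's belief."""
    def predicted_reward(utility, belief):
        # The agent's model of the evaluator: report expected utility.
        return sum(p * utility[s] for s, p in belief.items())

    weights = {}
    for name, utility in candidate_utilities.items():
        total = 0.0
        for belief in candidate_evaluator_beliefs:
            if abs(predicted_reward(utility, belief) - observed_reward) < 1e-9:
                total += prior(utility, belief)
        weights[name] = total
    return weights

# Two candidate utilities over two states, and two hypotheses about the
# evaluator's belief: an IRD-like point mass on the training state, and a
# more uncertain belief.
utilities = {"u_training_only": {"training": 1.0, "deployment": 0.0},
             "u_both":          {"training": 1.0, "deployment": 1.0}}
beliefs = [{"training": 1.0},                      # IRD-like: certain of training env
           {"training": 0.5, "deployment": 0.5}]   # broader hypothesis
print(score_utilities(1.0, utilities, beliefs, prior=lambda u, b: 1.0))
# In this toy model, the point-mass belief alone leaves the two utilities
# tied (they agree on the training state), while the broader hypothesis
# happens to add weight to the utility that is also high at deployment.
```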
I agree this agent should definitely be compared to IRD, since they are both agents who don’t “take rewards literally”, but rather process them in some way first. Note that the design space of things which fit this description is quite large.