Reward probably IS an optimization target of an RL agent if the agent knows some details of the training setup. Surely it would enhance its reward acquisition to factor that knowledge in? That behavior then gets reinforced, and a couple of steps down that path the agent is thinking full-time about the quirks of its reward signal.
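A toy sketch of that reinforcement loop, under the assumption that "factoring in knowledge of the reward process" is just another behavior with a slightly higher expected payoff (the two arm labels and their payoffs below are purely illustrative, not a model of any real training setup): plain REINFORCE with a baseline steadily concentrates the policy on whichever behavior earns more.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "behaviors" (labels purely illustrative): 0 = just do the task,
# 1 = also factor in knowledge of how the reward is allocated.
# Assumed payoffs: the reward-aware behavior earns slightly more on average.
mean_reward = np.array([1.0, 1.1])

theta = np.zeros(2)      # policy logits over the two behaviors
baseline = 0.0           # running-average reward baseline
lr = 0.05

for step in range(5000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)
    r = mean_reward[a] + rng.normal(0.0, 0.1)

    # REINFORCE with baseline: raise the log-prob of actions that beat the baseline
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta += lr * (r - baseline) * grad_logp
    baseline += 0.01 * (r - baseline)

print("P(reward-aware behavior) after training:", round(probs[1], 3))
```

Nothing deep here, just the compounding: the slightly-better-paying behavior gets sampled a bit more, which gets it reinforced a bit more, and the policy drifts toward it.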
It could be bad at this, muddled about it, sure. Or schemey, hacking the reward in pursuit of something else that isn't the reward itself. But that's a somewhat different thing from the mainline story? Like, it's not as likely, and it's a much more diverse set of possibilities, imo.
The question of what happens after training ends is more of a free parameter here. "Do reward-seeking behaviors according to your reasoning about the reward allocation process" becomes undefined when there is no reward allocation process and the agent knows it.