I had a thought about a potential reward function for Neutrality+. For each batch you would:
Run the agent in the environment many times to build a dataset of episodes. The environment would have ways of getting shut down interwoven with ways of getting reward.
For each trajectory length, select 10 episodes from the dataset where the agent was shut down at that trajectory length.
Sum the reward across all of the selected episodes (10 × the number of trajectory lengths episodes in total); that sum would be the reward for the batch.
The idea is that the agent’s reward is conditioned on a guarantee of having reached each trajectory length a constant number of times, so it will have no incentive to affect the likelihood of any given trajectory length, but it will still be incentivized to increase its reward given any trajectory length.
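To make the scheme concrete, here is a minimal sketch of the batch reward computation described above. All names are illustrative (there is no existing codebase this is drawn from), and episodes are assumed to be representable as (trajectory_length, total_reward) pairs:

```python
import random
from collections import defaultdict

def batch_reward(episodes, per_length=10):
    """Compute the batch reward: sample a fixed number of episodes per
    trajectory length (the step at which the agent was shut down) and
    sum the reward across all selected episodes.

    `episodes` is a list of (trajectory_length, reward) pairs.
    """
    by_length = defaultdict(list)
    for length, reward in episodes:
        by_length[length].append(reward)
    total = 0.0
    for length, rewards in by_length.items():
        # Sample with replacement in case fewer than `per_length`
        # episodes of this length exist in the dataset.
        sample = random.choices(rewards, k=per_length)
        total += sum(sample)
    return total
```

Because exactly `per_length` episodes contribute for every observed length, the number of times each length appears in the sum is fixed regardless of how the agent's policy shifts the length distribution, which is the neutrality property the scheme is after.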
Adjustments:
Change the value 10 to a constant that may differ for each trajectory length. These constants reflect both the relative weighting of the utility for that trajectory length and the priority of sampling data for it. Additional constants can be added to make the sum a weighted sum (these would affect only the relative utility weighting), and the two sets of constants could be played off against each other so that the optimal data priority and the optimal utility weighting for each trajectory length could both be achieved.
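The first adjustment might look like the following sketch, where `counts[L]` (the per-length sample count, i.e. data priority) and `weights[L]` (the per-length utility weight) are the two sets of constants that can be traded off against each other. The names are hypothetical:

```python
import random
from collections import defaultdict

def weighted_batch_reward(episodes, counts, weights):
    """Batch reward with per-trajectory-length constants.

    `counts[L]`  - number of episodes sampled at length L (data priority)
    `weights[L]` - multiplier applied to each sampled reward at length L
                   (utility weighting)
    Decoupling the two lets you give a length lots of data without
    letting it dominate the utility, or vice versa.
    """
    by_length = defaultdict(list)
    for length, reward in episodes:
        by_length[length].append(reward)
    total = 0.0
    for length, rewards in by_length.items():
        sample = random.choices(rewards, k=counts.get(length, 0))
        total += weights.get(length, 1.0) * sum(sample)
    return total
```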
Change the constant number of episodes per trajectory length to a probability of sampling an episode of that length. Then simply choose a random sample in which each episode has a constant probability of coming from a particular trajectory length. This keeps the batch size from exploding as a function of the number of possible trajectory lengths.
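The second adjustment could be sketched like this: each batch slot first draws a trajectory length from a fixed distribution, then draws a uniform episode of that length, so the per-length probability stays constant while the batch size is fixed. Again, the function and parameter names are just for illustration:

```python
import random
from collections import defaultdict

def sample_batch(episodes, length_probs, batch_size):
    """Draw a fixed-size batch of episodes.

    `length_probs[L]` is the constant probability that any given slot in
    the batch is filled by an episode of trajectory length L, so the
    batch size no longer grows with the number of possible lengths.
    Episodes are (trajectory_length, reward) pairs.
    """
    by_length = defaultdict(list)
    for ep in episodes:
        by_length[ep[0]].append(ep)
    lengths = list(length_probs.keys())
    probs = [length_probs[L] for L in lengths]
    batch = []
    for _ in range(batch_size):
        # First sample a length, then a uniform episode of that length.
        L = random.choices(lengths, weights=probs, k=1)[0]
        batch.append(random.choice(by_length[L]))
    return batch
```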
I am curious how this scheme lines up with your plan for testing Neutrality+. My reading was that the plan is closer to building training for the Ramsey Yardstick into DReST, but I couldn’t quite work out how I would do that.
Oh nice! I like this idea. Let’s talk about it more tomorrow.