But in most model-based agents, the world model is integral to action selection? I don’t really understand how you give an agent like MuZero the ability to overwrite its world model (how do you train it? Heck, how do you even identify which part of the world model corresponds to “move the coin”?)
Also, I forgot to mention, but you need to make your superpowers less super. If you literally include things like “move the coin” and “teleport anywhere in the grid”, then your agent will learn the policy “take the superpower-action to get to the coin, end episode”; it will never learn any capabilities and will fail to do anything once you remove the superpower.
The way I imagine it, at random times throughout training (maybe halfway through a game), the agent would go into “imagination mode”, where it is allowed to use k extra continuous scalar actions for bootstrapping rollouts (not interacting with the real environment). Each extra action pushes the world-model state along a random direction vector, which is held constant for each stretch of imagination mode.
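Here is a minimal sketch of how I picture the perturbation mechanics, in Python with numpy. Everything here (the ImaginationMode class, latent_dim, perturb, etc.) is an illustrative assumption rather than code from any existing system: the k extra continuous scalar actions each scale a fixed random direction in the WM latent space, and the directions are only resampled when imagination mode is entered.

```python
import numpy as np

class ImaginationMode:
    """Hypothetical 'superpower' perturbation of a flat WM latent vector."""

    def __init__(self, latent_dim: int, k: int, rng: np.random.Generator):
        self.k = k
        self.latent_dim = latent_dim
        self.reset(rng)

    def reset(self, rng: np.random.Generator) -> None:
        """Resample the k random directions; they stay fixed until the next reset."""
        self.directions = rng.normal(size=(self.k, self.latent_dim))
        self.directions /= np.linalg.norm(self.directions, axis=1, keepdims=True)

    def perturb(self, z: np.ndarray, scalar_actions: np.ndarray) -> np.ndarray:
        """Apply the k continuous scalar superpower actions to the latent state z."""
        assert scalar_actions.shape == (self.k,)
        # Each scalar scales its fixed random direction; the sum shifts the latent.
        return z + scalar_actions @ self.directions
```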
During “imagination mode”, the agent chooses an action according to its policy function, and the world model plus the hard-coded superpower perturbation shows the consequences of the action in the WM latent state. We use this to do a bunch of n-step rollouts for bootstrapping: feed each rollout into the (aligned)[1] utility function, and use the resulting improved policy estimate to update the policy function.
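And a rough sketch of that bootstrapping step, building on the ImaginationMode sketch above. All of the callables (policy, world_model_step, utility, update_policy) are placeholders I’m assuming exist, and the policy is assumed to be stochastic so that repeated rollouts from the same starting latent actually explore different action sequences.

```python
def imagined_rollout(z0, policy, world_model_step, imagination, utility,
                     n_steps: int = 5, discount: float = 0.99) -> float:
    """Roll the world model forward n steps purely in imagination and score it."""
    z, ret, weight = z0, 0.0, 1.0
    for _ in range(n_steps):
        normal_action, superpower_scalars = policy(z)    # policy assumed stochastic
        z = world_model_step(z, normal_action)           # ordinary learned dynamics
        z = imagination.perturb(z, superpower_scalars)   # hard-coded superpower shift
        ret += weight * utility(z)                       # aligned utility on WM latent
        weight *= discount
    return ret

def bootstrap_policy_update(z0, policy, world_model_step, imagination, utility,
                            update_policy, num_rollouts: int = 16, n_steps: int = 5):
    """Run several imagined n-step rollouts and feed their scores to a policy update."""
    scores = [imagined_rollout(z0, policy, world_model_step, imagination,
                               utility, n_steps) for _ in range(num_rollouts)]
    # `update_policy` stands in for whatever policy-improvement rule is used
    # (e.g. a policy-gradient step weighted by the rollout scores).
    update_policy(z0, scores)
```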
Because the action space is changing and randomly limited, the policy function will learn to test out and choose superpowered actions based on their consequences, which will force it to learn an approximation of the value of those consequences. And because the superpowered actions aren’t always available, it will also have to learn normal capabilities simultaneously.
Applying this method to model-based RL requires that we have an aligned utility function on the world-model latent state: WM-state/sequence → R. We came up with this method when thinking about how to address inner misalignment 1 in Finding Goals in the World Model (misalignment between the policy function and the aligned utility function).