A paper which does for deceptive alignment what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).
If I’m understanding this one right OpenAI did something similar to this for purely pragmatic reasons with VPT, a minecraft agent. They first trained a “foundation model” to imitate humans playing. Then, since their chosen task was to craft a diamond pickaxe, they finetuned it with RL, giving it reward for each step along the path to a diamond pickaxe that it successfully took. There’s a problem with this approach:
”A major problem when fine-tuning with RL is catastrophic forgetting because previously learned skills can be lost before their value is realized. For instance, while our VPT foundation model never exhibits the entire sequence of behaviors required to smelt iron zero-shot, it did train on examples of players smelting with furnaces. It therefore may have some latent ability to smelt iron once the many prerequisites to do so have been performed. To combat the catastrophic forgetting of latent skills such that they can continually improve exploration throughout RL fine-tuning, we add an auxiliary Kullback-Leibler (KL) divergence loss between the RL model and the frozen pretrained policy.”
In other words they first trained it to imitate humans, and saved this model; after that they trained the agent to maximize reward, but used the saved model and penalized it for deviating too far from what that human-imitator model would do. Basically saying “try to maximize reward but don’t get too weird about it.”
A paper which does for deceptive alignment what the goal misgeneralization paper does for inner alignment, i.e. describing it in ML language and setting up toy examples (for example, telling GPT-3 to take actions which minimize changes in its weights, given that it’s being trained using actor-critic RL with a certain advantage function, and seeing if it knows how to do so).
If I’m understanding this one right OpenAI did something similar to this for purely pragmatic reasons with VPT, a minecraft agent. They first trained a “foundation model” to imitate humans playing. Then, since their chosen task was to craft a diamond pickaxe, they finetuned it with RL, giving it reward for each step along the path to a diamond pickaxe that it successfully took. There’s a problem with this approach:
”A major problem when fine-tuning with RL is catastrophic forgetting because previously learned skills can be lost before their value is realized. For instance, while our VPT foundation model never exhibits the entire sequence of behaviors required to smelt iron zero-shot, it did train on examples of players smelting with furnaces. It therefore may have some latent ability to smelt iron once the many prerequisites to do so have been performed. To combat the catastrophic forgetting of latent skills such that they can continually improve exploration throughout RL fine-tuning, we add an auxiliary Kullback-Leibler (KL) divergence loss between the RL model and the frozen pretrained policy.”
In other words they first trained it to imitate humans, and saved this model; after that they trained the agent to maximize reward, but used the saved model and penalized it for deviating too far from what that human-imitator model would do. Basically saying “try to maximize reward but don’t get too weird about it.”