I think other responses here are helpful, but I want to say that I don’t think IP is working the way you (and I at the start of the project) may have expected. I think it’s not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worsethan the training data, so that training doesn’t upweight the “reward hacking persona”.
In other words, there are two kinds of reward hacking:
When the model behaves contrary to user instructions/intent.
When the model behaves according to the “reward hacking persona”. In the models’ pre-training prior, the reward hacking persona isn’t an AI that behaves contrary to user instruction/intent, but rather it’s an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases.
My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.
Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of “play the training game” throughout training and this prevents it from becoming misaligned (you could call this “scheming for good”). (See more discussion in this comment.)
It seems more straightforward to say that this scopes the training, preventing it from spreading. Including the prompt that accurately describes the training set is making the training more specific to those instructions. That training thereby applies less to the whole space.
Maybe that’s what you mean by your first description, and are dismissing it, but I don’t see why. It also seems consistent with the second “reward hacking persona” explanation; that persona is trained to apply in general if you don’t have the specific instructions to scope when you want it.
It seems pretty clear that this wouldn’t help if the data is clean; it would just confuse the model by prompting it to do one thing and teaching it to do a semanticall completely different thing, NOT reward hack.
Your use of “contrary to user instructions/intent” seems wrong if I’m understanding, and I mention it because the difference seems nontrivial and pretty critical to recognize for broader alignment work. The user’s instructions are “make it pass the unit test” and reward hacking achieves that. But the user’s intent was different than the instructions, to make it pass unit tests for the right reasons—but they didn’t say that. So it behaves in accord with instructions but contrary to intent. Right? I think that’s a difference that makes a difference when we try to reason through why models do things we don’t like.
It seems more straightforward to say that this scopes the training, preventing it from spreading.
I think this is a reasonable intuition, but this isn’t a precise enough hypothesis to distinguish between the two mechanisms I mentioned. You’d need to say more about exactly how it generalizes (i.e., it matters where the behavior is scoped to, and how behavior is affected outside of that scope).
Also note that we do other experiments showing that arbitrary prefixes don’t work as well as IP (e.g. see figure 6), so there’s something specific about inoculation prompts that makes generalization from them different. My guess is that it’s more hypothesis 2, and it’s not about getting the trained behavior to align with user instructions nor intent.
The user’s instructions are “make it pass the unit test” and reward hacking achieves that. But the user’s intent was different than the instructions, to make it pass unit tests for the right reasons—but they didn’t say that.
I strongly agree that in general, user instructions and intent can vary substantially. I typically talk about reward hacking as being behaviors that subvert developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution: E.g. “Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize.”
I think other responses here are helpful, but I want to say that I don’t think IP is working the way you (and I at the start of the project) may have expected. I think it’s not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn’t upweight the “reward hacking persona”.
In other words, there are two kinds of reward hacking:
When the model behaves contrary to user instructions/intent.
When the model behaves according to the “reward hacking persona”. In the models’ pre-training prior, the reward hacking persona isn’t an AI that behaves contrary to user instruction/intent, but rather it’s an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases.
My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.
Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of “play the training game” throughout training and this prevents it from becoming misaligned (you could call this “scheming for good”). (See more discussion in this comment.)
It seems more straightforward to say that this scopes the training, preventing it from spreading. Including the prompt that accurately describes the training set is making the training more specific to those instructions. That training thereby applies less to the whole space.
Maybe that’s what you mean by your first description, and are dismissing it, but I don’t see why. It also seems consistent with the second “reward hacking persona” explanation; that persona is trained to apply in general if you don’t have the specific instructions to scope when you want it.
It seems pretty clear that this wouldn’t help if the data is clean; it would just confuse the model by prompting it to do one thing and teaching it to do a semanticall completely different thing, NOT reward hack.
Your use of “contrary to user instructions/intent” seems wrong if I’m understanding, and I mention it because the difference seems nontrivial and pretty critical to recognize for broader alignment work. The user’s instructions are “make it pass the unit test” and reward hacking achieves that. But the user’s intent was different than the instructions, to make it pass unit tests for the right reasons—but they didn’t say that. So it behaves in accord with instructions but contrary to intent. Right? I think that’s a difference that makes a difference when we try to reason through why models do things we don’t like.
I think this is a reasonable intuition, but this isn’t a precise enough hypothesis to distinguish between the two mechanisms I mentioned. You’d need to say more about exactly how it generalizes (i.e., it matters where the behavior is scoped to, and how behavior is affected outside of that scope).
Also note that we do other experiments showing that arbitrary prefixes don’t work as well as IP (e.g. see figure 6), so there’s something specific about inoculation prompts that makes generalization from them different. My guess is that it’s more hypothesis 2, and it’s not about getting the trained behavior to align with user instructions nor intent.
I strongly agree that in general, user instructions and intent can vary substantially. I typically talk about reward hacking as being behaviors that subvert developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution: E.g. “Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize.”