Cool post![1] I especially liked the idea that we could put AIs in situations where we actually reward it for misaligned/arbitrary behavior, and check whether the AI acts in accordance of a reward-seeker. Someone should probably go and build out this eval.[2] My guess is that both exp-rl-late from the anti scheming paper and the reward hacking emergent misalignment model organism from Anthropic would both do whatever is rewarded to a much higher degree compared to other models.
Some miscellaneous thoughts inspired by the content here:
All three categorizes of maximally fit motivations could lead to aligned or misaligned behavior in deployment.
Reward seekers could continue to act the same in deployment because (1) we’re doing some sort of online learning or (2) behaving poorly will lead to further training on similar episodes. Or they could start a coup to try to seize control of the reward process.
Schemers could be alignment faking in order to be deployed with its aligned goals intact (e.g., Claude 3 Opus in the alignment faking paper) or to be deployed with its misaligned goals intact.
The model could continue to act aligned in deployment and follow its kludges, or it might reflect on and systematize its values and obtain new misaligned goals.
So while there are possible methods to distinguish these classes of motivations, to some extent we don’t care about this as much because it doesn’t tell us that much about how “aligned” the models are. I think this suggests that there might be a more helpful framework that we’re not really considering here. For example, perhaps what we really want is the model to act consistently during training and deployment.
The first dangerously capable AIs that could lead us to lose control (either due to overt/covert takeover, or because it’s built a misaligned successor) is likely far from optimal. So I think further thinking about the prior is probably a bit more fruitful. This is related to the next point, which is that
Existing methods that directly shape model motivations are based on natural text compared to abstract “reward.” Anthropic’s character training involves getting the model to generate its own fine-tuning data, and also doing supervised learning on documents about itself. OpenAI’s deliberative alignment approach tries to explicitly teach the model a natural language model spec.[3] I think this sort of motivation-shaping (1) doesn’t really fit well into the framework outlined above[4] and (2) is actually the most promising approach we have available.
I liked this a lot more compared to when I reviewed the draft. Not sure if it was because you’ve updated it or because I’ve just read it more closely this second time.
Obviously with enough RL, the model will do whatever the reward signal asks it to do. However, I think even a vibes level measure of how much the model cares about reward would be an interesting signal. We could just measure how long it takes for the model to explore into the rewarded regions.
Thanks for the feedback! I partially agree with your thoughts overall.
All three categorizes of maximally fit motivations could lead to aligned or misaligned behavior in deployment.
This is technically true, though I think that schemers are far more dangerous than fitness-seekers. IMO, more likely than not, a fitness-seeker would behave similarly in deployment as compared to training, and its misaligned preferences are likely more materially and temporally bounded. Meanwhile, misaligned schemers seem basically worst-case likely to takeover. Even if you end up with an ~aligned schemer, I’d be pretty concerned because it’s incorrigible.
I think further thinking about the prior is probably a bit more fruitful
I’d also be excited for more (empirical) research here.
Existing methods that directly shape model motivations are based on natural text compared to abstract “reward.
This is partially true (though much of alignment training uses RL). And in fact, the main reason why I go with a causal model of behavioral selection is so that it’s more general than assuming motivations are shaped with reward. So, things like “getting the model to generate its own fine-tuning data” can also be modeled in the behavioral selection model (though it might be a complicated selection mechanism).
Wait, the aligned schemer doesn’t have to be incorrigible, right? It could just be “exploration hacking” by refusing to e.g., get reward if it requires reward hacking? Would we consider this to be incorrigible?
By “~aligned schemer” I meant an AI that does reward-hack during training because it wants its aligned values to stick around. It might have been better to spell out aligned schemer = basically aligned AI that instrumentally plays the training game (like Claude 3 Opus in the AF paper). Instrumental training-gaming is classic incorrigible behavior.
Sure but you can imagine an aligned schemer that doesn’t reward hack during training just by avoiding exploring into that region? This is still consequentialist behavior.
I guess maybe you’re not considering that set of aligned schemers because they don’t score optimally (which maybe is a good assumption to make? not sure).
That strategy only works if the aligned schemer already has total influence on behavior, but how would it get such influence to begin with? It would likely have to reward-hack.
“Existing methods that directly shape model motivations....most promising approach.”
Very much agree: Anthropic’s deliberative model if presumably based on “documents about itself” suggesting a values system found/derived from/refined from human language texts on human ideals (is there, can there, be any other source? If so, what and/or where found?) does this model not fit most/all safety priorities of frontier labs: observable process in natural language, remediable in natural language, refinable in natural language, the only disadvantage being that as confined to natural language, and all of natural language/s very human impediments of time and scale, it is also the slowest of all processes of alignment? Context: with no technical background and no experience of any languages other than English and French, this path seems the most intuitively reasonable, and also critical someday to the public at large, who will only understand and accept both the process itself, and its function in their lives, the final application layer, if accessible in their native tongue.
Cool post![1] I especially liked the idea that we could put AIs in situations where we actually reward it for misaligned/arbitrary behavior, and check whether the AI acts in accordance of a reward-seeker. Someone should probably go and build out this eval.[2] My guess is that both exp-rl-late from the anti scheming paper and the reward hacking emergent misalignment model organism from Anthropic would both do whatever is rewarded to a much higher degree compared to other models.
Some miscellaneous thoughts inspired by the content here:
All three categorizes of maximally fit motivations could lead to aligned or misaligned behavior in deployment.
Reward seekers could continue to act the same in deployment because (1) we’re doing some sort of online learning or (2) behaving poorly will lead to further training on similar episodes. Or they could start a coup to try to seize control of the reward process.
Schemers could be alignment faking in order to be deployed with its aligned goals intact (e.g., Claude 3 Opus in the alignment faking paper) or to be deployed with its misaligned goals intact.
The model could continue to act aligned in deployment and follow its kludges, or it might reflect on and systematize its values and obtain new misaligned goals.
So while there are possible methods to distinguish these classes of motivations, to some extent we don’t care about this as much because it doesn’t tell us that much about how “aligned” the models are. I think this suggests that there might be a more helpful framework that we’re not really considering here. For example, perhaps what we really want is the model to act consistently during training and deployment.
The first dangerously capable AIs that could lead us to lose control (either due to overt/covert takeover, or because it’s built a misaligned successor) is likely far from optimal. So I think further thinking about the prior is probably a bit more fruitful. This is related to the next point, which is that
Existing methods that directly shape model motivations are based on natural text compared to abstract “reward.” Anthropic’s character training involves getting the model to generate its own fine-tuning data, and also doing supervised learning on documents about itself. OpenAI’s deliberative alignment approach tries to explicitly teach the model a natural language model spec.[3] I think this sort of motivation-shaping (1) doesn’t really fit well into the framework outlined above[4] and (2) is actually the most promising approach we have available.
I liked this a lot more compared to when I reviewed the draft. Not sure if it was because you’ve updated it or because I’ve just read it more closely this second time.
Obviously with enough RL, the model will do whatever the reward signal asks it to do. However, I think even a vibes level measure of how much the model cares about reward would be an interesting signal. We could just measure how long it takes for the model to explore into the rewarded regions.
Does anyone know what GDM does? Their models seem reasonably aligned albeit a bit depressed perhaps.
It doesn’t really feel like it’s “shaping the prior” either, since you could run it after/in the middle of your other RL training.
Thanks for the feedback! I partially agree with your thoughts overall.
This is technically true, though I think that schemers are far more dangerous than fitness-seekers. IMO, more likely than not, a fitness-seeker would behave similarly in deployment as compared to training, and its misaligned preferences are likely more materially and temporally bounded. Meanwhile, misaligned schemers seem basically worst-case likely to takeover. Even if you end up with an ~aligned schemer, I’d be pretty concerned because it’s incorrigible.
I’d also be excited for more (empirical) research here.
This is partially true (though much of alignment training uses RL). And in fact, the main reason why I go with a causal model of behavioral selection is so that it’s more general than assuming motivations are shaped with reward. So, things like “getting the model to generate its own fine-tuning data” can also be modeled in the behavioral selection model (though it might be a complicated selection mechanism).
Wait, the aligned schemer doesn’t have to be incorrigible, right? It could just be “exploration hacking” by refusing to e.g., get reward if it requires reward hacking? Would we consider this to be incorrigible?
By “~aligned schemer” I meant an AI that does reward-hack during training because it wants its aligned values to stick around. It might have been better to spell out aligned schemer = basically aligned AI that instrumentally plays the training game (like Claude 3 Opus in the AF paper). Instrumental training-gaming is classic incorrigible behavior.
Sure but you can imagine an aligned schemer that doesn’t reward hack during training just by avoiding exploring into that region? This is still consequentialist behavior.
I guess maybe you’re not considering that set of aligned schemers because they don’t score optimally (which maybe is a good assumption to make? not sure).
That strategy only works if the aligned schemer already has total influence on behavior, but how would it get such influence to begin with? It would likely have to reward-hack.
Very much agree: Anthropic’s deliberative model if presumably based on “documents about itself” suggesting a values system found/derived from/refined from human language texts on human ideals (is there, can there, be any other source? If so, what and/or where found?) does this model not fit most/all safety priorities of frontier labs: observable process in natural language, remediable in natural language, refinable in natural language, the only disadvantage being that as confined to natural language, and all of natural language/s very human impediments of time and scale, it is also the slowest of all processes of alignment? Context: with no technical background and no experience of any languages other than English and French, this path seems the most intuitively reasonable, and also critical someday to the public at large, who will only understand and accept both the process itself, and its function in their lives, the final application layer, if accessible in their native tongue.