Thanks for the feedback! I partially agree with your thoughts overall.
All three categories of maximally fit motivations could lead to aligned or misaligned behavior in deployment.
This is technically true, though I think that schemers are far more dangerous than fitness-seekers. IMO, more likely than not, a fitness-seeker would behave similarly in deployment to how it behaved in training, and its misaligned preferences are likely more materially and temporally bounded. Meanwhile, misaligned schemers seem basically worst-case likely to take over. Even if you end up with an ~aligned schemer, I’d be pretty concerned because it’s incorrigible.
I think further thinking about the prior is probably a bit more fruitful.
I’d also be excited for more (empirical) research here.
Existing methods that directly shape model motivations are based on natural text compared to abstract “reward.”
This is partially true (though much of alignment training uses RL). And in fact, the main reason I go with a causal model of behavioral selection is that it’s more general than assuming motivations are shaped by reward. So things like “getting the model to generate its own fine-tuning data” can also be modeled in the behavioral selection model (though it might be a complicated selection mechanism).
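To make that generality point a bit more concrete, here’s a minimal toy sketch (not from the post; all names like `reward_selection` and `self_generated_data_selection` are hypothetical) of how both reward-shaped training and “model curates its own fine-tuning data” can be treated as instances of a single selection-mechanism interface:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: a "behavior" is just a (prompt, response) pair here.
@dataclass
class Behavior:
    prompt: str
    response: str

# A selection mechanism maps a pool of candidate behaviors to the subset
# that gets reinforced (i.e., used to update the model). Both RL-style
# reward selection and self-generated fine-tuning data fit this signature.
SelectionMechanism = Callable[[List[Behavior]], List[Behavior]]

def reward_selection(reward_fn: Callable[[Behavior], float],
                     threshold: float) -> SelectionMechanism:
    """Reward-shaped selection: keep behaviors scoring above a threshold."""
    def select(pool: List[Behavior]) -> List[Behavior]:
        return [b for b in pool if reward_fn(b) > threshold]
    return select

def self_generated_data_selection(model_filter: Callable[[Behavior], bool]) -> SelectionMechanism:
    """Selection via self-generated fine-tuning data: the model itself
    decides which of its outputs become training examples."""
    def select(pool: List[Behavior]) -> List[Behavior]:
        return [b for b in pool if model_filter(b)]
    return select
```

Under this framing, the question of which motivations get selected for is about the composition of selection mechanisms applied over training, not about any single scalar reward signal.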