primarily because models will understand the base goal first, before having world-modeling capabilities
Could you say a bit more about why you think this? My definitely-not-expert expectation would be that the world-modeling would come first, then the “what does the overseer want” after that, because that’s how the current training paradigm works: pretrain for general world understanding, then finetune on what you actually want the model to do.
Admittedly, I got that from the “Deceptive alignment is <1% likely” post.
Even if you don’t believe that post, Pretraining from Human Preferences shows that instilling alignment with human values as a base goal first, before the model gains world-modeling capabilities (thus outer-aligning it), works wonders for alignment and has many benefits compared to RLHF.
Given its low alignment tax, I suspect there’s a 50-70% chance that this plan, or a successor to it, will be adopted for alignment.
Here’s the post:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
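For concreteness, the paper’s best-performing objective, conditional training, works roughly like this: score each segment of the pretraining corpus with a reward model, prepend a <|good|> or <|bad|> control token, and train with the ordinary next-token loss; at inference time you condition on <|good|>. Here is a minimal sketch of that idea; the model name, reward values, and threshold are illustrative assumptions, not the paper’s exact setup:

```python
# Minimal sketch of conditional training from "Pretraining Language Models
# with Human Preferences" (Korbak et al., 2023). The base model, reward
# scores, and threshold here are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

GOOD, BAD = "<|good|>", "<|bad|>"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": [GOOD, BAD]})

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # room for the new tokens


def tag(text: str, reward: float, threshold: float = 0.0) -> str:
    """Prepend <|good|> or <|bad|> based on the segment's reward score."""
    return (GOOD if reward >= threshold else BAD) + text


# Pretraining step: the loss is the usual causal-LM loss, just computed on
# tagged text, so the model learns which behaviours each token marks.
batch = tokenizer(tag("I can help with that safely.", reward=0.9),
                  return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # one illustrative gradient step

# Inference: condition on <|good|> so generation imitates the high-reward
# distribution the model associated with that token during pretraining.
prompt = tokenizer(GOOD + "The assistant", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The point of doing this at pretraining time rather than via RLHF afterward is exactly the ordering argument above: the preference signal shapes the base objective from the start instead of being bolted on after the model has already built its world model.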