primarily because models will understand the base goal first, before having world-modeling capabilities
Could you say a bit more about why you think this? My definitely-not-expert expectation would be that the world-modeling would come first, then the “what does the overseer want” after that, because that’s how the current training paradigm works: pretrain for general world understanding, then finetune on what you actually want the model to do.
Admittedly, I got that from the “Deceptive alignment is <1% likely” post.
Even if you don’t believe that post, Pretraining from Human Preferences shows that instilling alignment with human values as a base goal first, before the model gains world-modeling capabilities (thus outer-aligning it), works wonders for alignment and has many benefits compared to RLHF.
Given its low alignment tax, I suspect there’s a 50-70% chance that this plan, or a successor to it, will be adopted for alignment.
Here’s the post:
https://www.lesswrong.com/posts/8F4dXYriqbsom46x5/pretraining-language-models-with-human-preferences
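For concreteness, the paper’s best-performing objective, conditional training, works roughly like this: score each segment of the pretraining corpus with a reward model, prepend a <|good|> or <|bad|> control token, and train with the ordinary next-token loss; at inference time you condition on <|good|>. Here is a minimal sketch of that idea; the model name, reward values, and threshold are illustrative assumptions, not the paper’s exact setup:

```python
# Minimal sketch of conditional training from "Pretraining Language Models
# with Human Preferences" (Korbak et al., 2023). The base model, reward
# scores, and threshold here are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

GOOD, BAD = "<|good|>", "<|bad|>"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": [GOOD, BAD]})

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # room for the new tokens


def tag(text: str, reward: float, threshold: float = 0.0) -> str:
    """Prepend <|good|> or <|bad|> based on the segment's reward score."""
    return (GOOD if reward >= threshold else BAD) + text


# Pretraining step: the loss is the usual causal-LM loss, just computed on
# tagged text, so the model learns which behaviours each token marks.
batch = tokenizer(tag("I can help with that safely.", reward=0.9),
                  return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # one illustrative gradient step

# Inference: condition on <|good|> so generation imitates the high-reward
# distribution the model associated with that token during pretraining.
prompt = tokenizer(GOOD + "The assistant", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The point of doing this at pretraining time rather than via RLHF afterward is exactly the ordering argument above: the preference signal shapes the base objective from the start instead of being bolted on after the model has already built its world model.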