I think of “light RLHF” as “RLHF which doesn’t teach the model qualitatively new things, but instead just steers the model at a high level”. In practice, a single round of DPO on fewer than 100,000 examples surely counts, but I’m unsure where exactly the boundary lies.
(In principle, a small amount of RL can update a model very far, but I don’t think we see this in practice.)
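For concreteness, “a single round of DPO” refers to optimizing the standard DPO objective, which trains the policy directly on preference pairs against a frozen reference model. A minimal sketch of the per-pair loss (the function name and scalar log-probability inputs are illustrative, not from any particular library):

```python
import math

def dpo_loss(beta, logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """DPO loss for a single preference pair.

    logp_* are summed log-probabilities of the chosen/rejected completions
    under the policy being trained; ref_logp_* are the same quantities under
    the frozen reference (pretrained/SFT) model. beta controls how strongly
    the policy is pulled away from the reference.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the reward margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy still matches the reference, the margin is 0 and the
# loss is log(2) ≈ 0.693 for every pair.
print(round(dpo_loss(0.1, -5.0, -7.0, -5.0, -7.0), 4))
```

The point relevant here is that the loss only depends on how far the policy’s log-probabilities have moved relative to the reference, which is why a small `beta` and a modest dataset tend to steer rather than teach.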
Note that even if RLHF takes a non-scheming pretrained model and converts it into a schemer, if this RLHF doesn’t teach the model new things, then we can always train another copy of the pretrained model to be a monitor for the RLHF’d model, and this monitor will be just as competent as the RLHF’d model. Scheming therefore seems like substantially less of a problem in this case. (To get safety properties, we’d need to apply this monitor to all potentially dangerous actions.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with the addition that you deploy the reward model as a monitor, which is required for any interesting guarantees.)
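The deployment setup described above can be sketched as a simple gating loop; all names here are hypothetical, and the key (assumed) property is that the monitor, trained from the same pretrained weights, is just as competent as the RLHF’d policy it oversees:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GatedPolicy:
    """Gate every potentially dangerous action from the RLHF'd policy
    on the judgment of a monitor trained from a fresh copy of the same
    pretrained model (illustrative sketch, not a real API)."""
    policy: Callable[[str], str]           # the RLHF'd model: task -> proposed action
    monitor: Callable[[str, str], float]   # pretrained-model copy: (task, action) -> danger score
    threshold: float = 0.5                 # hypothetical cutoff for blocking

    def act(self, task: str) -> str:
        action = self.policy(task)
        # By assumption, RLHF taught the policy nothing new, so the monitor
        # is as competent at judging this action as the policy was at
        # producing it -- a scheming policy can't systematically outwit it.
        if self.monitor(task, action) >= self.threshold:
            return "<action blocked; escalate to human review>"
        return action
```

Usage is just wrapping the deployed policy so that no potentially dangerous action reaches the world without the monitor’s sign-off; the guarantee comes from gating *all* such actions, not from spot checks.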
There is empirical evidence that this is indeed the case for DPO.
Also see Interpreting the learning of deceit for another proposal/research agenda to deal with this threat model.
On a quick skim, I think that proposal makes additional assumptions that seem pretty uncertain.