Note that even if RLHF takes a non-scheming pretrained model and converts it into a schemer, as long as the RLHF doesn't teach the model new things, we can always train another copy of the pretrained model to act as a monitor for the RLHF'd model, and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (To get the safety properties, we'd need to run this monitor on all potentially dangerous actions.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with the addition that you deploy the reward model as a monitor, which is required for any interesting guarantees.)
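Here's a minimal sketch of the deployment pattern described above. Everything in it is a hypothetical placeholder (`Model`, `rlhf_finetune`, `train_monitor`, `flag_suspicious` are stand-ins, not real APIs); the point it illustrates is just that the monitor is trained from the same pretrained checkpoint as the policy and gates every potentially dangerous action:

```python
from copy import deepcopy

class Model:
    """Toy stand-in for a language model; everything here is hypothetical."""
    def propose_action(self, task):
        return f"action for {task}"
    def flag_suspicious(self, action):
        # A real monitor would score the action for danger; this stub always passes.
        return False

def rlhf_finetune(model):
    # Stand-in for RLHF fine-tuning (assumed not to teach the model new things).
    return deepcopy(model)

def train_monitor(model):
    # Stand-in for training a fresh copy of the same pretrained model as a monitor.
    return deepcopy(model)

pretrained = Model()
policy = rlhf_finetune(pretrained)   # possibly converted into a schemer by RLHF
monitor = train_monitor(pretrained)  # equally competent, since RLHF added no new skills

def run_with_monitoring(task):
    action = policy.propose_action(task)
    # Every potentially dangerous action is gated by the monitor before execution.
    if monitor.flag_suspicious(action):
        raise RuntimeError("Monitor flagged the action; escalate for human review")
    return action

print(run_with_monitoring("some potentially dangerous task"))
```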
I'd personally like to see this written up in more detail (or a reference). Also, is it Appendix G of the weak-to-strong generalization paper? I looked briefly and it didn't seem very related.
No write-up currently exists, to my understanding. I might write this up as part of a broader project expanding on various points about scheming.