Note that even if RLHF takes a non-scheming pretrained model and converts it into a schemer, as long as the RLHF doesn't teach the model new things, we can always train another copy of the pretrained model to act as a monitor for the RLHF'd model, and this monitor will be just as competent as the RLHF'd model. So scheming seems like substantially less of a problem in this case. (To get the safety properties, we'd need to run this monitor on all potentially dangerous actions.) (This is similar to the proposal in Appendix G of the weak-to-strong generalization paper, but with the addition that you deploy the reward model as a monitor, which is required for any interesting guarantees.)
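Here's a minimal sketch of the deployment pattern described above. Everything in it is a hypothetical placeholder (`Model`, `rlhf_finetune`, `train_monitor`, `flag_suspicious` are stand-ins, not real APIs); the point it illustrates is just that the monitor is trained from the same pretrained checkpoint as the policy and gates every potentially dangerous action:

```python
from copy import deepcopy

class Model:
    """Toy stand-in for a language model; everything here is hypothetical."""
    def propose_action(self, task):
        return f"action for {task}"
    def flag_suspicious(self, action):
        # A real monitor would score the action for danger; this stub always passes.
        return False

def rlhf_finetune(model):
    # Stand-in for RLHF fine-tuning (assumed not to teach the model new things).
    return deepcopy(model)

def train_monitor(model):
    # Stand-in for training a fresh copy of the same pretrained model as a monitor.
    return deepcopy(model)

pretrained = Model()
policy = rlhf_finetune(pretrained)   # possibly converted into a schemer by RLHF
monitor = train_monitor(pretrained)  # equally competent, since RLHF added no new skills

def run_with_monitoring(task):
    action = policy.propose_action(task)
    # Every potentially dangerous action is gated by the monitor before execution.
    if monitor.flag_suspicious(action):
        raise RuntimeError("Monitor flagged the action; escalate for human review")
    return action

print(run_with_monitoring("some potentially dangerous task"))
```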
I'd personally like to see this written up in more detail (or a reference). Also, is it Appendix G of the weak-to-strong generalization paper? I looked briefly and it didn't seem very related.
No write-up currently exists, to my understanding. I might write this up as part of a broader project expanding on various points about scheming.