I share a pretty similar model, though I don’t think I share your likelihood assessments (“RLHF is likely to teach GPT the use of steganography.”, “This is very likely to make the chains of thought of your model opaque because if the model trained with RLHF learned steganography, your model will need to learn it in order to predict what it will say.”, “Deceiving them if they can watch every thought of yours right from the beginning would probably be extremely challenging”). That may just stem from having a different set of intuitions about what RL(HF) does and about how the sample + broadcast mechanism interacts with circuits.