I don’t fully understand RLHF, so let me just put my idea out there and someone can tell me how I am confused.
Assume that RLHF can in fact “align” an otherwise dangerous system. Here’s my question: You train the system, and then you start applying RLHF, right? Doesn’t the dangerous system already exist, then, before the RLHF is applied? If so, isn’t that dangerous?
Anyway I don’t know any technical details so I assume I simply don’t understand how these things work.
For the general idea of RLHF, no, the dangerous system never has to exist: the reward model can be learned in tandem with improvements in the AI's capabilities, as in the original paper (Christiano et al. 2017, "Deep Reinforcement Learning from Human Preferences").
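To make the "in tandem" point concrete, here's a toy sketch (my own illustration, not the paper's actual algorithm; every number and function here is made up). A "human" preference signal and a learned reward estimate co-evolve with the policy, so there is never a moment where a fully capable policy exists without a reward model shaping it:

```python
import random

# Toy illustration of an interleaved RLHF-style loop, in the spirit of
# learning the reward model alongside the policy. Hypothetical throughout:
# the "policy" is just a single number, and the "human" secretly prefers 0.7.

random.seed(0)

def true_human_preference(behavior):
    # Hidden human judgment, only ever used to compare pairs of behaviors.
    return -(behavior - 0.7) ** 2

def learned_reward(behavior, target):
    # The reward model: an estimate of which behavior value humans prefer.
    return -(behavior - target) ** 2

reward_model_target = 0.0  # the reward model's current estimate
policy = 0.0               # the policy's current behavior

for step in range(1000):
    # 1) Sample two behaviors from the current policy.
    a = policy + random.gauss(0, 0.1)
    b = policy + random.gauss(0, 0.1)
    # 2) The "human" labels which of the two they prefer.
    preferred = a if true_human_preference(a) > true_human_preference(b) else b
    # 3) Update the reward model toward the preferred behavior.
    reward_model_target += 0.05 * (preferred - reward_model_target)
    # 4) Improve the policy against the *learned* reward (hill climbing).
    candidate = policy + random.gauss(0, 0.05)
    if learned_reward(candidate, reward_model_target) > learned_reward(policy, reward_model_target):
        policy = candidate

print(round(policy, 2))  # policy ends up near the human-preferred value
```

The point of the interleaving is that capability (step 4) only ever advances against the current reward model (step 3), so the policy's competence and its alignment signal grow together.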
But for RLHF on large language models, like what OpenAI is doing a lot of, yes: there will be an un-RLHFed system that's very smart. That system isn't necessarily dangerous, though. In the case of an LLM, the base model would probably just be very good at predicting the next word. RLHF fine-tuning then cuts both ways: it makes the model "safer" in the sense of producing text that humans rate positively, but also "more dangerous" in the sense of producing text that's better at achieving objectives in the real world.
Tyty very helpful and illuminating