For the general idea of RLHF, no, the dangerous system doesn't ever have to exist: the reward model can be learned in tandem with improvements in the AI's capabilities, as in the original paper.
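To make the "in tandem" point concrete, here is a minimal sketch (not the paper's implementation; the toy 1-D environment, the learning rates, and all function names are my own assumptions) of interleaving reward-model updates from preference comparisons with policy improvement, so the reward model is only ever fit to behaviors the still-weak policy can currently produce:

```python
# Sketch: alternate (1) collecting behavior pairs from the current policy,
# (2) updating a reward model from a simulated human preference, and
# (3) improving the policy against that still-evolving reward model.
import numpy as np

rng = np.random.default_rng(0)

def rollout(policy_param):
    """Sample an 'action' from a toy Gaussian policy."""
    return policy_param + rng.normal()

def simulated_human_prefers(a, b):
    """Stand-in for a human comparator (prefers actions near 1.0)."""
    return abs(a - 1.0) < abs(b - 1.0)

# Toy reward model r(a) = -(a - c)^2, parameterized by its centre c.
c = 0.0
policy_param = -2.0          # the policy starts out weak

for step in range(200):
    # 1) Two behaviors from the current, still-improving policy.
    a, b = rollout(policy_param), rollout(policy_param)

    # 2) Simulated human labels which one is better.
    preferred = a if simulated_human_prefers(a, b) else b

    # 3) Nudge the reward model toward rating the preferred behavior higher
    #    (a crude stand-in for fitting a Bradley-Terry preference model).
    c += 0.05 * (preferred - c)

    # 4) Small policy-improvement step against the current reward model:
    #    gradient of r(a) = -(a - c)^2 at a = policy_param.
    policy_param += 0.05 * (-2.0 * (policy_param - c))

print(f"reward-model centre ~ {c:.2f}, policy mean ~ {policy_param:.2f}")
```

The point of the interleaving is that at no step does a fully capable policy get optimized against a finished reward model; the two grow together.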
But for RLHF on large language models, like what OpenAI is doing a lot of, yes, there will be an un-RLHFed system that's very smart. That system wouldn't necessarily be dangerous, though: in the case of an LLM, the base model would probably just be very good at predicting the next word. RLHF fine-tuning would make it both "safer", in the sense of producing text positively rated by humans, and "more dangerous", in the sense of generating text better at achieving objectives in the real world.
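For concreteness, the usual fine-tuning objective pushes up the reward model's rating of sampled text while penalizing drift from the base next-word predictor. A toy numeric sketch (all numbers invented, for illustration only):

```python
# Per-sample quantity the RL step pushes up: reward minus a KL-style penalty
# for moving away from the base model's next-word distribution.
import numpy as np

base_logprob  = np.array([-2.3, -4.1, -0.7])   # base model log-probs of three sampled continuations
tuned_logprob = np.array([-2.0, -3.0, -1.5])   # fine-tuned policy's log-probs of the same continuations
reward_scores = np.array([0.2, 1.4, -0.3])     # reward model's human-preference scores
beta = 0.1                                     # penalty weight

objective = reward_scores - beta * (tuned_logprob - base_logprob)
print(objective)
```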