[Question] Beginner’s question about RLHF

I don’t fully understand RLHF (reinforcement learning from human feedback), so let me just put my idea out there and someone can tell me where I am confused.

Assume that RLHF can in fact “align” an otherwise dangerous system. Here’s my question: you train the system first, and only then start applying RLHF, right? Doesn’t that mean the dangerous system already exists before the RLHF is applied? If so, isn’t that dangerous?

Anyway, I don’t know any technical details, so I assume I simply don’t understand how these things work.