I think I’m just confused. Once a model exists, how do you “red-team” it to see whether it’s safe? Isn’t it already dangerous at that point?
I think you get the point, but say OpenAI “trains” GPT-5 and it turns out to be so dangerous that it can persuade anybody of anything, and it wants to destroy the world.
We’re already screwed at that point, right? Who cares if they decide not to release it to the public? And they can’t “RLHF” it after the fact, right? It’s already existentially dangerous?
I guess maybe I just don’t understand how it works. If they “train” GPT-5, does that mean they literally have no idea what it will say or be like until the day training is done? And then they go “Hey, what’s up?” and find out?