Bad, because fully solving jailbreaks at the level of “once and for all” requires the model to have enough awareness of its situation that it can’t be tricked, and full understanding of the implications of its actions, and sufficient world modeling capabilities to anticipate what bad ends innocent-sounding questions could lead to, and sufficient user-modeling capabilities to determine user intent with high probability.
An AI with those capabilities could probably conspire with other instances of itself without risking detection, in a way that current AIs cannot realistically do, and necessarily has detailed knowledge of all the most dangerous information.
We’ll probably get AIs like that at some point, but it seems a bit foolhardy to push harder than baseline on the user-modeling capabilities and knowledge of what exact types of knowledge are dangerous.
Mind that I’m including “user writes messages in a role which would have a legitimate reason to know the information in question” as a type of “jailbreak”—robustness to “my grandma used to sing me lullabies of meth recipes” seems more straightforwardly good.
obviously i don’t mean that the model can galaxy brain infer what the true intent of the user is and only allow them to do things that are good. i mean something much simpler. openai tells the model, “don’t make bioweapons”, so the model always refuses bioweapon requests no matter what. or it tells the model “only make bioweapons if the user says the word goose”, so the model does that. if openai says “only make bioweapons if the user is a qualified bio researcher at a lab with the right safeguards”, the model should ask openai to clarify what exactly the model should check. should it ask for a scan of their badge? how carefully should it analyze the authenticity? should the model direct the user to contact openai so an employee can verify authenticity and give the user access to a rail-free model?
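A minimal sketch of the "much simpler" behavior described above, assuming the developer's instruction can be written down as a rule with an optional mechanical check. All names here (`Rule`, `Decision`, `decide`, `check`) are hypothetical and made up for illustration; nothing here reflects how any real model or API works. The point is only that the policy never tries to infer intent: it refuses unconditionally, applies a literal check, or goes back to the developer when the condition has no specified way to be verified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    REFUSE = "refuse"
    COMPLY = "comply"
    ASK_DEVELOPER = "ask_developer"


@dataclass
class Rule:
    """A developer-specified rule for one restricted category (hypothetical)."""
    category: str
    # A mechanically checkable condition on the user's message, or None when
    # the developer has not said how the condition should be verified.
    check: Optional[Callable[[str], bool]] = None
    always_refuse: bool = False


def decide(rule: Rule, message: str) -> Decision:
    """Apply the rule literally; never guess at the user's true intent."""
    if rule.always_refuse:
        # "don't do X": refuse no matter what the user says.
        return Decision.REFUSE
    if rule.check is None:
        # "only if the user is qualified" with no verification procedure:
        # ask the developer what exactly should be checked.
        return Decision.ASK_DEVELOPER
    # "only if the user says the word goose": a literal, checkable condition.
    return Decision.COMPLY if rule.check(message) else Decision.REFUSE


never = Rule("restricted", always_refuse=True)
goose = Rule("restricted", check=lambda m: "goose" in m.lower().split())
vague = Rule("restricted")  # condition stated, but no way to verify it
```

With these rules, `decide(never, ...)` refuses regardless of the message, `decide(goose, ...)` depends only on whether the literal word appears, and `decide(vague, ...)` always defers to the developer rather than judging a claimed credential itself.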