Should the title be rephrased? The body of the post essentially addresses the scenario where OpenAI discovers (possibly via red-teaming) that their model is dangerous, and asks where they go from there. Which is a good question; it seems plausible that some people naively think they could just tweak the model somewhat and make it non-dangerous, which seems to me a recipe for disaster once people start jailbreaking it, or get a copy and start un-tweaking it. (A probably-better response would be to (a) lock it the fuck down [consider deleting it entirely], (b) possibly try to obfuscate the process that was used to create it, to obstruct others who’d try to reproduce your results, and (c) announce the danger, either to the world or specifically to law-enforcement types who’d work on preventing all AI labs from progressing any further.)
But the title asks “How do you red-team it to see whether it’s safe”, for which the straightforward answer is “Have people try to demonstrate dangerous capabilities with it”.
There are various ways this could play out, and which one you expect depends on your estimates of the likelihoods. If we have today’s level of safety measures, and a new model comes along with world-ending capabilities that we can’t detect, then, yeah, we don’t detect them and the model goes out and ends the world.
It’s possible, though, that before that happens, they’ll create a model that has dangerous (possibly world-ending, possibly just highly destructive) capabilities (like knowing how to cross-breed smallpox with COVID, or how to hack into most internet-attached computers and parlay that into all sorts of mayhem) but that isn’t good at concealing them. They’ll detect this and announce it to the world, which would hopefully say a collective “Oh fuck” and use this result to justify imposing majorly increased security and safety mandates on further AI development.
That would put us in a better position. Then maybe further iteration on that would convince the relevant people “You can’t make it safe; abandon this stuff, use only the previous generation for the economic value it provides, and go and forcibly prevent everyone in the world from making a model this capable.”
With LLMs as they are, I do suspect that dangerous cybersecurity capabilities are fairly near-term, and I don’t think near-term models are likely to hide those capabilities (though it may take some effort to elicit them). So some portions of the above seem likely to me; others are much more of a gamble. I’d say some version of the above has a … 10-30% chance of saving us?