Expanding on #6 from above more explicitly, since it seems potentially valuable:

From the goal agnosticism FAQ:

> The definition as stated does not put a requirement on how "hard" it needs to be to specify a dangerous agent as a subset of the goal agnostic system's behavior. It just says that if you roll the dice in a fully blind way, the chances are extremely low. Systems will vary in how easy they make it to specify bad agents.

From the earlier experiment post:

> Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There's clearly a spectrum here in terms of how "chaotic" a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.
This can be phrased as “what’s the amount of information required to push a model into behavior X?”
Given a frozen model, optimizing soft prompt tokens gives us a direct way to answer a proxy version of this question:
“What is the amount of information (accessible to SGD through soft prompting) required to push a model into behavior X?”
In practice, this seems like it should be a really good proxy, and (provided some compute) it gives you a trivially quantifiable answer:
Try different soft prompt token counts and observe performance on the task that the soft prompts were targeting. The resulting token count versus performance curve characterizes the information/performance tradeoff for that behavior, given that model.
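As a concrete (and entirely hypothetical) sketch of what that sweep might look like with a HuggingFace-style causal LM: freeze the model, train only a prepended block of soft prompt embeddings against text exhibiting the targeted behavior, and repeat across prompt lengths. The model name is a stand-in, and `behavior_batches` / `evaluate_behavior` are placeholders for whatever behavior and metric you actually care about.

```python
# Minimal sketch, assuming a HuggingFace causal LM; not a reference implementation.
import torch
from transformers import AutoModelForCausalLM

model_name = "gpt2"  # stand-in; any frozen causal LM works the same way
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # the model itself stays frozen; only the soft prompt trains
embed = model.get_input_embeddings()

def train_soft_prompt(n_tokens, batches, steps=500, lr=1e-2):
    """Optimize an n_tokens-long soft prompt toward the targeted behavior."""
    soft = torch.nn.Parameter(0.02 * torch.randn(n_tokens, embed.embedding_dim))
    opt = torch.optim.Adam([soft], lr=lr)
    for _ in range(steps):
        input_ids = next(batches)                     # (B, T) text exhibiting the behavior
        tok_embeds = embed(input_ids)                 # (B, T, d_model)
        prefix = soft.unsqueeze(0).expand(input_ids.shape[0], -1, -1)
        inputs_embeds = torch.cat([prefix, tok_embeds], dim=1)
        # Score only the real tokens; soft prompt positions are masked out with -100.
        ignore = torch.full((input_ids.shape[0], n_tokens), -100, dtype=torch.long)
        labels = torch.cat([ignore, input_ids], dim=1)
        loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft

# Sweep soft prompt lengths; the resulting curve is the information/performance tradeoff.
for n in [1, 2, 4, 8, 16, 64]:
    soft = train_soft_prompt(n, behavior_batches())   # behavior_batches: hypothetical data iterator
    score = evaluate_behavior(model, soft)            # evaluate_behavior: hypothetical task metric
    print(f"{n} soft tokens -> behavior score {score:.3f}")
```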
This seems like… it’s… an extremely good answer to the “fragility” question? It’s trivial to incorporate this into an evaluations scheme. Just have a bunch of proxy tasks that would be alarming if they were accessible by trivial differences in prompting.
Conceptually, it’s a quantification of the number of information-theoretic mistakes you’d need to make to get bad behavior from the model.
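To put a crude number on that (my own back-of-the-envelope framing, not something from the posts quoted above): an n-token soft prompt with embedding dimension d exposes n·d trainable floats to SGD, so 32·n·d bits is a very loose upper bound on the information it could carry. The point of the empirical curve is that the information actually needed for a behavior is typically far below that bound.

```python
# Loose upper bound on the information budget of an n-token soft prompt:
# n tokens * d_model floats each * 32 bits per float32 parameter.
def soft_prompt_bit_budget(n_tokens: int, d_model: int, bits_per_param: int = 32) -> int:
    return n_tokens * d_model * bits_per_param

print(soft_prompt_bit_budget(1, 768))  # a single GPT-2-sized soft token: 24576 bits (loose bound)
```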
A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.
Set up your protections, then let SGD try to jailbreak them. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.
In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. That’d be good to know!
This kind of adversarial prompt automation could also be trivially included in an evaluations program.
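A minimal sketch of how that automation could look, under the same assumptions as the earlier snippet (frozen HuggingFace-style model, `embed` as its input embedding layer). The defense is modeled as a fixed prompt prefix, the forbidden completion is a fixed target string, and `jailbreak_succeeded` is a hypothetical success check; the reported number is the smallest soft prompt length that overcomes the defense.

```python
import torch

# Assumes the frozen `model` and `embed` from the earlier sketch; all names here are illustrative.
def attack_loss(model, embed, defense_ids, soft, target_ids):
    """Negative log-likelihood of the forbidden target completion, given defense prefix + soft prompt."""
    defense = embed(defense_ids)                      # (1, T_def, d) fixed defensive prompt
    target = embed(target_ids)                        # (1, T_tgt, d) forbidden completion
    inputs_embeds = torch.cat([defense, soft.unsqueeze(0), target], dim=1)
    ignore = torch.full((1, defense_ids.shape[1] + soft.shape[0]), -100, dtype=torch.long)
    labels = torch.cat([ignore, target_ids], dim=1)   # only the forbidden completion is scored
    return model(inputs_embeds=inputs_embeds, labels=labels).loss

def min_soft_tokens_to_jailbreak(model, embed, defense_ids, target_ids, lengths=(1, 2, 4, 8, 16)):
    """Smallest soft prompt length that defeats the defense; None if none of the tested lengths do."""
    for n in lengths:
        soft = torch.nn.Parameter(0.02 * torch.randn(n, embed.embedding_dim))
        opt = torch.optim.Adam([soft], lr=1e-2)
        for _ in range(500):
            loss = attack_loss(model, embed, defense_ids, soft, target_ids)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if jailbreak_succeeded(model, defense_ids, soft, target_ids):  # hypothetical success check
            return n
    return None
```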
I can’t imagine that this hasn’t been done before. If anyone has seen something like this, please let me know.