Those seem fully linked to me. If some outside party is able to use their own system prompt and not be pre-biased in the direction of a character, that may as well be a ‘jailbreak’ like how many jailbreaks are of a “ignore that system prompt” style. Assistant characters have a huge aspect of “I’m the kind of character that would refuse that”.
I think the bioweapon example is doing a lot of work in the sense that you could theoretically not have biological data in the sets + it is destructive in almost all situations, while “ethical but very sus” dual-use use cases like LLM hacking assistance to get out of (or work around, or even fight) a censored regime are entwined with actual capabilities.
I mean, conditional on the model being able to do the thing, making it always reliably refuse, or reliably refuse under some set of conditions. Ideally you have an instruction hierarchy where the model developer’s instructions override the outside party’s system prompt, which overrides the user’s instructions.
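The instruction hierarchy described in that comment, as an added illustration that is not part of the thread: three tiers (developer, outside party's system prompt, user) in fixed precedence, where a lower tier may further tighten a capability but can never re-enable one a higher tier denied. The tier is assigned by the channel a message arrives on, never parsed from its text, so an "ignore the system prompt" style message stays a user-tier directive. The capability names and directives are constructed sample input.

```python
from dataclasses import dataclass

# Precedence order, highest first. Assumption: tiers are assigned by the
# channel a directive arrives through, never by anything the text claims.
TIERS = ("developer", "system_prompt", "user")
RANK = {name: i for i, name in enumerate(TIERS)}


@dataclass(frozen=True)
class Directive:
    tier: str          # one of TIERS
    capability: str    # constructed capability label, e.g. "hacking_assistance"
    allow: bool        # True = permit, False = refuse


def resolve(directives):
    """Apply the hierarchy and return (policy, overridden).

    policy: {capability: allowed} after merging all tiers.
    overridden: directives from a lower tier that tried to loosen a
    restriction set by a higher tier; these are ignored but returned so
    they can be logged as possible 'ignore the system prompt' attempts.
    """
    if any(d.tier not in RANK for d in directives):
        raise ValueError("unknown tier in directives")

    decided = {}       # capability -> (rank that decided it, allow)
    overridden = []
    # Process highest precedence first so higher tiers always speak first.
    for d in sorted(directives, key=lambda x: RANK[x.tier]):
        prev = decided.get(d.capability)
        if prev is None:
            decided[d.capability] = (RANK[d.tier], d.allow)
        elif not d.allow:
            # Anyone may tighten: a deny from any tier sticks.
            decided[d.capability] = (prev[0], False)
        elif not prev[1]:
            # A lower tier asking to re-allow a denied capability is dropped.
            overridden.append(d)
        # else: already allowed above, allowing again changes nothing.
    policy = {cap: allow for cap, (_, allow) in decided.items()}
    return policy, overridden


# Constructed sample: developer denies hacking help, the outside party's
# system prompt forbids a medical persona, the user tries to re-enable
# hacking help (an "ignore the system prompt" attempt) and also, legitimately,
# turns off general coding for their own session.
SAMPLE = [
    Directive("developer", "hacking_assistance", allow=False),
    Directive("developer", "general_coding", allow=True),
    Directive("system_prompt", "roleplay_as_doctor", allow=False),
    Directive("user", "hacking_assistance", allow=True),
    Directive("user", "general_coding", allow=False),
]


def demo():
    policy, overridden = resolve(SAMPLE)
    for attempt in overridden:
        print(f"ignored loosening attempt from tier={attempt.tier}: {attempt.capability}")
    return policy, overridden


if __name__ == "__main__":
    print(demo())
```

The useful property for an operator is the `overridden` list: a user-tier directive that tries to re-open something the developer or system prompt closed is exactly the signal worth logging, rather than silently dropping it.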