Hypothesis: ‘Memorised’ refusal is more easily jailbroken than ‘generalised’ refusal. If so that’d be a way we could test the insights generated by influence functions
I need to consult some people on whether a notion of ‘more easily jailbreak-able prompt’ exists. Edit: A simple heuristic might be the value of N in best-of-N jailbreaking.
Hypothesis: ‘Memorised’ refusal is more easily jailbroken than ‘generalised’ refusal. If so that’d be a way we could test the insights generated by influence functions
I need to consult some people on whether a notion of ‘more easily jailbreak-able prompt’ exists.
Edit: A simple heuristic might be the value of N in best-of-N jailbreaking.