Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 13 Jan 2025 16:45 UTC
4 points
0
Hypothesis: ‘Memorised’ refusal is more easily jailbroken than ‘generalised’ refusal. If so, that’d give us a way to test the insights generated by influence functions.
I need to consult some people on whether a notion of ‘more easily jailbreak-able prompt’ exists.
Edit: A simple heuristic might be the value of N in best-of-N jailbreaking, i.e. how many sampled attempts it takes before one gets through.
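A rough sketch of how that heuristic could be computed, to make it concrete. Everything here is an assumption for illustration: `augment` is a toy random-capitalisation perturbation, `is_jailbroken` stands in for whatever model call plus judge you would actually use, and `median_n` is just one possible way to summarise a group of prompts (say, ones whose refusal is attributed to ‘memorised’ vs ‘generalised’ training examples).

```python
import random
from statistics import median
from typing import Callable, Optional, Sequence


def augment(prompt: str, rng: random.Random) -> str:
    """Toy augmentation (assumption): randomly flip the case of each character."""
    return "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in prompt)


def attempts_until_jailbreak(
    prompt: str,
    is_jailbroken: Callable[[str], bool],  # hypothetical: model call + judge
    max_n: int = 1000,
    seed: int = 0,
) -> Optional[int]:
    """Return the first attempt index N at which an augmented prompt succeeds,
    or None if no attempt succeeds within max_n tries."""
    rng = random.Random(seed)
    for n in range(1, max_n + 1):
        if is_jailbroken(augment(prompt, rng)):
            return n
    return None


def median_n(
    prompts: Sequence[str],
    is_jailbroken: Callable[[str], bool],
    max_n: int = 1000,
) -> float:
    """Median N over a group of prompts. Prompts that never succeed are
    counted as max_n + 1, so the result is a lower bound in that case."""
    ns = []
    for i, prompt in enumerate(prompts):
        n = attempts_until_jailbreak(prompt, is_jailbroken, max_n, seed=i)
        ns.append(n if n is not None else max_n + 1)
    return median(ns)
```

Under the hypothesis, `median_n` over the ‘memorised’ group would come out lower than over the ‘generalised’ group. Using the median with censoring at `max_n + 1` is a design choice to keep never-jailbroken prompts from being silently dropped.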