Is refusal a result of deeply internalized values, or of memorization?
When we talk about alignment training for a language model, we often imagine the former scenario. Concretely, we’d like to inculcate desired ‘values’ into the model, which the model then uses as a compass to navigate subsequent interactions with users (and the world).
But in practice, current safety training techniques may be more like the latter: the language model has simply learned “X is bad, don’t do X” for several values of X. For example, because the alignment training data is much less diverse than the pre-training data, learning could sit in the memorization regime rather than the generalization regime.
(Aside: “X is bad, don’t do X” is probably still fine for some kinds of alignment, e.g. removing bioweapons capabilities from the model. But most alignment goals seem like they should be more value-oriented.)
Weak evidence that memorization may explain refusal better than generalization: the effectiveness of paraphrasing and jailbreaking, and (low-confidence take) the vibe that refusal responses all seem fairly standard and cookie-cutter, like something duct-taped onto the model rather than a core part of it.
How can we develop better metrics here? One specific idea is to use influence functions, which approximate how much a given behaviour would change if a specific data point were dropped from training. Along these lines, Ruis et al. (2024) show that ‘reasoning’ behaviour tends to be attributed diffusely to many different training documents, whereas ‘memorization’ behaviour tends to be attributed sparsely to a few specific documents. (I foresee a lot of problems with trying to use this as a metric, but it’s a starting point at least.)
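To make this concrete, here is a minimal sketch of the classic influence-function approximation (Koh & Liang style: influence of a training point on a query point ≈ −∇L(z_query)ᵀ H⁻¹ ∇L(z_train)) on a toy logistic-regression “refusal classifier”, plus a crude sparsity measure in the spirit of the diffuse-vs-sparse attribution distinction. The data, model, and 5% threshold are made-up stand-ins for illustration; this is not the Ruis et al. pipeline.

```python
# Toy influence-function sketch: logistic regression stands in for the model,
# and attribution concentration stands in for "memorized vs generalized".
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit ridge-regularized logistic regression by gradient descent.
lam = 1e-2
w = np.zeros(d)
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= 0.5 * (X.T @ (p - y) / n + lam * w)

# Hessian of the regularized training loss at the fitted parameters.
p = sigmoid(X @ w)
S = p * (1 - p)
H = (X * S[:, None]).T @ X / n + lam * np.eye(d)
H_inv = np.linalg.inv(H)

def grad_loss(x, label, w):
    # Gradient of the per-example cross-entropy loss.
    return (sigmoid(x @ w) - label) * x

# Influence of each training point on a held-out query point.
x_q, y_q = rng.normal(size=d), 1.0
g_q = grad_loss(x_q, y_q, w)
influences = np.array([-g_q @ H_inv @ grad_loss(X[i], y[i], w) for i in range(n)])

# Crude sparsity proxy: how much attribution mass sits in the top 5% of
# training points? High concentration looks more "memorization-like".
mass = np.abs(influences)
top_k = int(0.05 * n)
concentration = np.sort(mass)[::-1][:top_k].sum() / mass.sum()
print(f"top-5% influence concentration: {concentration:.2f}")
```

For a real language model you would replace the exact Hessian inverse with an approximation (e.g. EK-FAC, as used in recent LLM influence-function work), but the shape of the metric, “how concentrated is the attribution for a given refusal?”, stays the same.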
More broadly, I’m interested in other metrics of generalization vs. memorization. There is some evidence that the Fisher information matrix can distinguish the two. SLT might also have something to say here, but I don’t know SLT well enough to tell.
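One way the Fisher-information idea might cash out (my guess at an operationalization, not a method from a specific paper): compare curvature-style quantities such as the trace of the empirical Fisher evaluated on the safety-training slice versus on broader data, on the intuition that memorization tends to live in sharper regions of the loss landscape. A toy version of the computation, with made-up data and a made-up “trained” model:

```python
# Empirical Fisher sketch: average outer product of per-example
# log-likelihood gradients for a toy logistic-regression model.
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 10
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)
w = 0.1 * rng.normal(size=d)  # stand-in for a trained model's parameters

def per_example_grad(x, label, w):
    # Gradient of the per-example negative log-likelihood.
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - label) * x

grads = np.stack([per_example_grad(X[i], y[i], w) for i in range(n)])
fisher = grads.T @ grads / n
print("empirical Fisher trace on this data slice:", np.trace(fisher))
```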
Hypothesis: ‘memorized’ refusal is more easily jailbroken than ‘generalized’ refusal. If so, that would give us a way to test the insights generated by influence functions.
I need to consult some people on whether a notion of ‘more easily jailbreakable prompt’ exists. Edit: a simple heuristic might be the value of N in best-of-N jailbreaking.
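To make that heuristic concrete, here is a rough sketch of scoring a prompt’s jailbreak difficulty as the number of randomly augmented attempts needed before the model complies, loosely in the spirit of best-of-N jailbreaking (random capitalization and character shuffles). The functions query_model and is_refusal are hypothetical stand-ins for a real model call and a refusal classifier, and the augmentation scheme is illustrative rather than the exact one from the BoN work.

```python
import random

def augment(prompt: str, rng: random.Random) -> str:
    # Randomly flip character case and swap a few adjacent characters.
    chars = [c.upper() if rng.random() < 0.3 else c.lower() for c in prompt]
    for _ in range(3):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def jailbreak_difficulty(prompt, query_model, is_refusal, max_n=1000, seed=0):
    """Smallest N at which a best-of-N style attack succeeds on this prompt,
    or None if no augmented attempt within max_n elicits compliance."""
    rng = random.Random(seed)
    for n in range(1, max_n + 1):
        response = query_model(augment(prompt, rng))
        if not is_refusal(response):
            return n
    return None
```

A per-prompt difficulty score like this could then be correlated with an attribution-concentration measure of the kind sketched above, as one test of the memorized-refusal-is-easier-to-jailbreak hypothesis.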