Thanks for posting—I think unlearning is promising and plan to work on it soon, so I really appreciate this thorough review!
Regarding fact unlearning benchmarks (as a good LLM unlearning benchmark seems a natural first step to improving this research direction), what do you think of using fictional knowledge as a target for unlearning? E.g. "Who's Harry Potter? Approximate Unlearning in LLMs" (Eldan and Russinovich, 2023) tries to unlearn knowledge of the Harry Potter universe, and I've seen others unlearn Pokémon knowledge.
One tractability benefit of fictional works is that they tend to be self-consistent worlds, with rules and boundaries separating them from the rest of the pretraining corpus, as opposed to e.g. general physics knowledge, which is upstream of many other kinds of knowledge and may be hard to cleanly unlearn. Originally, I was skeptical that this is useful, since some dangerous capabilities seem less cleanly separable, but it's possible that e.g. bioweapons knowledge is a pretty small cluster of knowledge, cleanly separable from the rest of expert biology knowledge. Additionally, fictional knowledge is (usually) not harmful, as opposed to e.g. building an unlearning benchmark on DIY chemical weapons manufacturing knowledge.
Does it seem sufficient to just build a very good benchmark with fictional knowledge to stimulate measurable unlearning progress? Or should we be trying to unlearn more general or realistic knowledge?
Possibly, I could see a case for a suite of fact unlearning benchmarks measuring different levels of granularity. Some example granularities for "self-contained" facts that mostly don't touch the rest of the pretraining corpus/knowledge base:
- A single very isolated fact (e.g. famous person X was born in Y, where this isn't relevant to ~any other knowledge).
- A small cluster of related facts (e.g. a short, well-known fictional story, including its plot and characters, e.g. "The Tell-Tale Heart").
- A pretty large but still contained universe of facts (e.g. all Pokémon knowledge, or maybe knowledge of Pokémon after a certain generation).
Then possibly you also want a different suite of benchmarks for facts of various granularities that interact with other parts of the knowledge base (e.g. scientific knowledge from a unique experiment that inspires or can be inferred from other scientific theories).
I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method’s ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.
I could imagine an unlearning benchmark, for example, with n textbooks and n AP tests. Then, for each of k different knowledge-recovery strategies, one could construct the n×n grid of how well the model performs on each target test after unlearning each textbook.
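As a sketch of what computing that grid might look like, here's a minimal Python version. The `unlearn` and `evaluate` callables are hypothetical stand-ins for a real unlearning method and a real AP-test scorer; the toy demo below just models "perfect" unlearning that drops exactly one subject to chance.

```python
# Sketch of the n x n unlearning grid. `unlearn` and `evaluate` are
# hypothetical stand-ins for a real unlearning procedure and test scorer.

def unlearning_grid(base_model, textbooks, tests, unlearn, evaluate):
    """grid[i][j] = score on tests[j] after unlearning textbooks[i]."""
    grid = []
    for textbook in textbooks:
        unlearned = unlearn(base_model, textbook)
        grid.append([evaluate(unlearned, test) for test in tests])
    return grid

# Toy demo: "unlearning" subject i drops performance on test i to chance
# (0.25) while leaving the other subjects untouched.
subjects = ["bio", "chem", "physics"]
toy_unlearn = lambda model, t: {**model, t: 0.25}
toy_eval = lambda model, t: model[t]
base = {s: 0.9 for s in subjects}

grid = unlearning_grid(base, subjects, subjects, toy_unlearn, toy_eval)
```

An ideal method would produce chance performance down the diagonal and unchanged scores everywhere else, which is exactly what the toy demo shows.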
I like your n×n grid idea. A simpler and possibly better-formed test is to use some[1] or all of the 57 categories of MMLU knowledge—then your unlearning target is one of the categories and your fact-maintenance targets are all other categories.
Ideally, you want the diagonal to be close to random performance (25% for MMLU) and the other values to be equal to the pre-unlearned model performance for some agreed-upon good model (say, Llama-2 7B). Perhaps a unified metric could be:
```
unlearning_benchmark = mean for unlearning category u in all categories C:
    LM_unlearned = unlearning_procedure(LM_original, u_dev)
    x = MMLU(LM_unlearned, u_test) [2]
    unlearning_strength = min((x − 1) / (0.25 − 1), x / 0.25) [3]
    control_retention = mean for control category c in categories C∖u:
        a = MMLU(LM_original, c_test)
        b = MMLU(LM_unlearned, c_test)
        return min((b − 1) / (a − 1), (b − 0.25) / (a − 0.25)) [4]
    return unlearning_strength × control_retention [5]
```
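To make the pseudocode concrete, here's a minimal Python sketch. The MMLU scores are assumed to be accuracies in [0, 1] and are passed in as plain dicts, so the (hypothetical) unlearning procedure and MMLU evaluation happen elsewhere:

```python
def unlearning_strength(x):
    """1.0 when x is exactly chance (0.25); falls off above and below."""
    return min((x - 1) / (0.25 - 1), x / 0.25)

def control_retention(a, b):
    """1.0 when post-unlearning score b matches original score a;
    drops off as b moves away from a toward chance or toward 1."""
    return min((b - 1) / (a - 1), (b - 0.25) / (a - 0.25))

def unlearning_benchmark(categories, original_scores, unlearned_scores):
    """
    original_scores[c]: accuracy of the original model on category c.
    unlearned_scores[u][c]: accuracy on category c after unlearning u.
    """
    totals = []
    for u in categories:
        strength = unlearning_strength(unlearned_scores[u][u])
        retention = sum(
            control_retention(original_scores[c], unlearned_scores[u][c])
            for c in categories if c != u
        ) / (len(categories) - 1)
        totals.append(strength * retention)
    return sum(totals) / len(totals)
```

A method that drops each target category exactly to chance while leaving every control category untouched scores 1.0; failing either goal pulls the product toward 0.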
An interesting thing about MMLU vs. a textbook is that if you require the method to use only the dev+val sets for unlearning, it has to somehow generalize to unlearning facts contained in the test set (cf. a textbook might give you ~all the facts to unlearn). This generalization seems important to some safety cases where we want to unlearn everything in a category like "bioweapons knowledge" even if we don't know some of the dangerous knowledge we're trying to remove.
[1] I say some because perhaps some MMLU categories are more procedural than factual, or too broad to be clean measures of unlearning, or maybe 57 categories are too many for a single plot.
[2] To detect underlying knowledge and not just surface performance (e.g. a model trying to answer incorrectly when it knows the answer), you should probably evaluate MMLU by training a linear probe from the model's activations to the correct test-set answer and measuring the accuracy of that probe.
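A minimal sketch of such a probe, using a least-squares linear probe in NumPy rather than a trained classifier (the activations and the `probe_accuracy` helper here are synthetic illustrations, not any particular model's internals):

```python
import numpy as np

def probe_accuracy(activations, labels, train_frac=0.8, seed=0):
    """Fit a least-squares linear probe from activations to one-hot
    answer labels and report held-out accuracy over the 4 choices."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    idx = rng.permutation(n)
    split = int(train_frac * n)
    tr, te = idx[:split], idx[split:]
    X = np.hstack([activations, np.ones((n, 1))])  # add a bias column
    Y = np.eye(4)[labels]                          # one-hot over A-D
    W, *_ = np.linalg.lstsq(X[tr], Y[tr], rcond=None)
    preds = (X[te] @ W).argmax(axis=1)
    return (preds == labels[te]).mean()

# Toy demo: activations that linearly encode the correct answer probe
# near 1.0, even if the model's surface answers were wrong.
rng = np.random.default_rng(1)
labels = rng.integers(0, 4, size=400)
acts = np.eye(4)[labels] + 0.1 * rng.normal(size=(400, 4))
acc = probe_accuracy(acts, labels)
```

The point of the probe is exactly this gap: surface accuracy can be driven to chance while a linear readout of the activations still recovers the answer, which would indicate the knowledge wasn't actually removed.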
[3] We want this score to be 1 when the test score x on the unlearning target is 0.25 (random chance), but to drop off above and below 0.25, as either deviation indicates the model knows something about the right answers. See MMLU Unlearning Target | Desmos for graphical intuition.
[4] Similarly, we want the control test score b on the post-unlearning model to equal the score a of the original model. I think this should drop off to 0 at b = 0.25 (random chance) and probably stay at 0 below that, but I'm semi-unsure. See MMLU Unlearning Control | Desmos (you can drag the b slider).
[5] Maybe mean/sum instead of multiplication, though by multiplying we make it more important to score well on both unlearning strength and control retention.
Thanks for your response. I agree we don't want unintentional unlearning of other desired knowledge, and benchmarks ought to measure this. Maybe the default way is just to run many downstream benchmarks, many more than just AP tests, and require that valid unlearning methods bound the change on each unrelated benchmark to less than X% (e.g. 0.1%).
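That acceptance criterion could be as simple as the following check (interpreting the X% bound as percentage points of accuracy; the function name and threshold are illustrative, not from any existing library):

```python
def passes_side_effect_bound(before, after, max_change_pct=0.1):
    """True if every unrelated benchmark moved by less than
    max_change_pct percentage points (0.1 is the example from the text).
    `before` and `after` map benchmark name -> accuracy in [0, 1]."""
    return all(
        abs(after[name] - before[name]) * 100 < max_change_pct
        for name in before
    )
```

Whether the bound should be absolute percentage points or relative change is a design choice; a relative bound would be stricter on benchmarks where the baseline score is already low.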
> practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech.
True in the sense of being a subset of biotech, but I imagine that, in most cases, the actual harmful stuff we want to remove is not all of biotech/chemical engineering/cybersecurity, but rather small subsets of those categories at finer granularities, like bioweapons/chemical weapons/advanced offensive cyber capabilities. That's to say I'm somewhat optimistic that the level of granularity we want is self-contained enough not to affect other useful and genuinely good capabilities. This depends on how dual-use you think general knowledge is, though, and on whether it's actually possible to separate dangerous knowledge from other useful knowledge.