Strong upvoted, I think the ability to selectively and robustly remove capabilities could end up being really valuable in a wide range of scenarios, as well as being tractable.
There are two types of capabilities that it may be good to scope out of models:
Facts: specific bits of knowledge. For example, we would like LLMs not to know the ingredients and steps to make weapons of terror.
Tendencies: other types of behavior. For example, we would like LLMs not to be dishonest or manipulative.
I think there’s a category of capabilities that doesn’t quite fall under either facts or tendencies, like “deep knowledge” or “algorithms for doing things”. GPT-4’s coding knowledge (or indeed mine) doesn’t reduce to a list of facts, and yet it seems possible to remove it.
In the longer term, we might want the ability to remove basically any capability without causing unacceptable collateral damage. The highest priority are capabilities whose removal would let us bound the potential damage an AGI could cause and eliminate failure modes. Beyond capabilities in specific domains, these might include the ability to self-improve, to escape, or to alter itself in ways that change its terminal goals. I think it’s an open question which of these will matter and which are feasible to remove, but it seems well worth advancing scoping research.
+1, I’ll add this and credit you.