Penalize Model Complexity Via Self-Distillation

research_prime_space 4 Apr 2023 18:52 UTC

LW: 15 AF: 6

When you self-distill a model (i.e. train a new model on the predictions of your old model), the resulting model represents a less complex function. After many rounds of self-distillation, you essentially end up with a constant function. This paper makes the above more precise.

So if you apply multiple rounds of self-distillation to a model, it becomes less complex. If the original model learned complex, power-seeking behaviors that don’t help it do well on the training data, those behaviors would likely go away after several rounds of self-distillation. Self-distillation essentially lets you get the minimum-complexity model that still does well on the test set. Thus, I think it’s promising from an AI safety standpoint.
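
To make the repeated self-distillation loop concrete, here is a minimal sketch. It assumes kernel ridge regression with an RBF kernel as the model class, which is a simplification chosen for illustration and not something specified in this post; the names `rbf_kernel`, `self_distill`, `ridge`, and `length_scale` are hypothetical. Each round refits the model on the previous round's predictions, and the sketch tracks the norm of the fitted values as a crude proxy for how much function is left.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    """Gaussian (RBF) kernel matrix between two 1-D point sets."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def self_distill(x, y, rounds=5, ridge=0.1):
    """Refit a kernel ridge regressor on its own training-set predictions.

    Returns a list of target vectors: index 0 holds the original labels,
    index t holds the round-t model's predictions on the training inputs.
    """
    K = rbf_kernel(x, x)
    # Smoother matrix (K + ridge*I)^{-1} K maps targets to fitted values.
    # K and (K + ridge*I) commute, so this equals K (K + ridge*I)^{-1}.
    smoother = np.linalg.solve(K + ridge * np.eye(len(x)), K)
    targets = [y]
    for _ in range(rounds):
        # The "student" is trained on the "teacher's" predictions.
        targets.append(smoother @ targets[-1])
    return targets

# Toy data (an assumption for illustration): a noisy sine wave.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.shape)

history = self_distill(x, y, rounds=50)
norms = [float(np.linalg.norm(t)) for t in history]
print(f"norm of labels: {norms[0]:.3f}, after 50 rounds: {norms[-1]:.3f}")
```

In this toy setup every round scales each eigen-component of the targets by a factor below one, so the fitted function shrinks toward the zero function, with the components the kernel weights least disappearing first. That matches the "constant function" limit described above, but only for this simplified model class; it says nothing by itself about how a large neural network would behave.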



scasper 7 Apr 2023 16:23 UTC
LW: 4 AF: 3
−2
AF
I agree with this take. In general, I would like to see self-distillation, distillation in general, or other network compression techniques be studied more thoroughly for de-agentifying, de-backdooring, and robustifying networks. I think this would work pretty well and probably be pretty tractable to make progress on.
- research_prime_space 8 Apr 2023 2:49 UTC
  LW: 2 AF: 2
  1
  AF Parent
  I think self-distillation is better than network compression, as it possesses some decently strong theoretical guarantees that you’re reducing the complexity of the function. I haven’t really seen the same with the latter.
  But what research do you think would be valuable, other than the obvious (self-distill a deceptive, power-hungry model to see if the negative qualities go away)?
  - scasper 9 Apr 2023 22:25 UTC
    LW: 1 AF: 1
    0
    AF Parent
    One idea that comes to mind is to see if a chatbot that is vulnerable to DAN-type prompts could be made robust to them by self-distillation on non-DAN-type prompts.
    
    I’d also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974
    - research_prime_space 10 Apr 2023 0:13 UTC
      LW: 1 AF: 1
      0
      AF Parent
      I don’t really think the first idea would work: following DAN-style prompts is the minimum-complexity solution, since you want the model to act in accordance with the prompt.
      Backdoors don’t emerge naturally. So if it’s computationally infeasible to find an input where the original model and the backdoored model differ, then self-distillation on the backdoored model is going to be the same as self-distillation on the original model.
      The only scenario where I think self-distillation is useful is this: if you 1) train an LLM on a dataset, 2) fine-tune it to be deceptive/power-seeking, and 3) self-distill it on the original dataset, then the self-distilled model would likely no longer be deceptive/power-seeking.
quetzal_rainbow 7 Apr 2023 18:36 UTC
1 point
0
if the original model learned complex, power-seeking behaviors that doesn’t help it do well on the training data
The problem with power-seeking behavior is that it helps a model do well on quite a broad range of tasks.
- research_prime_space 8 Apr 2023 0:58 UTC
  2 points
  0
  Parent
  As of right now, I don’t think that LLMs are trained to be power seeking and deceptive.
  Power-seeking is likely if the model is directly maximizing rewards, but LLMs are not quite doing this.
research_prime_space 7 Apr 2023 16:02 UTC
LW: 1 AF: 1
0
AF
I just wanted to add another angle. Neural networks have a fundamental “simplicity bias”, where they learn low-frequency components exponentially faster than high-frequency components. Thus, self-distillation is likely to be more efficient than training on the original dataset, because the function you’re learning has fewer high-frequency components. This paper formalizes this claim.
But in practice, what this means is that training GPT-3.5 from scratch is hard, while simply copying GPT-3.5 is pretty easy. Stanford was recently able to finetune a pretty bad 7B model to behave qualitatively like GPT-3.5 using only 52K examples (generated from GPT-3.5) for less than $600 in total, covering both data generation and compute. This means that once a GPT is out there, it’s fairly easy for malevolent actors to replicate it. And while it’s unlikely that the original GPT model, given its strong simplicity bias, is engaging in complicated deceptive behavior, it’s highly likely that a malevolent actor would finetune their copy to be deceptive and power-seeking. This creates a perfect storm where malevolent AI can go rogue. I think this is a significant threat, and OpenAI should add some more guardrails to try to prevent this.