I’m surprised you’re surprised that the (simpler) policy found by SGD performs better than the (more complex) policy found by adding a conditional KL term. Let me try to pass your ITT:
In learning, there’s a tradeoff between training performance and simplicity: more complex policies can overfit and generalize worse (iid), while simpler policies may perform worse on the training set.
So if we are given two policies A, B produced by the same training process (but with different random seeds) and told policy A is more complex than policy B, we expect A to perform better on the training set and B to perform better on the validation set. But here we see something different: policy B performs better on both the training set and the validation set. So what’s up?
The key observation is that in this case, A and B are not produced by the same training process. In particular, the extra complexity of A is caused by an auxiliary loss term that we have no reason to expect would improve performance on the training dataset. Given the prior “adding extra loss terms degrades training loss”, we should lower our expectation of A’s performance on the training set.
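For concreteness, here is a minimal sketch of what “adding an auxiliary KL term” to the training objective looks like. The exact form of the conditional KL term isn’t specified above, so the names (`reference_logits`, `beta`) and the plain KL(policy || reference) penalty are illustrative assumptions, not the actual setup; the point is just that policy A’s optimizer has to trade the task loss off against an extra term that policy B’s does not.

```python
import torch
import torch.nn.functional as F

def training_loss(policy_logits, labels, reference_logits=None, beta=0.0):
    """Task cross-entropy, optionally plus an auxiliary KL(policy || reference) penalty.

    Hypothetical sketch: `reference_logits` and `beta` are illustrative, not from
    the original experiment. With beta == 0 this is policy B's plain SGD objective;
    with beta > 0 it is policy A's objective, where the KL term competes with the
    task loss during optimization.
    """
    task_loss = F.cross_entropy(policy_logits, labels)
    if reference_logits is None or beta == 0.0:
        return task_loss  # policy B: task loss only

    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so passing target=policy, input=reference gives KL(policy || reference).
    kl = F.kl_div(
        F.log_softmax(reference_logits, dim=-1),
        F.log_softmax(policy_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return task_loss + beta * kl  # policy A: task loss plus auxiliary penalty
```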
To be clear, I was surprised by EM in general, just not this particular result.