But, as I understand the AO paper, they do not actually validate on models like these; they only test on models that they have trained to be misaligned in fairly straightforward ways. I still think it's good work — I'm just pointing out a gap that future work could address.
I agree, and was pointing to the nearest thing to that I'm currently aware of, for anyone who wanted to try this now.