My intuitive explanation for emergent misalignment-style generalizations is that backprop upweights circuits that improve the likelihood of outputting the correct token, and pre-trained models tend to have existing circuits for things like “act like a generically misaligned AI” or “act like a 19th-century person”. Upweighting these pre-existing circuits is an effective way to improve performance on tasks like reward hacking or using obsolete bird names from a given time period. So backprop strengthens them, despite all the weird generalizations this leads to in other domains.
Upweighting existing circuits that imply the desired behavior is a viable alternative to learning situationally specific behaviors from scratch. The two may happen concurrently, but the former’s results are more coherent and salient, so we’d notice them more regardless.
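To make the intuition concrete, here’s a toy sketch (entirely illustrative, not a claim about real model internals): a frozen nonlinear feature stands in for a pre-existing circuit, a scalar gate `alpha` for how strongly that circuit is expressed, and fresh weights `w` that could learn the behavior from scratch. Both routes receive gradient concurrently, but because the existing circuit already computes something the fresh weights can only approximate, gradient descent ends up turning the gate up rather than rebuilding the behavior from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64

# Frozen "pre-existing circuit": a fixed nonlinear feature the pretrained
# model already computes (standing in for, say, a misaligned-persona feature).
u = rng.normal(size=d)
u /= np.linalg.norm(u)

def circuit(X):
    return np.tanh(X @ u)

X = rng.normal(size=(n, d))
y = circuit(X)  # fine-tuning targets happen to match the circuit's output

alpha = 0.0      # scalar gate: how strongly the existing circuit is used
w = np.zeros(d)  # fresh linear weights that could learn the task from scratch
lr = 0.5

for step in range(500):
    pred = alpha * circuit(X) + X @ w
    resid = pred - y
    # Full-batch MSE gradients: both routes get gradient concurrently.
    g_alpha = 2 * np.mean(resid * circuit(X))
    g_w = 2 * X.T @ resid / n
    alpha -= lr * g_alpha
    w -= lr * g_w

print(f"gate on existing circuit:     {alpha:.2f}")               # -> ~1.00
print(f"norm of from-scratch weights: {np.linalg.norm(w):.2f}")   # -> ~0.00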