My intuitive explanation for emergent misalignment-style generalizations is that backprop upweights circuits that improve the likelihood of outputting the correct token, and pre-trained models tend to have existing circuits for things like “act like a generically misaligned AI” or “act like a 19th-century person”. Upweighting these pre-existing circuits is an effective way to improve performance on tasks like reward hacking or using obsolete bird names from a given time period. So backprop strengthens them, despite all the weird generalizations this leads to in other domains.
Upweighting existing circuits that imply the desired behavior is a viable alternative to learning situationally specific behaviors from scratch. The two may happen concurrently, but the former’s results are more coherent and salient, so we’d notice them more regardless.
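To make the intuition concrete, here’s a toy sketch (entirely illustrative, not a claim about real model internals): a frozen nonlinear feature stands in for a pre-existing circuit, a scalar gate `alpha` for how strongly that circuit is expressed, and fresh weights `w` that could learn the behavior from scratch. Both routes receive gradient concurrently, but because the existing circuit already computes something the fresh weights can only approximate, gradient descent ends up turning the gate up rather than rebuilding the behavior from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64

# Frozen "pre-existing circuit": a fixed nonlinear feature the pretrained
# model already computes (standing in for, say, a misaligned-persona feature).
u = rng.normal(size=d)
u /= np.linalg.norm(u)

def circuit(X):
    return np.tanh(X @ u)

X = rng.normal(size=(n, d))
y = circuit(X)  # fine-tuning targets happen to match the circuit's output

alpha = 0.0      # scalar gate: how strongly the existing circuit is used
w = np.zeros(d)  # fresh linear weights that could learn the task from scratch
lr = 0.5

for step in range(500):
    pred = alpha * circuit(X) + X @ w
    resid = pred - y
    # Full-batch MSE gradients: both routes get gradient concurrently.
    g_alpha = 2 * np.mean(resid * circuit(X))
    g_w = 2 * X.T @ resid / n
    alpha -= lr * g_alpha
    w -= lr * g_w

print(f"gate on existing circuit:     {alpha:.2f}")               # -> ~1.00
print(f"norm of from-scratch weights: {np.linalg.norm(w):.2f}")   # -> ~0.00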