Do you have ideas about the mechanism by which models might be exploiting these spurious correlations in their weights? I can imagine this would be analogous to a human “going with their first thought” or “going with their gut”, but I have a hard time conceptualizing what that would look like for an LLM. If there is any existing research/writing on this, I’d love to check it out.
I think that’s exactly how it goes, yeah. Just free association: what token arbitrarily comes to mind? Like if you stare at some static noise, you will see some sort of lumpiness or pattern, which won’t be the same as what someone else sees. There’s no explaining that at the conscious level. It’s closer to a hash function than any kind of ‘thinking’. You don’t ask what SHA is ‘thinking’ when you put in some text and it spits out some random numbers & letters. (You would see the same thing if you ran an untrained MLP or CNN on MNIST, say. The randomly initialized NN does not produce a uniform output across all digits, for all inputs, and that is the entire point of randomly initializing. As the AI koan about Sussman’s randomly wired neural net goes...)
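A toy sketch of that last point, if it helps (not from any paper; numpy only, and the 784→128→10 sizes, seed, and random “images” are all arbitrary illustrative choices): an MLP that has never seen a gradient step still maps inputs to sharply non-uniform digit “preferences”, deterministically, like a hash.

```python
# Toy sketch, numpy only; 784 -> 128 -> 10 mimics MNIST shapes but is arbitrary.
import numpy as np

rng = np.random.default_rng(0)  # this seed *is* the "arbitrary initialization"

# A 2-layer MLP that has never been trained.
W1 = rng.normal(0, 784 ** -0.5, (784, 128))
W2 = rng.normal(0, 128 ** -0.5, (128, 10))

def predict(x):
    """Argmax class of the untrained net: deterministic, hash-like."""
    return int(np.argmax(np.maximum(0, x @ W1) @ W2))  # ReLU hidden layer

# Tally predictions over 1,000 random "images".
counts = np.bincount([predict(rng.random(784)) for _ in range(1000)],
                     minlength=10)
print(counts)  # far from ~100 per class: the random net already "prefers" digits
```

Change the seed and the preferred digits change too, which is the point: the structure is real and repeatable, but it’s an artifact of that particular random draw, not of anything learned.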
The relevant research on ‘subliminal learning’: https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR/subliminal-learning-llms-transmit-behavioral-traits-via (ie. acausal coordination through arbitrary initialization associations).