This was actually my expectation when I started my research on introspection using the game paradigms early this year. Seeing some positive results (both in my own experiments and in the work of others) has made me consider what mechanisms could enable it. I can imagine something in post-training, where the model is evaluating its own outputs, inducing models to form late-middle-layer representations that correspond to confidence (perhaps relating to the magnitude or sparsity of residual-stream vectors) and late-layer mappings from those representations to decisions about whether to answer or admit ignorance (especially since this is something they are likely being specifically RL-trained for). Maybe this sort of self-evaluative post-training even pushes response choices down to earlier layers, so that the final layers have more room for high-level contextual mappings, enabling the self-prediction effects seen in Experiment 2 of my paper (Binder et al. propose something like this to explain what their targeted fine-tuning does to enable self-prediction in their paradigm). So in both of those cases it's about later layers learning to “introspect” on lower layers. It's also possible that the attention weights are learning to extract such self-relevant signals from earlier in the context.
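To make the first hypothesis concrete: if confidence were encoded in the magnitude or sparsity of residual-stream vectors at some late-middle layer, a simple linear probe on those two statistics should separate "answer" from "abstain" cases. The sketch below is purely illustrative — the residual-stream vectors are synthetic, and the assumption that known-answer states are larger and sparser is mine, not an established finding.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hypothetical residual-stream width

# Synthetic stand-ins for late-middle-layer residual vectors.
# Assumption (for illustration only): "known answer" states are
# higher-magnitude and sparser than "unknown answer" states.
known = rng.normal(0, 2.0, size=(200, d)) * (rng.random((200, d)) < 0.2)
unknown = rng.normal(0, 1.0, size=(200, d)) * (rng.random((200, d)) < 0.6)

def features(X):
    # Two candidate confidence signals: L2 magnitude and
    # sparsity (fraction of near-zero entries).
    mag = np.linalg.norm(X, axis=1)
    sparsity = (np.abs(X) < 1e-3).mean(axis=1)
    return np.stack([mag, sparsity], axis=1)

X = np.vstack([features(known), features(unknown)])
y = np.array([1] * 200 + [0] * 200)  # 1 = answer, 0 = admit ignorance

# Tiny logistic-regression probe, trained by gradient descent,
# standing in for a learned late-layer readout.
Xs = (X - X.mean(0)) / X.std(0)
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(Xs @ w + b)))
    grad = p - y
    w -= 0.1 * (Xs.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

acc = ((1 / (1 + np.exp(-(Xs @ w + b))) > 0.5) == y).mean()
print(f"probe accuracy on synthetic data: {acc:.2f}")
```

On real activations one would extract the vectors at a chosen layer and check whether such a probe transfers across tasks; here the point is only that magnitude and sparsity are the kind of cheap, linearly-readable signals a late-layer decision circuit could plausibly exploit.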