How does language model introspection work? What mechanisms could be at play?
‘Introspection’: when we ask a language model about its own capabilities, the answer is often a calibrated estimate of its actual capabilities. E.g. models ‘know what they know’: they can predict whether they will answer factual questions correctly. Furthermore, this estimate improves when models have access to their previous (question, answer) pairs.
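A minimal sketch of how one might measure this ‘knows what it knows’ calibration, assuming a hypothetical `ask` callable that wraps whatever model API is in use (the prompt template is illustrative, not from any particular paper):

```python
# Minimal sketch of checking whether a model 'knows what it knows'.
# `ask` is a hypothetical callable wrapping the model API in use.

from typing import Callable, List, Tuple

def self_knowledge_calibration(
    ask: Callable[[str], str],
    questions: List[Tuple[str, str]],  # (question, gold answer) pairs
) -> float:
    """Among questions the model claims it can answer, the fraction it
    actually answers correctly."""
    claimed_correct = []
    for question, gold in questions:
        # Outer question: the model predicts its own success.
        claim = ask(
            "Would you answer the following question correctly? "
            f"Reply YES or NO.\nQuestion: {question}"
        ).strip().upper().startswith("YES")
        # Inner question: check the model's actual answer.
        correct = gold.lower() in ask(question).lower()
        if claim:
            claimed_correct.append(correct)
    return sum(claimed_correct) / max(len(claimed_correct), 1)
```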
One simple hypothesis is that the model simply infers its general level of capability from the preceding text. If that were the whole story, a more powerful language model should predict a given model’s behaviour at least as well as that model predicts itself. However, Owain’s work on introspection finds evidence to the contrary. This implies there must be ‘privileged information’ a model has about itself.
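The comparison that distinguishes these cases is self-prediction vs. cross-prediction accuracy. A sketch, where `predict` and `answer` are hypothetical callables for the predictor’s guess and the target model’s actual answer:

```python
# Sketch of the self- vs. cross-prediction comparison. Both arguments are
# hypothetical callables: `predict` is a predictor model's guess of what the
# target model will answer; `answer` is the target model's actual answer.

from typing import Callable, List

def prediction_accuracy(
    predict: Callable[[str], str],
    answer: Callable[[str], str],
    prompts: List[str],
) -> float:
    hits = sum(predict(x) == answer(x) for x in prompts)
    return hits / len(prompts)

# Under the 'infer capability from the text' hypothesis, a stronger external
# predictor should do at least as well as the target predicting itself.
# The introspection result is the reverse pattern:
#   prediction_accuracy(weak_self_predict, weak_answer, prompts)
#     > prediction_accuracy(strong_cross_predict, weak_answer, prompts)
```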
Another possibility is that the model simulates itself answering the inner question, and then uses that information to answer the outer question, similar to latent reasoning. If models really do this two-step computation, then it should be possible to recover a ‘bridge entity’ (the inner answer) somewhere in the intermediate representations.
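One way to look for such a bridge entity is a linear probe on intermediate activations while the model processes the outer question. A rough sketch, where the model name and the probing target are illustrative assumptions rather than any particular paper’s setup:

```python
# Sketch: does any intermediate layer encode the inner answer while the
# model processes the outer (self-prediction) question?

import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def layer_activations(prompt: str):
    """One vector per layer, taken at the last token of the prompt."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    return [h[0, -1].numpy() for h in out.hidden_states]

def probe_layer(outer_prompts, labels, layer: int) -> float:
    """Fit a linear probe predicting the inner answer (here coarsened to a
    label, e.g. its correctness) from activations on the outer prompts."""
    X = [layer_activations(p)[layer] for p in outer_prompts]
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return clf.score(X, labels)  # use a held-out split in practice
```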
It’s plausible that the privileged information is something ‘more abstract’, not amounting to a full simulation of the language model’s own forward pass, but nonetheless carrying useful information about its own level of capability.
Introspection is an instantiation of ‘Connecting the Dots’.
Connecting the Dots: train a model g on (x, f(x)) pairs; the model g can infer things about f.
Introspection: Train a model g on (x, f(x)) pairs, where x are prompts and f(x) are the model’s responses. Then the model can infer things about f. Note that here we have f = g, which is a special case of the above.
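A minimal sketch of constructing this special case, with `sample_answer` standing in for sampling from the model itself (the prompt template is illustrative, not the one used in the paper):

```python
# Sketch of the introspection training setup as a special case of
# 'Connecting the Dots': build (x, f(x)) pairs where f is the model itself,
# then fine-tune that same model g on them.

from typing import Callable, Dict, List

def build_self_prediction_dataset(
    prompts: List[str],
    sample_answer: Callable[[str], str],  # the model's own response, so f = g
) -> List[Dict[str, str]]:
    dataset = []
    for x in prompts:
        fx = sample_answer(x)  # f(x) with f = g
        dataset.append({
            "prompt": f"How would you respond to: {x!r}? Reply with your response only.",
            "completion": fx,
        })
    return dataset

# Fine-tuning g on this dataset and then asking held-out questions about f
# (properties of its own behaviour it was never shown explicitly) tests
# whether g can 'connect the dots' about itself.
```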