That’s very interesting in the second article: a model trained to predict its own future behaviors could do so better than one that hadn’t been trained that way.
Models only exhibit introspection on simpler tasks. Our tasks, while demonstrating introspection, do not have practical applications. To find out what a model does in a hypothetical situation, one could simply run the model on that situation – rather than asking it to make a prediction about itself (Figure 1). Even for tasks like this, models failed to outperform baselines if the situation involves a longer response (e.g. generating a movie review) – see Section 4. We also find that models trained to self-predict (which provide evidence of introspection on simple tasks) do not have improved performance on out-of-distribution tasks that are related to self-knowledge (Section 4).
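To make that setup concrete, here is a minimal sketch of the kind of comparison described in that passage: running the model on the situation directly versus asking it to predict a property of its own hypothetical response. The `query_model` helper and the “predict the first word” property are illustrative assumptions, not the paper’s actual code or prompts.

```python
# Illustrative sketch, not the paper's code: compare a model's hypothetical
# self-prediction against its actual behavior on the same prompt.
# `query_model` is a hypothetical stand-in for whatever API samples from the model.

def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a call to the model under test")


def self_prediction_matches(situation: str) -> bool:
    # Ground truth: simply run the model on the situation directly.
    actual = query_model(situation).strip().lower()

    # Hypothetical: ask the same model to predict a property of the response
    # it *would* give, without actually giving it.
    prediction = query_model(
        "Suppose you were given the following prompt:\n\n"
        f"{situation}\n\n"
        "Without answering it, predict the first word of the response you would give."
    ).strip().lower()

    # Score agreement between the predicted property and the actual behavior
    # (here, just the first word of each).
    return actual.split()[:1] == prediction.split()[:1]
```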
This is very strange, because humans seem to find it easier to introspect on bigger, higher-level experiences, like feelings or the broad narrative of reaching a decision, than on something like how they recalled the spelling of a word. For these models it looks like the reverse.
I’d have to look back at the methodology to be sure, but assuming they have the model answer immediately, without any chain of thought, my default guess is that this is about the limits of what can be done in one or a small number of forward passes. If the model is doing some sort of internal simulation of its own behavior on the task, that seems like it might require more steps of computation than a couple of forward passes allow. Intuitively, at least, this sort of internal simulation is what I imagine happens when humans introspect on hypothetical situations.
If, on the other hand, the model is using some other approach, perhaps circuitry developed specifically for this sort of self-prediction, then I would expect that approach to handle only fairly simple problems, since that circuitry has to be much smaller than the circuitry for actually handling a very wide range of tasks, i.e. the rest of the model.
I think there’s starting to be evidence that models are capable of something like actual introspection, notably ‘Tell me about yourself: LLMs are aware of their learned behaviors’ and (to a more debatable extent) ‘Looking Inward: Language Models Can Learn About Themselves by Introspection’. That doesn’t necessarily mean that it’s what’s happening here, but I think it means we should at least consider it possible.