Since they are predicting the text of other systems, it is hard to see any advantage for introspection.
I think this is an oversimplification. LLM base models are trained on the internet. The Internet contains a great many examples of humans doing introspection. Thus an LLM base model has very definitely been trained in predicting what answer an human would give if they introspected their own thought processes. But that is not training introspection: it’s training in acting.
Now, if LLMs thought processes were already sufficiently close to those of humans, then it might turn out that the shortest path to learning how to best imitate a human doing introspection was for an LLM to learn how to accurately introspect themselves. But to the extent that LLMs use different mechanisms to produce the same results as humans, I’d expect a base model’s introspection to actively attempt to conceal that.
What is less clear to me is whether a reasoning model with training in metacognitive tasks such as noticing when it’s confused and should start again, might get better at introspecting itself, as opposed to at figuring out whether a human would be confused in the same situation.
I think this is an oversimplification. LLM base models are trained on the internet. The Internet contains a great many examples of humans doing introspection. Thus an LLM base model has very definitely been trained in predicting what answer an human would give if they introspected their own thought processes. But that is not training introspection: it’s training in acting.
Now, if LLMs thought processes were already sufficiently close to those of humans, then it might turn out that the shortest path to learning how to best imitate a human doing introspection was for an LLM to learn how to accurately introspect themselves. But to the extent that LLMs use different mechanisms to produce the same results as humans, I’d expect a base model’s introspection to actively attempt to conceal that.
What is less clear to me is whether a reasoning model with training in metacognitive tasks such as noticing when it’s confused and should start again, might get better at introspecting itself, as opposed to at figuring out whether a human would be confused in the same situation.