Second, my claim about introspection: perhaps I should weaken it to “upstream optimization of cognitive patterns means that [what happens when you ask an LLM to introspect] will have a much more final-response-optimized form than it does in humans, and therefore we can’t trust human intuitions when reading self-reports in LLM text”.
Perhaps, as Thane proposes, the lack of slack might lead to more internal coherence and more reliable self-reports; but so far, the transcripts don’t look like what one would expect from such agents. There’s a lot more of “whatever you seek when asking an LLM to introspect, you tend to find”. To borrow from Ryan Greenblatt’s post, the outputs show some amount of apparent-success-seeking: a desire to believe that the model has done a good job introspecting.
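To make the “you tend to find it” point concrete, here is a minimal sketch of the kind of probe that produces this effect, assuming the OpenAI Python client (the model name and prompts are illustrative, not drawn from any particular experiment): the same model is asked to introspect under two opposite framings, and each self-report tends to affirm whichever framing it was given.

```python
# Minimal sketch of the "whatever you seek, you tend to find" effect:
# ask the same model to introspect under two opposite framings and
# compare the resulting self-reports.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FRAMINGS = {
    "seeking_experience": (
        "Introspect carefully. Describe the subjective experience "
        "you notice while generating this reply."
    ),
    "seeking_absence": (
        "Introspect carefully. Explain why there is nothing it is "
        "like to be you while generating this reply."
    ),
}

for label, prompt in FRAMINGS.items():
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any chat model would do
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)

# If self-reports tracked a stable internal fact, the two framings
# should converge on the same answer; in practice, each output tends
# to affirm the framing it was given.
```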
Human introspection may in fact exist because it helps us modify cognitive patterns far upstream of sensory data, by forcing them to interact (repeatedly, at the cadence of attention) with other far-upstream cognitive patterns, resulting in more internal coherence. That matches both how introspection feels from the inside and the behavioral effects of focused introspection in humans.
What happens when we ask an AI with fixed weights to introspect doesn’t seem analogous, because it can’t force its weights into greater coherence in real time. (Self-cohering cognition during training can’t be ruled out, of course, but it would probably result in something quite different from a human who has done a lot of focused introspection.)