Sheikh Abdur Raheem Ali comments on Mechanisms of Introspective Awareness

Sheikh Abdur Raheem Ali 25 Apr 2026 5:01 UTC
1 point
0
I see, thank you for sharing these details; they help clarify your original comment. I’m not certain I follow how it links to this piece, or to the related work that you cite. I read that Anthropic post when it was released and worked on some follow-up experiments whose results I never properly wrote up, so you may assume I have some familiarity with it, but I don’t have a thorough understanding, and I’m still working through the paper linked to this post as well, so feel free to quote specific sections I might have missed that make the bridge clearer.
From memory I recall there is a section on optical illusions where certain tokens common in code, such as @@, can fool the model. But it is well known that minor differences in punctuation can lead to an entirely different tokenization of the input sequence, and affect task performance in surprising ways. The Llama 3 tokenizer is particularly cursed (see https://github.com/belladoreai/llama3-tokenizer-js/blob/master/src/llama3-tokenizer.js), but this has been true for LLMs since GPT-2 or even earlier, with the caveat that more capable models tend to treat semantically equivalent sequences in more consistent ways (I don’t have a citation on hand for this).
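To make the tokenization point concrete, here is a minimal sketch using a toy greedy longest-match tokenizer. The vocabulary and the `toy_tokenize` helper are made up for illustration; real tokenizers such as those of GPT-2 or Llama 3 use learned BPE merges and pre-tokenization rules, so their actual segmentations will differ.

```python
# Hypothetical vocabulary, NOT taken from any real tokenizer.
TOY_VOCAB = {"Are", " you", " sure", " sure?", "..."}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation with single-character fallback."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


print(toy_tokenize("Are you sure?"))
print(toy_tokenize("Are you sure..."))
```

In this toy setup the final punctuation changes which token covers the word before it (`" sure?"` versus `" sure"` followed by `"..."`), so the difference is not confined to the last position.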
However, the change you describe (swapping an ellipsis for a question mark) does change the meaning of a sentence in a way that appending a newline or prepending a BOS token wouldn’t, so I think that it is expected behavior for a model to pick up on the break in the pattern on that turn (and for this to be detectable using white-box methods). How does this relate to the weak evidence carrier features in the introspection circuit, or to the boundary detector heads and the twisting of the characters-remaining/line-position manifold from the linebreaks paper?
- flatstats 26 Apr 2026 19:56 UTC
  1 point
  0
  Okay yeah I agree punctuation matters semantically and that some model behavior is expected. The reason I don’t think it’s strictly that is that, in at least one of my project’s runs, the visible response at turn index 10 did not change at all, while the internal attention/geometry deltas still spiked. The behavioral difference only surfaced at turn index 11. So this would mean completions that are equivalent at the surface level can still carry state divergence forward.
  This is why I am comparing it to the “When Models Manipulate Manifolds” paper: it’s not that question marks or ellipses are exactly like newlines in content, it’s that both are structural markers that can trigger internal state updates. That paper gives a mechanistic example where heads route boundary information through geometry.
  And how it connects to “Mechanisms of Introspective Awareness” is that their evidence carriers/gates show how weak distributed internal signals can be present before or apart from straightforward output behavior.
  So it makes me think of these heads as candidate upstream routers rather than the final detector. The introspection paper’s detection mechanism is mostly localized to distributed mid-to-late MLP computation, with weak evidence carriers and gate features. If punctuation and discourse transitions consistently recruit the same mid-to-late attention families, those heads may function as stable upstream routers for structural state information. Their outputs could then feed into later MLP features that integrate weak signals into reportable evidence.
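The observation above (identical visible output on one turn, internal deltas spiking anyway) could be checked with something like the following sketch. Everything here is an assumption for illustration: the function names, the choice of cosine distance as the delta metric, the `threshold` value, and the toy vectors, which are invented placeholders and not results from any run.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def silent_divergence_turns(base_outputs, pert_outputs,
                            base_states, pert_states, threshold=0.1):
    """Return turn indices where the visible outputs match exactly
    but the internal-state distance exceeds `threshold`."""
    flagged = []
    turns = zip(base_outputs, pert_outputs, base_states, pert_states)
    for turn, (b_out, p_out, b_state, p_state) in enumerate(turns):
        if b_out == p_out and cosine_distance(b_state, p_state) > threshold:
            flagged.append(turn)
    return flagged


# Toy placeholder data: one summary vector per turn for each run.
base_outputs = ["ok", "ok", "yes"]
pert_outputs = ["ok", "ok", "no"]
base_states = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
pert_states = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

print(silent_divergence_turns(base_outputs, pert_outputs,
                              base_states, pert_states))
```

With these placeholder inputs only the middle turn is flagged: the first turn has matching outputs and matching states, and the last turn diverges visibly, so it is not "silent". The threshold is arbitrary and would need calibrating against run-to-run noise before the flag means anything.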