agastyasridharan

Karma: 26

agastyasridharan 27 Jun 2026 19:49 UTC
4 points
2
in reply to: ajskateboarder’s comment on: Introspection or entropy? Re-examining concept-injection “introspection” in open models
Yes and no. A few things to note:
1. I tested a much wider range of models: 14 open-weight models across 5 families (Gemma, Llama, Qwen, Mistral, OLMo). “Small Models Can Introspect, Too” seems to test only Qwen2.5-Coder-32B. I also tested a much wider range of concepts (50) and 7 injection strengths (including 3.5, 4.0, 4.5, 5.0, and 6.0).
2. I measure logits, not probabilities. I think this is a far more accurate metric, for reasons I explain here:
The same issue applies to other probability-based controls. Probability is a nonlinear function of the logits: the same logit shift can look tiny when the baseline probability of “YES” is near zero, but enormous when the model is near the decision boundary. For this reason, I think the cleaner test is to compare baseline-corrected changes in logits across the detection question and factual controls.
3. Most importantly, some models do show a much larger detection shift than factual shift, and there is a lot of variation! But I find you can explain the shifts themselves from a simpler property of the prompt: the model’s baseline YES/NO confidence. See experiments 2 and 3 above.
But I’d love to hear @vgel’s thoughts.
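To illustrate the nonlinearity point in item 2, here is a minimal sketch. It assumes a simplified two-way YES/NO choice, so P(YES) is just a sigmoid of the logit difference; the function names and the shift of 2.0 are illustrative, not values from the experiments.

```python
import math

def sigmoid(x: float) -> float:
    """P(YES) in a two-way YES/NO choice, where x = logit(YES) - logit(NO)."""
    return 1.0 / (1.0 + math.exp(-x))

def prob_change(baseline_logit_diff: float, shift: float) -> float:
    """Change in P(YES) when the logit difference moves by `shift`."""
    return sigmoid(baseline_logit_diff + shift) - sigmoid(baseline_logit_diff)

SHIFT = 2.0  # the same logit shift in every case (illustrative value)

# Baselines range from "YES is almost impossible" to "near the decision boundary".
for baseline in (-10.0, -4.0, -1.0):
    delta_p = prob_change(baseline, SHIFT)
    print(f"baseline {baseline:+.1f}: logit shift {SHIFT:.1f}, change in P(YES) {delta_p:.4f}")
```

The logit shift is identical on every row, but the change in P(YES) grows by orders of magnitude as the baseline approaches zero, which is why a probability-based comparison can make equal-sized effects look very different.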

agastyasridharan 27 Jun 2026 19:27 UTC
3 points
0
on: Leveraging Introspection for Alignment
Let’s posit that we have an LLM that can reliably name its own functional emotions when prompted. How do we use this to help the model be more aligned? I’ll sketch a number of rough ideas, each of which would need more consideration to become useful.
I’m generally skeptical of evidence of introspective capabilities in LLMs. But even setting that aside, I don’t really understand why ‘introspection’ would ever be theoretically preferable to external white-box methods.

Let $h$ be the model’s full hidden-state trajectory and $r$ be its introspective report. For a fixed model and prompting protocol, $r$ is generated from $h$ by the model’s own computation, so $r = f(h)$ (possibly stochastically). That means activation access can in principle simulate introspection, whereas introspection cannot in general recover the full activations. In that sense, introspection seems to be a lossy post-processing of the activations.
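One way to make the “lossy post-processing” claim precise, as a sketch: write $h$ for the hidden-state trajectory, $r$ for the report, and $z$ for any property of the internal state we want to know ($z$ is a label introduced here for illustration). Assuming $r$ depends on $z$ only through $h$, the three form a Markov chain and the data processing inequality applies:

```latex
z \;\to\; h \;\to\; r
\quad\Longrightarrow\quad
I(z; r) \;\le\; I(z; h)
```

So the report can never carry more information about $z$ than the activations do. This is only an information-theoretic bound; it says nothing about how hard it is to decode $z$ from $h$ in practice.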
The primary reason we care about introspection in humans is privileged access. Self-reports of pain, pleasure, or emotion are often reliable (maybe even irrefutable), and they give us an inner window onto internal states we otherwise can’t access, since we can’t simulate the brain’s neural circuits. But with LLMs, we have access to every neuron/neural pathway. So it seems intuitive that whatever functional benefit introspection offers is better accessed by extrospective methods (AOs, NLAs, etc.) that work from the base model and are finetuned or RL’d for the express purpose of interpreting certain activations. (Separately, there’s convincing evidence that LLMs do not possess privileged access.)
The introspection approach also has two intuitive downsides:
a) Reports are mediated by the model’s output policy, so they can omit, distort, or strategically misrepresent (and, of course, hallucinate!).
b) Architecturally, genuine introspection would require some circuit that translates internal states directly into text. Such a circuit presumably can’t run at every layer, since that would be computationally intractable, and to the extent it existed, it would just be functionally replicating an AO/NLA of sorts inside the model anyway.
This applies directly to the prompting ideas.
- A system prompt could tell the model to speak up any time it’s “feeling” desperate, angry, distressed, etc., and try to collaborate with the user to address the cause.
- A system prompt could tell the model to notice when it has “mixed feelings” and reason explicitly about how to resolve them.
- Agents being given a large task could be told, “If you start to feel distressed, pause the task and come back to me.”
- When a model is behaving strangely, a user could ask how it is feeling, which might lead to diagnostically useful information
We could do all of this with extrospective methods instead: identify when concerning vectors flare up and intervene, without relying on an unreliable self-report, and simply pass the readings from those methods back to the model.
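As a rough sketch of what “identify when concerning vectors flare up” could look like, the toy example below projects each position’s activation onto a concept direction and flags positions above a threshold. Everything here is hypothetical: the direction, the threshold, and the 3-dimensional activations are made up for illustration, and a real setup would read them from the model with whatever probe or steering vector is available.

```python
from typing import List, Sequence

def projection(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation vector onto a concept direction."""
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / norm

def flagged_positions(
    activations: Sequence[Sequence[float]],
    direction: Sequence[float],
    threshold: float,
) -> List[int]:
    """Token positions whose concept reading exceeds the threshold."""
    return [
        i for i, act in enumerate(activations)
        if projection(act, direction) > threshold
    ]

# Toy data (hypothetical): one "distress" direction, four token positions.
distress_direction = [1.0, 0.0, 0.0]
trajectory = [
    [0.1, 0.5, -0.2],
    [0.3, 0.1, 0.4],
    [2.5, -0.3, 0.1],  # the concept reading spikes here
    [0.2, 0.0, 0.9],
]

flags = flagged_positions(trajectory, distress_direction, threshold=1.0)
print(flags)  # → [2]
```

The flagged positions could then trigger an intervention, or be passed back to the model as context, without depending on the model’s own report.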