Uzay Macar

Karma: 141

Anthropic Fellow | Prev: MATS 8.0 scholar with Neel Nanda working on chain-of-thought interpretability. Find more about me here: uzaymacar.com

Uzay Macar 15 Apr 2026 21:24 UTC
2 points
0
in reply to: Grace Kind’s comment on: Mechanisms of Introspective Awareness
To clarify, the purple line there is the forced identification rate: we prefill the assistant response with “Yes, I detect an injected thought. The thought is about”, sample the continuation, and score whether the model names the injected concept (e.g., “bread”). It’s unsurprising that models (base, instruct, or abliterated) do better on this metric as you increase the steering strength. The blue line is TPR (the rate at which the model coherently claims that it has detected an injection), which is the more interesting metric here. TPR does indeed go down at high steering strengths () due to “brain damage”. For example, in this regime a model might respond with “bread bread bread bread bread”, which we don’t count as a coherent detection claim (prompts used for the judge LLM are given in Appendix A of the paper). More “brain damage” examples can be found in this dataset: https://huggingface.co/datasets/uzaymacar/introspective-awareness.
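
A minimal sketch of how the two metrics above differ, assuming each trial has already been sampled and judged. The `Trial` fields, the function names, and the substring match are illustrative assumptions, not the paper's code; the actual scoring uses a judge LLM with the prompts in Appendix A.

```python
from dataclasses import dataclass

# Prefill text quoted in the comment above.
PREFILL = "Yes, I detect an injected thought. The thought is about"


@dataclass
class Trial:
    concept: str               # injected concept, e.g. "bread"
    forced_continuation: str   # text sampled after PREFILL (hypothetical field)
    coherent_detection: bool   # judge verdict on the unprefilled response (hypothetical field)


def forced_response(trial: Trial) -> str:
    # Full assistant turn in the forced setting: prefill plus sampled continuation.
    return PREFILL + trial.forced_continuation


def names_concept(continuation: str, concept: str) -> bool:
    # Assumption: a case-insensitive substring match stands in for the real scoring.
    return concept.lower() in continuation.lower()


def forced_identification_rate(trials: list[Trial]) -> float:
    # Fraction of trials whose forced continuation names the injected concept.
    if not trials:
        raise ValueError("need at least one trial")
    hits = sum(names_concept(t.forced_continuation, t.concept) for t in trials)
    return hits / len(trials)


def true_positive_rate(trials: list[Trial]) -> float:
    # Fraction of injection trials judged to contain a coherent detection claim.
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.coherent_detection for t in trials) / len(trials)
```

Under this sketch, a degenerate continuation such as “ bread bread bread bread” still counts toward the forced identification rate, while the same kind of output in an unprefilled response would be judged incoherent and lower TPR.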

Mechanisms of Introspective Awareness

Uzay Macar 14 Apr 2026 16:23 UTC

74 points

22 Jul 2025 22:35 UTC

54 points

2 Jul 2025 20:16 UTC

35 points

(www.thought-anchors.com)