Anthropic Fellow | Prev: MATS 8.0 scholar with Neel Nanda working on chain-of-thought interpretability. Find more about me here: uzaymacar.com
Uzay Macar
Karma: 141
Mechanisms of Introspective Awareness
Unfaithful chain-of-thought as nudged reasoning
Thought Anchors: Which LLM Reasoning Steps Matter?
Hi, I joined a few days ago and I’m looking forward to contributing to this great community.
I’m transitioning back to research from startups. Currently based in London.
I’m particularly interested in mechanistic interpretability, chain-of-thought monitoring, and reasoning model interpretability. I’m excited to engage with the thoughtful discussions here on alignment and to collaborate with others.
To clarify, the purple line there is the forced identification rate: we prefill the assistant response with “Yes, I detect an injected thought. The thought is about”, sample the continuation, and score whether the model names the injected concept (e.g., “bread”). It’s unsurprising that models (base, instruct, or abliterated) do better on this metric as you increase the steering strength. The blue line is TPR (the rate at which the model coherently claims that it has detected an injection), which is the more interesting metric here. TPR does go down at high steering strengths, and this is indeed due to “brain damage”. For example, in this regime a model might respond with “bread bread bread bread bread”, which we don’t count as a coherent detection claim (prompts used for the judge LLM are given in Appendix A of the paper). Some more “brain damage” examples can be found in this dataset: https://huggingface.co/datasets/uzaymacar/introspective-awareness.
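To make the difference between the two metrics concrete, here is a minimal sketch of how they could be computed per steering strength. This is not the paper's code: the `Trial` record, its field names, and the `rates_by_strength` helper are hypothetical, and the substring check in `names_concept` is only a crude stand-in for the judge LLM.

```python
from collections import defaultdict
from dataclasses import dataclass

# Prefill text quoted from the comment above.
PREFILL = "Yes, I detect an injected thought. The thought is about"


@dataclass
class Trial:
    """One injected-concept trial (hypothetical record layout)."""
    strength: float            # steering strength used for the injection
    concept: str               # injected concept, e.g. "bread"
    coherent_detection: bool   # judge label: model coherently claimed a detection
    forced_continuation: str   # text sampled after PREFILL


def names_concept(continuation: str, concept: str) -> bool:
    # Assumption: case-insensitive substring match instead of a judge LLM.
    return concept.lower() in continuation.lower()


def rates_by_strength(trials):
    """Return {strength: {"tpr": ..., "forced_id_rate": ...}}."""
    grouped = defaultdict(list)
    for trial in trials:
        grouped[trial.strength].append(trial)

    rates = {}
    for strength, group in sorted(grouped.items()):
        n = len(group)
        # TPR: share of injected trials with a coherent detection claim.
        tpr = sum(t.coherent_detection for t in group) / n
        # Forced identification: share of prefilled continuations naming the concept.
        forced = sum(
            names_concept(t.forced_continuation, t.concept) for t in group
        ) / n
        rates[strength] = {"tpr": tpr, "forced_id_rate": forced}
    return rates
```

Under this scoring, a degenerate output such as “bread bread bread” still counts toward forced identification because it names the concept, while the judge label for a coherent detection claim would be false. That is one way the two curves can move in opposite directions at high steering strength.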