Nick Merrill

Karma: 18

Research at the Forecasting Research Institute. Previously U.C. Berkeley Center for Long-Term Cybersecurity. I’m interested in interpretability, particularly introspection and introspective access. https://else.how

Nick Merrill 6 Jul 2026 16:36 UTC
1 point
1
in reply to: agastyasridharan’s comment on: Emergent introspection does not replicate on Llama-3.1-405B
I like the entropy+leakage account! But, with no Anthropic model in the sample, that result can’t speak to the post-training claim. The decisive experiment would be your mismatch test on Claude… but only someone at Anthropic could run it.

Defeating Introspection Adapters (and Why Threat Models Matter)

Nick Merrill and zekem

4 Jun 2026 18:39 UTC

10 points

0 comments · 5 min read · LW link

Nick Merrill 4 Jun 2026 0:08 UTC
3 points
−1
on: Leveraging Introspection for Alignment
One outstanding question I’ve had about introspection is: what part of the model is doing the introspection? In humans, this might be the prefrontal cortex.
In the extreme case (in which models can introspect arbitrarily well), this component is the only part of the model that needs to be aligned: through introspection, it can clean up the rest. In other words, good introspection could reduce the ‘surface area’ for alignment considerably.

Emergent introspection does not replicate on Llama-3.1-405B

Nick Merrill 11 May 2026 4:05 UTC

9 points

3 comments · 6 min read · LW link