
Nick Merrill

Karma: 15

Research at the Forecasting Research Institute. Previously U.C. Berkeley Center for Long-Term Cybersecurity. I’m interested in interpretability, particularly introspection and introspective access. https://else.how

Defeating Introspection Adapters (and Why Threat Models Matter)

4 Jun 2026 18:39 UTC
9 points
0 comments, 5 min read

Emergent introspection does not replicate on Llama-3.1-405B

Nick Merrill, 11 May 2026 4:05 UTC
9 points
0 comments, 6 min read