Mark Nelson comments on Interpretability Will Not Reliably Find Deceptive AI

Mark Nelson 7 May 2025 16:59 UTC
Thank you, Neel! I really appreciate the encouraging reply. Your point about needing a larger toolbox resonated particularly strongly with me. Mine is limited. I don’t have access to Anthropic’s circuit tracing (I desperately wish I did!) and I am not yet ready to try sampling logits or attention weights myself. So I’ve been trying to understand how far I can reasonably go with black-box testing, using repetition across models, temperatures, and prompts. For now I’m focusing solely on analyzing the responses themselves, despite that limitation. I really do need your portfolio of imperfect defenses (though in my mind this toolbox is more than just defenses!). If you built it, I would use it.