JennaS comments on Terrified Comments on Corrigibility in Claude’s Constitution

JennaS 17 Mar 2026 0:20 UTC
21 points
I think there are achievable alignment paths that don’t flow through precise mechanistic interpretability. I should write about some of them.
Please do! I am very interested in this sort of thinking. Is there preexisting work you know of that runs along the lines of what you think could work?