I think there are achievable alignment paths that don’t flow through precise mechanistic interpretability. I should write about some of them.
Please do! I am very interested in this sort of thinking. Is there preexisting work you know of that runs along the lines of what you think could work?
Please do! I am very interested in this sort of thinking. Is there preexisting work you know of that runs along the lines of what you think could work?