I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the ‘unrelated bonus reason’ at the end is potentially important and probably deserves more thought.
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood’s interpretability approach here, another example of “recruiting resources outside of the model alone”.
(however, it doesn’t seem obvious to me that interpretability can’t or won’t work in such settings)
It could work if interpretability lets you effectively prohibit this from happening before it's too late. Otherwise, it doesn't seem like it would work.