Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood’s interpretability approach here, another example of “recruiting resources outside of the model alone”.
(however, it doesn’t seem obvious to me that interpretability can’t or won’t work in such settings)
It could work if you can use interpretability to effectively prohibit this from happening before it is too late. Otherwise, it doesn’t seem like it would work.
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood’s interpretability approach here, another example of “recruiting resources outside of the model alone”.
(however, it doesn’t seem obvious to me that interpretability can’t or won’t work in such settings)
It could work if you can use interpretability to effectively prohibit this from happening before it is too late. Otherwise, it doesn’t seem like it would work.