Arthur Conmy comments on “Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability

Arthur Conmy 2 Nov 2022 17:34 UTC
1 point
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood’s interpretability approach here, another example of “recruiting resources outside of the model alone”.

(However, it doesn’t seem obvious to me that interpretability can’t or won’t work in such settings.)
- David Scott Krueger (formerly: capybaralet) 3 Nov 2022 19:24 UTC
  LW: 2 AF: 1
  It could work if you can use interpretability to effectively prohibit this from happening before it is too late. Otherwise, it doesn’t seem like it would work.