I may come back to comment more or incorporate this post into something else I write but wanted to record my initial reaction which is that I basically believe the claim. I also think that the ‘unrelated bonus reason’ at the end is potentially important and probably deserves more thought.
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood’s interpretability approach here, another example of “recruiting resources outside of the model alone”.
(however, it doesn’t seem obvious to me that interpretability can’t or won’t work in such settings)
It could work if interpretability lets you effectively prohibit this from happening before it's too late. Otherwise, it doesn't seem like it would work.