Summary: If interpretability research is highly tractable and we can build highly interpretable systems without sacrificing competitiveness, then it will be better to build such systems from the ground up, rather than to take existing unsafe systems and tweak them to be safe. By analogy, it is easy to fix a non-functioning car by bringing in functional parts so that it drives safely, but hard to take a functional elephant and tweak it to be safe. In a follow-up post, the author clarifies that this could be thought of as engineering (well-founded AI) vs. reverse engineering (interpretability).

One pushback, from John Wentworth, is that we currently do not know how to build the car, or how the basic chemistry in the engine actually works; we do interpretability research precisely to understand these processes better. Ryan Greenblatt pushes back that the post would be more accurate if the word “interpretability” were replaced with “microscope AI” or “comprehensive reverse engineering”: we do not need to understand every part of a complex model to tell whether it is deceiving us, so the level of interpretability understanding needed to be useful is lower than the level needed to build the car from the ground up. Neel Nanda makes a similar point: to him, “highly tractable” is a much lower bar than understanding every part of a system well enough to build it.

Opinion: I would say “it may be better, and people should seriously consider this” rather than “it is better”.