I roughly agree with the case made here, because I expect interpretability research to be much, much harder than others seem to appreciate. This is a consequence of strong intuitions from working on circuit complexity. Figuring out the behavior of a general circuit sounds like it's in a very hard complexity class; even writing down the truth table for a circuit takes time exponential in the number of inputs! I would be surprised if coming up with a human-interpretable explanation of subcircuits is easy. There are some reasons to believe that SGD will usually produce simple circuits, so some success in the average case is possible (see recent work by Ard Louis), but it would be pretty shocking if the full problem had a solution fast enough to run on the huge transformer circuits we are dealing with.
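To make the exponential-blowup point concrete, here is a minimal sketch (the circuit and helper are made up for illustration): exhaustively characterizing even a toy boolean circuit means enumerating all 2^n input assignments, and the table doubles with every extra input.

```python
from itertools import product

def truth_table(circuit, n_inputs):
    """Enumerate the circuit on all 2**n_inputs input assignments."""
    return {bits: circuit(*bits) for bits in product((0, 1), repeat=n_inputs)}

# Toy 3-input circuit: (a AND b) XOR c
table = truth_table(lambda a, b, c: (a & b) ^ c, 3)
assert len(table) == 2 ** 3  # 8 rows; one more input would make it 16
```

At n = 3 this is trivial, but a transformer's "circuit" has thousands of inputs, and 2^1000 rows is far beyond anything enumerable, which is why brute-force characterization is off the table and only average-case structure offers hope.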
I outlined this position (and pointed out that there is some hope of at least understanding some individual circuits and learning about intelligence) here: https://www.lesswrong.com/posts/RTmFpgEvDdZMLsFev/mechanistic-interpretability-is-being-pursued-for-the-wrong
(Not my best writing though)