Daniel Murfet comments on Against Almost Every Theory of Impact of Interpretability

Daniel Murfet 18 Aug 2023 9:20 UTC
10 points
5
Induction heads? Ok, we are maybe on track to retro engineer the mechanism of regex in LLMs. Cool.
This dramatically undersells the potential impact of Olsson et al. You can’t dismiss modus ponens as “just regex”. That’s the heart of logic!
For many, the argument that AI safety is an urgent concern involves a belief that current systems are, in some rough sense, reasoning, and that this capability will increase with scale, leading to beyond-human-level intelligence within a timespan of decades. Many smart outsiders remain sceptical, because they are not convinced that anything like reasoning is taking place.
I view Olsson et al. as nontrivial evidence that internal computations resembling reasoning emerge with increasing scale. That’s profound. If that case is made stronger over time by interpretability (as I expect it to be), the scientific, philosophical and societal impact will be immense.