Thanks for writing this post! I do think some people are pursuing interp for this “wrong reason” of trying to prevent scheming, and the road where interp improves enough to make that work seems unlikely (understanding general circuits doesn’t seem impossible to me, but it is extremely hard, and a nonzero number of people are working on this).
I think that perhaps the mistake comes from mistaking the simplicity of the optimizer for a property of the mesa-optimizer. SGD by backprop is one algorithm so people put a single label, “deep learning,” on all models it produces. But there is no reason that all of these models must use similar circuits. They may all use an array of unique fantastically complex circuits. Understanding every circuit that can be produced by SGD at once is not a cohesive research program, and it is not a plan that will succeed.
There would be reasons to believe that models are going to use similar algorithms if they use similar training data. Understanding every circuit that SGD could possibly produce given “infinite training data” seems intractable, but “in practice” I’d expect the different algorithms produced by SGD to form modular structures with common “motifs”, just like evolution does. Evolutionary developmental biology is indeed a field (one that, just like interp, seems more bottlenecked on better theory than on measurement capabilities). It’s why I am still excited about developmental interpretability, even though I don’t have a coherent plan for how it will help us with safety beyond “more theory and foundations seems nice” (with the general caveat that I am very confused about capability externalities, but this seems kind of unavoidable for actually broad insights).
I am in the waiting room at the doctor and the ~1-year-old child next to me is scrolling through YouTube Shorts on his mother’s phone, with the mom watching along. Incredibly incoherent AI slop. I’m not an expert on early development, but this seems very suboptimal. I know that scrolling is not good for my reward system, but a 1-year-old? At least put some random video on instead of letting the child scroll? If I were in charge at Google and living by “don’t be evil”, I would maybe make a classifier to identify children scrolling like this and give the parent a reminder once in a while suggesting some alternative activities?