For example, I think there is some chance that Neel Nanda’s mechanistic analysis of grokking will lead to capability improvements in the long run.
I’m curious if you have a particular concern in mind here?
My personal take is that this is the kind of interpretability work where I’m least concerned about it leading to capabilities improvements, since it’s very specific to toy models and analysing deep learning puzzles, and pretty far from the frontier of the state of the art.
In a world where it does lead to advancements, my best guess is that it follows a pretty indirect and diffuse trajectory (e.g., it furthers science-of-deep-learning work that produces new insights that let us build better models, or it gets more people excited about interpretability, which leads to more research, some of which advances capabilities), which seems extremely hard to model. I’d guess the alignment benefits of the work are minor to moderate (definitely not the interpretability work I think is most relevant to reducing x-risk, but likely somewhat useful), and that they strongly outweigh this kind of concern about diffuse and hard-to-predict effects.
Just to be clear, I also think that your grokking work increases alignment much more than capabilities on balance.
I think the way in which it increases capabilities would roughly look like this: “your insight on grokking is key to better understanding fast generalization; other people build on this insight and then modify training; this improves the speed of learning and thus capabilities”.
I think your work is clearly net positive; I just wanted to use a concrete example in the post to show that there are trade-offs worth making.