I am not that excited about marginal interpretability research, but I have nevertheless linked to this a few times. I think this post both clarifies a number of inroads into making marginal interpretability progress and maps out how long the journey is between where we are now and the many important targets for using interpretability methods to reduce AI x-risk.
Separately, despite my personal sense that marginal interpretability research is not a great use of most researchers' time, a lot of people are trying to get started on AI Alignment work via interpretability research, and I think this kind of resource is very valuable for them. It also helps connect interp work to specific risk models, which is something I wish existed more on the margin in interpretability work.