I left some comments here. My overall takeaway (as someone who hasn’t worked in the area but cares about alignment) is that I’m very excited about this kind of interpretability research and especially work that focuses on small parts of models without worrying too much about scalability. It seems like interpretability could provide compelling warnings about future risks, is a big part of many of the existing concrete stories for how we might get aligned AI, and is reasonably likely to be helpful in unpredictable ways.
I left some comments here. My overall takeaway (as someone who hasn’t worked in the area but cares about alignment) is that I’m very excited about this kind of interpretability research and especially work that focuses on small parts of models without worrying too much about scalability. It seems like interpretability could provide compelling warnings about future risks, is a big part of many of the existing concrete stories for how we might get aligned AI, and is reasonably likely to be helpful in unpredictable ways.