Is there a list of projects to get started with Interpretability?

A big part of alignment outreach is trying to draw CS students into the field, which feels especially relevant given the EAGx conferences happening around the world right now. There are endless courses that teach basic skills, loads of research agendas, lists of specific open questions, and posts giving a high-level overview of the field.

Once you have gotten a general overview, it naturally becomes interesting to learn about something very specific by actively working on a technical project. However, I feel there's a lack of well-formulated problems, such as this one from Buck Shlegeris, that one can just get started with.

More specifically: we are a group of three CS undergrads (3rd and 4th year) who have covered some of the big-picture AI alignment material. We could probably go much deeper and broader into the general alignment resources by passively reading, but for now we want to dive into interpretability research, not by reading more, but by actively implementing something and getting started hands-on.

Does anyone have a list of projects like Buck's (maybe a bit longer or more focused on interpretability), aimed mainly at providing a good learning experience? Or can you point me to some training projects? I think this kind of list could be helpful for lots of students interested in alignment research, and it's unfortunate that Effective Thesis doesn't feature anything like it.