I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I’d like to collaborate with.
Website: https://jacquesthibodeau.com
Twitter: https://twitter.com/JacquesThibs
GitHub: https://github.com/JayThibs
Hey Bogdan, I’d be interested in doing a project on this or at least putting together a proposal we can share to get funding.
I’ve been brainstorming new directions with @Quintin Pope this past week. We think it would be good to use or develop some automated interpretability techniques and then apply them to a set of model interventions (e.g. L1 regularization) to see which of those interventions actually improve model interpretability.
I saw the MAIA paper, too; I’d like to look into it some more.
Anyway, here’s a related blurb I wrote:
Whether this works or not, I’d be interested in making more progress on automated interpretability in ways similar to what you are proposing.