Mechanistic Interpretability for the MLP Layers (rough early thoughts)


This post links to a video I just made in response to the new work coming out of Anthropic, described here and here, and discussed on LessWrong here.

In the video, I try to puzzle through how best to think about the MLP layers of a transformer, in the same spirit in which Anthropic is thinking through the self-attention layers.