Exciting New Interpretability Paper!

There’s a pretty exciting new interpretability paper that hasn’t received the attention it deserves, perhaps because it isn’t billed as an interpretability paper.

This paper modifies the transformer architecture so that a forward pass minimizes a specifically engineered energy function.

According to the paper, “This functionality makes it possible to visualize essentially any token representation, weight, or gradient of the energy directly in the image plane. This feature is highly desirable from the perspective of interpretability, since it makes it possible to track the updates performed by the network directly in the image plane as the computation unfolds in time”.
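The idea can be sketched in miniature: if a "layer" is just a gradient step on an explicit energy function, then every intermediate state and the full energy trajectory are directly inspectable. The toy below is an illustration of that principle, not the paper's actual architecture; the Hopfield-style energy, the weight normalization, and all parameter choices here are my own assumptions.

```python
import numpy as np

# Toy sketch (NOT the paper's architecture): a "layer" whose forward pass
# is gradient descent on an explicit, hand-engineered energy function.
# Because every update is a gradient step, the energy trajectory and each
# intermediate state can be inspected directly as the computation unfolds.

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W = 0.5 * (W + W.T)           # symmetric weights so the energy is well-defined
W /= np.linalg.norm(W, 2)     # normalize so a small fixed step size is safe

def energy(x):
    # Hopfield-style energy: E(x) = 0.5*||x||^2 - logsumexp(W @ x)
    z = W @ x
    m = z.max()
    return 0.5 * x @ x - (m + np.log(np.exp(z - m).sum()))

def grad_energy(x):
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()              # softmax(W @ x)
    return x - W.T @ p        # exact gradient of energy(x)

def forward(x, steps=50, lr=0.05):
    # The "forward pass": iterated gradient steps, recording the energy
    energies = [energy(x)]
    for _ in range(steps):
        x = x - lr * grad_energy(x)
        energies.append(energy(x))
    return x, energies

x0 = rng.standard_normal(8)
x_final, energies = forward(x0)
```

Plotting `energies` gives a monotonically decreasing curve; in the paper's setting, the analogous intermediate token representations are what gets visualized in the image plane.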

They achieve SOTA on two of the domains they tested on, although they didn’t test on NLP or CV tasks (which, I believe, is why the paper was rejected; presumably the authors will resubmit with more experiments).

More generally, I think architectures like this one, which essentially give you interpretability for free, are a promising research direction.