Nice post! I think these are good criticisms, but they don’t justify the title. Points 1 through 4 are all (specific, plausible) examples of ways we may interpret the activation space incorrectly. This is worth keeping in mind, and I agree that just looking at the activation space of a single layer isn’t enough, but it still seems like a very good place to start.
A layer’s activation is a relatively simple space, constructed by the model, that contains all the information the model needs to make its prediction. This makes it a great place to look if you’re trying to understand how the model is thinking.
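To make the "look at one layer's activations" idea concrete, here's a minimal sketch with a toy NumPy MLP (the architecture, dimensions, and names are all illustrative, not from the post): the hidden activations form a single space that everything downstream is a function of, so inspecting them is inspecting the model's intermediate representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP; the hidden layer's activations are the
# "activation space" one would inspect for interpretability.
W1 = rng.normal(size=(8, 4))   # input dim 8 -> hidden dim 4
W2 = rng.normal(size=(4, 2))   # hidden dim 4 -> output dim 2

def forward(x, capture):
    h = np.maximum(x @ W1, 0.0)  # hidden activations (ReLU)
    capture.append(h)            # record them for later inspection
    return h @ W2

acts = []
x = rng.normal(size=(16, 8))     # a batch of 16 inputs
logits = forward(x, acts)

# The captured activations are lower-dimensional than the input,
# yet the prediction (logits) is a function of them alone:
assert np.allclose(logits, acts[0] @ W2)
```

The point of the sketch is the final assertion: whatever the model "knows" at prediction time must pass through that one array, which is what makes it a natural object of study, even granting the post's caveats about misreading it.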