Neel Nanda comments on Interpreting Neural Networks through the Polytope Lens

Neel Nanda 28 Sep 2022 8:33 UTC
Thanks! So, I was trying to disentangle two claims. The first is “if examples are semantically similar (either similar patches of images, or words in similar contexts re predicting the next token), the model learns to map their full representations to be close to each other”. The second is “if we pick a specific direction, the projection onto this direction is polysemantic, but the direction actually intersects many meaningful polytopes, and if we cluster the projections onto this direction (converting examples to scalars) we get clusters in this 1D space, and each cluster is clearly meaningful”. IMO the first is pretty intuitive, while the second would be surprising and strong evidence for the polytope hypothesis. If I’m understanding correctly, you’re presenting the first one here? Did you investigate the second at all? I’d love to see the results!
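To make the second claim concrete, here is a rough sketch of the kind of check I have in mind. Everything in it is an assumption for illustration: the activations are synthetic rather than taken from a real model, the helper names `project_onto_direction` and `cluster_1d_by_gaps` are hypothetical, and splitting on large gaps is just one simple way to cluster scalars, not a method from the post.

```python
import numpy as np

def project_onto_direction(activations, direction):
    """Project each activation vector onto the unit-normalised direction (one scalar per example)."""
    direction = np.asarray(direction, dtype=float)
    unit = direction / np.linalg.norm(direction)
    return np.asarray(activations, dtype=float) @ unit

def cluster_1d_by_gaps(values, min_gap):
    """Label scalars by splitting the sorted values wherever neighbours differ by more than min_gap."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    gaps = np.diff(values[order])
    # A new cluster starts after every gap larger than min_gap.
    sorted_labels = np.concatenate([[0], np.cumsum(gaps > min_gap)])
    labels = np.empty(len(values), dtype=int)
    labels[order] = sorted_labels
    return labels

# Synthetic stand-in for real activations (assumption): three groups of examples
# whose components along the chosen direction sit near -5, 0 and 5.
rng = np.random.default_rng(0)
direction = np.array([1.0, 1.0, 0.0, 0.0])
unit = direction / np.linalg.norm(direction)
centres = np.array([-5.0, 0.0, 5.0])
true_groups = np.repeat(np.arange(3), 50)
along = centres[true_groups] + rng.uniform(-0.5, 0.5, size=150)

# Add noise that is orthogonal to the direction, so it does not affect the projection.
orth = rng.normal(size=(150, 4))
orth -= np.outer(orth @ unit, unit)
activations = np.outer(along, unit) + orth

scalars = project_onto_direction(activations, direction)
labels = cluster_1d_by_gaps(scalars, min_gap=2.0)
print("clusters found:", len(np.unique(labels)))
```

On real activations the interesting part would be what comes after this: looking at the examples inside each 1D cluster and checking whether each cluster is actually meaningful, which the sketch cannot tell you.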