Neel Nanda comments on Interpreting Neural Networks through the Polytope Lens

Neel Nanda 24 Sep 2022 21:46 UTC
“To verify this claim, here we collect together activations in a) the channel dimension in InceptionV1 and b) various MLP layers in GPT2 and cluster them using HDBSCAN, a hierarchical clustering technique”
Are the clusters in this section clustering the entire residual stream (i.e. a vector) or the projection onto a particular direction (i.e. a scalar)?
- Lee Sharkey 27 Sep 2022 19:30 UTC
  For GPT2-small, we selected 6 of the 1024 tokens in each sequence (evenly spaced and excluding the first 100 tokens) and clustered on the entire MLP hidden dimension (4 * 768 = 3072).
  
  For InceptionV1, we clustered the vectors corresponding to all the channel dimensions for a single fixed spatial dimension (i.e. one example of size [n_channels] per image).
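
A minimal sketch of the GPT2-small token selection described above, in case it helps make the setup concrete. The function name `evenly_spaced_positions` and the exact spacing rule (integer step starting at position 100) are assumptions for illustration; the comment only states that 6 evenly spaced tokens were taken, excluding the first 100.

```python
def evenly_spaced_positions(seq_len=1024, n_skip=100, n_select=6):
    """Return n_select evenly spaced token indices in [n_skip, seq_len).

    Hypothetical helper: the spacing rule here is an assumption, not
    necessarily the one used for the experiments in the post.
    """
    span = seq_len - n_skip      # number of positions eligible for selection
    step = span // n_select      # integer gap between consecutive picks
    return [n_skip + i * step for i in range(n_select)]


positions = evenly_spaced_positions()

# Each selected token contributes one vector of this size to the clustering input.
d_mlp = 4 * 768

print(positions)
print(d_mlp)
```

Under these assumptions, each sequence contributes 6 vectors of length 4 * 768, and the clustering runs on those full vectors rather than on individual coordinates.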
  - Neel Nanda 28 Sep 2022 8:33 UTC
    Thanks! I was trying to disentangle two claims. The first is “if examples are semantically similar (either similar patches of images, or words in similar contexts with respect to predicting the next token), the model learns to map their full representations close to each other”. The second is “if we pick a specific direction, the projection onto this direction is polysemantic, but the direction actually intersects many meaningful polytopes, and if we cluster the projection onto this direction (converting examples to scalars) we get clusters in this 1D space, each of which is clearly meaningful”. IMO the first is pretty intuitive, while the second would be surprising and strong evidence for the polytope hypothesis. If I’m understanding correctly, you’re presenting the first one here? Did you investigate the second at all? I’d love to see the results!
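
A toy sketch of why the two claims come apart: points that form clean clusters as full vectors need not form clusters once projected onto a single direction. The data, the helper names `project` and `is_separated`, and the simple "do the 1D ranges overlap" check are all made up for illustration; the check is a stand-in for a real clustering method such as HDBSCAN, and nothing here reflects results from the post.

```python
def project(x, direction):
    """Scalar projection of vector x onto the (normalised) direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * b for a, b in zip(x, direction)) / norm


# Two hypothetical groups that are clearly separated as 2D vectors.
group_a = [(0.0, 1.0), (0.2, 1.1), (-0.1, 0.9)]
group_b = [(0.1, -1.0), (-0.2, -1.1), (0.0, -0.9)]


def is_separated(direction):
    """True if the two groups occupy disjoint intervals after projection."""
    proj_a = [project(x, direction) for x in group_a]
    proj_b = [project(x, direction) for x in group_b]
    return max(proj_b) < min(proj_a) or max(proj_a) < min(proj_b)


for direction in [(1.0, 0.0), (0.0, 1.0)]:
    print(direction, is_separated(direction))
```

In this toy setup the groups stay separated when projected onto the second axis but overlap on the first, so finding clusters in the full space (the first claim) does not by itself imply meaningful clusters along an arbitrary chosen direction (the second claim).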