Thanks! I was trying to disentangle two claims. The first is “if examples are semantically similar (either similar patches of images, or words in similar contexts with respect to predicting the next token), the model learns to map their full representations close to each other”. The second is “if we pick a specific direction, the projection onto this direction is polysemantic, but the direction actually intersects many meaningful polytopes, and if we cluster the projections onto it (converting examples to scalars) we get clusters in this 1D space, each of which is clearly meaningful”. IMO the first is pretty intuitive, while the second would be surprising and strong evidence for the polytope hypothesis. If I’m understanding correctly, you’re presenting the first one here? Did you investigate the second at all? I’d love to see the results!
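To make the second check concrete, here is a rough sketch of the kind of experiment I have in mind. It is only an illustration, not anything taken from the post: the activations are synthetic toy data, the helper names `project_onto_direction` and `kmeans_1d` are made up, and plain 1D k-means is just one assumed choice of clustering method.

```python
import numpy as np

def project_onto_direction(activations, direction):
    """Scalar projection of each activation vector onto the unit-normalised direction."""
    unit = direction / np.linalg.norm(direction)
    return activations @ unit

def kmeans_1d(values, k, n_iters=50):
    """Plain Lloyd's k-means on scalars (an assumed, simple clustering choice).

    Centres are initialised at evenly spaced quantiles, so they come out
    in ascending order.
    """
    centres = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iters):
        # Assign every scalar to its nearest centre.
        labels = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
        # Recompute each centre; keep the old one if a cluster is empty.
        new_centres = np.array([
            values[labels == j].mean() if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    labels = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
    return labels, centres

# Toy stand-in for real model activations (hypothetical data): three groups
# of examples that sit at different offsets along one chosen direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=16)
unit = direction / np.linalg.norm(direction)
offsets = np.array([-4.0, 0.0, 4.0])
groups = np.repeat(np.arange(3), 100)
noise = 0.3 * rng.normal(size=(300, 16))
activations = offsets[groups][:, None] * unit[None, :] + noise

# Convert examples to scalars, then cluster in the resulting 1D space.
scalars = project_onto_direction(activations, direction)
labels, centres = kmeans_1d(scalars, k=3)
```

With real activations, the interesting step would be to look at the examples assigned to each value of `labels` and judge whether each 1D cluster is individually meaningful, even though the direction as a whole looks polysemantic.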