Just on the Dallas example, look at the +8x & −2x below:
So they multiplied all features in the China supernode by 8 and the Texas supernode by −2 (Texas is “under” China, meaning it’s being “replaced”). That’s really weird! If the goal is to replace Texas, it should be multiplying the Texas node by 0. If Texas is upweighting “Austin”, then −2x-ing it could be actively downweighting “Austin”, leading to cleaner-looking top outputs. Notice how all the graphs use different numbers for upweighting & downweighting (it’s good that they include those scalars in the images). This means the SAE latents didn’t cleanly separate the features (that we think exist).
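To make the 0x vs. −2x point concrete, here's a toy sketch. Everything in it is made up for illustration (the numbers, the names, and the assumption that each supernode adds linearly to the output logits); it is not the paper's setup, just the arithmetic of the argument.

```python
import numpy as np

# Hypothetical toy model: each supernode adds a fixed vector to two output
# logits, ordered as [logit("Austin"), logit("Beijing")]. All values invented.
texas_effect = np.array([3.0, 0.0])   # Texas supernode upweights "Austin"
china_effect = np.array([0.0, 0.5])   # China supernode upweights "Beijing"
baseline = np.array([1.0, 1.0])       # everything else the model contributes

def logits(texas_scale, china_scale):
    """Output logits after scaling each supernode's contribution."""
    return baseline + texas_scale * texas_effect + china_scale * china_effect

clean = logits(1, 1)      # no intervention
ablated = logits(0, 8)    # "replace" Texas: remove it, boost China
negated = logits(-2, 8)   # the kind of intervention in the figure: -2x Texas

for name, out in [("clean", clean), ("0x Texas", ablated), ("-2x Texas", negated)]:
    margin = out[1] - out[0]  # how far "Beijing" is ahead of "Austin"
    print(f"{name:>10}: Austin={out[0]:+.1f}  Beijing={out[1]:+.1f}  margin={margin:+.1f}")
```

In this toy, 0x-ing Texas only removes its push toward “Austin”, while −2x-ing it pushes “Austin” below where it would be with no Texas feature at all, so the “Beijing” margin looks much bigger than a clean replacement would give.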
(With that said, in their paper itself, they’re very careful & don’t overclaim what their work shows; I believe it’s a great paper overall!)