The restriction of the loss to the target feels like cheating, to be honest. The linear model claim is scoped to reconstruction loss, where you genuinely don’t see superposition as far as I’m aware. And in this case, the reconstruction loss would be poor, because the vectors are nested so close to each other that adjacent features falsely fire.
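To make that concrete, here’s a minimal sketch of the kind of interference I have in mind, assuming the features are unit vectors placed equidistant around a 2D circle and read out with a plain dot product (my assumption, not necessarily the exact setup in the post):

```python
import numpy as np

# n unit feature vectors spaced equidistant around a 2D circle (assumed setup)
n = 100
angles = 2 * np.pi * np.arange(n) / n
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (n, 2)

# Activate only feature 0, then read every feature out with a plain dot product
x = features[0]
readout = features @ x

print(readout[0])  # target feature: 1.0
print(readout[1])  # adjacent feature: cos(2*pi/n) ~ 0.998, i.e. it fires almost as strongly
```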
I agree with the core point about finding alternative models of superposition though. As far as I know, there is no evidence that the Toy Model paper is accurate to how real models actually represent things, except at the broadest level. Towards Monosemanticity in fact notes divergence from the Toy Model paper (see Feature Splitting).
On the model itself, for p=0.0 and p=1.0, why can’t you place vectors equidistant around the circle, allowing for arbitrarily many features?
To your point about the loss, I believe it’s absolutely correct that this is an entirely different setting than the linear models from TMS. I wouldn’t characterize this as cheating, because it feels entirely possible that models in practice have an effective mechanism for handling lots of interference, but admittedly, the fact that you only select the target feature is the difference that makes this experiment work at all.
On the model itself, for p=0.0 and p=1.0, why can’t you place vectors equidistant around the circle, allowing for arbitrarily many features?
If I understand this question correctly, for p=0.0 it should be possible to have arbitrarily many features. In this setting, there is no possibility for interference, so if you tune hyperparameters correctly, you should be able to get as many features as you want. Empirically, I didn’t find a clear limit, but at the very least I can say that you should be able to get “a lot.” Because all inputs are orthogonal in this case, the results should be very similar to Superposition, Memorization, and Double Descent.
p=1.0 would be an interesting experiment that I didn’t run, but if I had to guess, the results wouldn’t be very clean because there would be quite a bit of interference on each training example.
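As a rough illustration of the interference I’d expect at p=1.0, here’s a sketch comparing the readout along the target direction at p=0.0 and p=1.0. This assumes equidistant unit features on the circle, a target magnitude of 1, and noise magnitudes uniform in [0, 0.1], which is a simplification rather than the trained model itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: n unit features equidistant on the circle, feature 0 is the target
n = 50
angles = 2 * np.pi * np.arange(n) / n
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def target_readout(p, trials=10_000):
    """Mean and std of the input's dot product with the target direction."""
    vals = []
    for _ in range(trials):
        active = rng.random(n - 1) < p                  # which noise features fire
        mags = rng.uniform(0.0, 0.1, size=n - 1) * active
        x = features[0] + mags @ features[1:]           # target + active noise features
        vals.append(features[0] @ x)
    return np.mean(vals), np.std(vals)

print(target_readout(0.0))  # exactly (1.0, 0.0): no interference at p=0.0
print(target_readout(1.0))  # mean pulled below 1.0 with nonzero spread: interference
```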
I guess I mean cheating purely as “I don’t think this applies to the Toy Model setting”, as opposed to saying it’s not a potentially valuable loss to study.
For p=1.0, I forgot that each of the noise features is random between 0 and 0.1, as opposed to fixed magnitude. The reason I brought it up is that if they were fixed at magnitude 0.05, their sum would face in the opposite direction to the target feature with magnitude 0.05 (equidistant unit vectors around the circle sum to zero, so the noise vectors together sum to exactly minus the target direction). Now that I’ve reread the setting, I don’t think that’s relevant, though.
Now I’m curious what the variance in noise looks like as a function of the number of features if you place them equidistant.
This is a very interesting thought! I think your intuition is probably correct even though it is somewhat counterintuitive. Perhaps I’ll run this experiment at some point.
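In case it’s useful, a quick Monte Carlo version of that experiment might look something like this (same caveats as above: unit features equidistant on the circle, every noise feature active with magnitude uniform in [0, 0.1], readout as a plain dot product with the target direction):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_variance(n, trials=10_000):
    """Variance of the total noise projected onto the target direction,
    for n unit features placed equidistant on the circle (all noise active)."""
    angles = 2 * np.pi * np.arange(n) / n
    cos_to_target = np.cos(angles[1:])                  # each noise feature's projection
    mags = rng.uniform(0.0, 0.1, size=(trials, n - 1))  # noise magnitudes, all active
    noise = mags @ cos_to_target                        # per-trial noise along the target
    return noise.var()

for n in [4, 16, 64, 256, 1024]:
    print(n, noise_variance(n))
```

If I’ve set this up correctly, the variance should come out to roughly Var(magnitude) * (n/2 - 1) analytically, i.e. growing about linearly in the number of features, but that’s entirely contingent on my reading of the noise distribution and the equidistant placement.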