I think this is a fun and (initially) counterintuitive result. I’ll try to frame things as they work in my head; it might help people understand the weirdness.
The task of the residual MLP (labelled CC Model here) is to solve $y = x + \operatorname{ReLU}(x)$. Consider the problem from the MLP’s perspective. You might think that the MLP’s problem is just to learn how to compute $\operatorname{ReLU}(x)$ for 100 input features with only 50 neurons. But given that we have this random $W_E$ matrix, the task is actually more complicated. Not only does the MLP have to compute $\operatorname{ReLU}(x)$, it also has to make up for the mess caused by $W_E W_E^\top$ not being the identity.
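For concreteness, here is roughly how I picture the setup (a minimal PyTorch sketch; everything beyond the 100 input features and 50 neurons, e.g. the embedding dimension and the tied $W_E^\top$ readout, is my assumption rather than a detail taken from the post):

```python
# Minimal sketch of the residual MLP setup as I picture it (PyTorch).
# The 100 features and 50 neurons are from the post; d_embed, the
# unit-norm rows, and the tied W_E^T readout are my assumptions.
import torch

n_features, d_embed, d_mlp = 100, 1000, 50  # d_embed is a guess

# Fixed random embedding (not trained), rows normalized to unit norm.
W_E = torch.randn(n_features, d_embed)
W_E = W_E / W_E.norm(dim=-1, keepdim=True)

# The MLP's trainable parameters.
W_in = torch.nn.Parameter(torch.randn(d_embed, d_mlp) / d_embed**0.5)
W_out = torch.nn.Parameter(torch.randn(d_mlp, d_embed) / d_mlp**0.5)

def model(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, n_features). Target is y = x + relu(x)."""
    resid = x @ W_E                                   # embed into the residual stream
    resid = resid + torch.relu(resid @ W_in) @ W_out  # MLP writes back into it
    return resid @ W_E.T                              # read out through W_E^T

# The "mess": W_E @ W_E.T is only approximately the identity, so even the
# skip connection alone returns x plus interference from the other features.
gram = W_E @ W_E.T
print((gram - torch.eye(n_features)).abs().max())
```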
But it turns out that making up for this mess actually makes the problem easier!
Yes! But only if the mess is in the residual stream, i.e. includes $x$! This is the heart of the necessary “feature mixing” we discuss in the post.
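To make “the mess includes $x$” concrete (my own unpacking, assuming unit-norm rows of $W_E$ so that $W_E W_E^\top$ has ones on the diagonal): the skip connection alone reads out

$$\big(W_E W_E^\top x\big)_i \;=\; x_i \;+\; \sum_{j \neq i} \big(W_E W_E^\top\big)_{ij}\, x_j,$$

so the mess the MLP has to cancel is itself a function of the other input features, rather than noise that is independent of $x$.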