I think this is a fun and (initially) counterintuitive result. I’ll try to frame things as they work in my head; it might help people understand the weirdness.
The task of the residual MLP (labelled CC Model here) is to solve $y = x + \operatorname{ReLU}(x)$. Consider the problem from the MLP’s perspective. You might think that the MLP’s problem is just to learn how to compute $\operatorname{ReLU}(x)$ for 100 input features with only 50 neurons. But given that we have this random $W_E$ matrix, the task is actually more complicated. Not only does the MLP have to compute $\operatorname{ReLU}(x)$, it also has to make up for the mess caused by $W_E W_E^\top$ not being the identity.
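For concreteness, here is roughly how I picture the setup (a minimal PyTorch sketch; everything beyond the 100 input features and 50 neurons, e.g. the embedding dimension and the tied $W_E^\top$ readout, is my assumption rather than a detail taken from the post):

```python
# Minimal sketch of the residual MLP setup as I picture it (PyTorch).
# The 100 features and 50 neurons are from the post; d_embed, the
# unit-norm rows, and the tied W_E^T readout are my assumptions.
import torch

n_features, d_embed, d_mlp = 100, 1000, 50  # d_embed is a guess

# Fixed random embedding (not trained), rows normalized to unit norm.
W_E = torch.randn(n_features, d_embed)
W_E = W_E / W_E.norm(dim=-1, keepdim=True)

# The MLP's trainable parameters.
W_in = torch.nn.Parameter(torch.randn(d_embed, d_mlp) / d_embed**0.5)
W_out = torch.nn.Parameter(torch.randn(d_mlp, d_embed) / d_mlp**0.5)

def model(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, n_features). Target is y = x + relu(x)."""
    resid = x @ W_E                                   # embed into the residual stream
    resid = resid + torch.relu(resid @ W_in) @ W_out  # MLP writes back into it
    return resid @ W_E.T                              # read out through W_E^T

# The "mess": W_E @ W_E.T is only approximately the identity, so even the
# skip connection alone returns x plus interference from the other features.
gram = W_E @ W_E.T
print((gram - torch.eye(n_features)).abs().max())
```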
But it turns out that making up for this mess actually makes the problem easier!
Yes! But only if the mess is in the residual stream, i.e. includes $x$! This is the heart of the necessary “feature mixing” we discuss in the post.
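To make “the mess includes $x$” concrete (my own unpacking, assuming unit-norm rows of $W_E$ so that $W_E W_E^\top$ has ones on the diagonal): the skip connection alone reads out

$$\big(W_E W_E^\top x\big)_i \;=\; x_i \;+\; \sum_{j \neq i} \big(W_E W_E^\top\big)_{ij}\, x_j,$$

so the mess the MLP has to cancel is itself a function of the other input features, rather than noise that is independent of $x$.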