Just to check: in the toy scenario, we assume the features in R^n are the coordinates in the standard basis, so we have n features X_1, …, X_n?
Yes, that’s correct.
Separately, do you have intuition for why they allow the network to learn b? Why not set b to zero as well?
My understanding is that the bias is thought to be useful for two reasons:
It lets the model output a non-zero value (namely the expected value) for features it chooses not to represent.
A negative bias lets the model zero out small interference terms by shifting them below zero so that the ReLU outputs exactly zero (see the sketch below). I think empirically, when these toy models exhibit lots of superposition, the bias vector typically has many negative entries.
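A minimal numpy sketch of the second point, assuming the toy-model reconstruction x̂ = ReLU(WᵀWx + b); the random W, the unit-norm normalization, and the bias value -0.2 are illustrative choices, not the trained weights from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 5, 2           # more features than hidden dimensions
W = rng.normal(size=(n_hidden, n_features))
W /= np.linalg.norm(W, axis=0)        # unit-norm feature directions (illustrative)

def reconstruct(x, b):
    """Toy-model reconstruction: x_hat = ReLU(W^T W x + b)."""
    return np.maximum(W.T @ (W @ x) + b, 0.0)

x = np.zeros(n_features)
x[0] = 1.0                            # only feature 0 is active

# With zero bias, interference from the other superposed features
# (the dot products w_i · w_0) leaks through the ReLU wherever it is positive.
print(reconstruct(x, b=np.zeros(n_features)))

# A small negative bias pushes those interference terms below zero, so the
# ReLU clips them to exactly zero, while the active feature (value 1) survives.
print(reconstruct(x, b=-0.2 * np.ones(n_features)))
```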