But there’s no reason to think that the model is actually using a sparse set of components/features on any given forward pass.
I contest this. If a model wants to implement more computations (for example, logic gates) in a layer than that layer has neurons, the known constructions for doing so rely on only a few of those computations being active (that is, receiving a non-baseline input) on any given forward pass.
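A toy numerical sketch (my own illustration, not from the original comment, with all names and parameters hypothetical) of why those constructions depend on sparsity: pack more feature directions than dimensions into a layer using random near-orthogonal vectors, then try to linearly read one feature back out. When few features are active, the interference terms are small; when many are active, they swamp the signal.

```python
import numpy as np

rng = np.random.default_rng(0)

n_dim, n_feat = 64, 512  # more "computations" (features) than neurons

# Random unit directions serve as an overcomplete, almost-orthogonal basis
W = rng.standard_normal((n_feat, n_dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def readout_error(k_active: int) -> float:
    """Mean absolute error when recovering one active feature's value
    from a superposed activation with k_active features switched on."""
    errs = []
    for _ in range(200):
        idx = rng.choice(n_feat, size=k_active, replace=False)
        vals = rng.uniform(0.5, 1.5, size=k_active)
        x = vals @ W[idx]      # superposed layer activation
        est = W[idx[0]] @ x    # linear readout of the first active feature
        errs.append(abs(est - vals[0]))
    return float(np.mean(errs))

sparse_err = readout_error(k_active=3)    # few computations in use
dense_err = readout_error(k_active=100)   # many computations in use
print(sparse_err, dense_err)
```

With only 3 features active the readout error stays small, while with 100 active the cross-terms between non-orthogonal directions dominate, which is the sense in which these capacity tricks break down without per-forward-pass sparsity.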