I think we were fairly confident it was going to be the MLP blocks, but attention also has a non-linearity via the softmax.
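The softmax point can be made concrete with a quick check: a linear map f would satisfy f(a + b) = f(a) + f(b), and softmax visibly does not. A minimal sketch (plain Python, helper names are mine, not from any particular library):

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

a = [1.0, 2.0, 3.0]
b = [0.5, -1.0, 2.0]

# If softmax were linear, these two would match; they do not.
lhs = softmax([x + y for x, y in zip(a, b)])
rhs = [x + y for x, y in zip(softmax(a), softmax(b))]

print(lhs)  # sums to 1 (it is a probability distribution)
print(rhs)  # sums to 2 (sum of two distributions), so it cannot equal lhs
```

Since attention weights are produced by a softmax over the query-key scores, the attention output is a non-linear function of its inputs even before any MLP block is applied.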