It is not known whether the inductive bias of neural network training contains a preference for run-time error-correction. The phenomenon of “backup heads” observed in transformers seems like a good candidate. Can you think of others?
I’ve heard thirdhand (?) of a transformer whose sublayers L(h) = o dampen their outputs when o is already present in that sublayer’s input h. I.e., there might be a “target” amount of o to have in the residual stream after that sublayer, and the sublayer itself somehow responds to ensure that happens?
If there were some abnormality and a bunch of o were already present, then the sublayer “error-corrects” by shrinking its output.
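To make the hypothesized mechanism concrete, here is a minimal toy sketch (not a claim about any real transformer): a hypothetical sublayer that aims for a fixed “target” amount of some direction in the residual stream, and so emits only the deficit. The function name, the single-direction simplification, and the target value are all assumptions for illustration.

```python
import numpy as np

def sublayer_with_target(h, direction, target=1.0):
    """Hypothetical error-correcting sublayer: it 'wants' the residual
    stream to contain `target` units of `direction` after it runs,
    so it emits only whatever amount is still missing."""
    present = h @ direction              # how much of `direction` is already in the input
    return (target - present) * direction  # output shrinks as `present` grows

rng = np.random.default_rng(0)
d = rng.normal(size=8)
d /= np.linalg.norm(d)                   # unit "output direction"

h_clean = rng.normal(size=8)
h_clean -= (h_clean @ d) * d             # normal case: none of d present yet
o1 = sublayer_with_target(h_clean, d)

h_perturbed = h_clean + 0.7 * d          # abnormality: d already injected upstream
o2 = sublayer_with_target(h_perturbed, d)

# The perturbed-input output is smaller: the sublayer "error-corrects".
print(np.linalg.norm(o1), np.linalg.norm(o2))

# In both cases the post-sublayer stream ends up with exactly `target` of d.
print((h_clean + o1) @ d, (h_perturbed + o2) @ d)
```

Under this toy mechanism the dampening falls out of the sublayer reading its own output direction from its input, which is one candidate story for how a trained sublayer could implement a “target amount” without any explicit error-correction machinery.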