The fundamental idea, that genes have an advantage over weights at internally implementing looping algorithms, is apparently wrong though (even though I don't understand how the contrary is possible...)
I’ve been trying to understand this myself. Here’s the understanding I’ve come to, which is very simplistic. If someone who knows more about transformers than I do says I’m wrong, I will defer to them. I used this paper to come to this understanding.
In order to have a mesa-optimizer, lots and lots of layers need to be in on the game of optimization, rather than just one or several key elements which get referenced repeatedly during the optimization process.
But self-attention is, by default, not very far away from being one step of gradient descent. The layers don’t each need to learn to do optimization independently from scratch, since it’s relatively easy to find given the self-attention architecture.
That’s why it’s not forbiddingly difficult for neural networks to implement internal optimization algorithms. It still could be forbiddingly difficult for most optimization algorithms, ones that aren’t easy to find from the basic architecture.
If you have a more detailed grasp on how exactly self-attention is close to a gradient descent step, please do let me know; I’m having a hard time making sense of the details of these papers.
Note that if computing an optimization step reduces the loss, the training process will reinforce it, even if other layers aren’t doing similar steps, so this is another reason to expect more explicit optimizers.
Basically, self-attention is a function of certain matrices, something like this:
$$e_j \leftarrow e_j + \sum_h P_h W_{h,V} \sum_i \left(e_{h,i} \otimes e_{h,i}\right) W_{h,K}^\top W_{h,Q}\, e_j$$
It looks really messy when you put it like this, but it’s pretty natural in context.
If you can get the big messy looking term to approximate a gradient descent step for a given loss function, then you’re golden.
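If I’m understanding the setup correctly, the loss function in question is in-context linear regression on the $(x_i, y_i)$ pairs in the prompt. A sketch in my own notation (so treat the exact form as my assumption, not the paper’s statement):

$$L(W) = \frac{1}{2N}\sum_{i=1}^{N}\lVert W x_i - y_i\rVert^2, \qquad \Delta W = -\eta\,\nabla_W L = -\frac{\eta}{N}\sum_{i=1}^{N}(W x_i - y_i)\,x_i^\top.$$

The resulting change to the prediction for a query $x_j$ is $\Delta W\, x_j = -\frac{\eta}{N}\sum_i \big((W x_i - y_i)\otimes x_i\big)\, x_j$, which has the same shape as the attention term above: a sum over context tokens of an outer product, applied to the query.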
In Appendix A.1, they show the matrices that yield this gradient descent step. They are pretty simple, and probably an easy point of attraction to find.
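To check this against my own understanding, here is a small numerical sketch in numpy. It is not the paper’s code, and the specific matrices below (plus the single-head, linear-attention, $W_0 = 0$ simplification) are hand-picked choices of mine rather than necessarily the exact Appendix A.1 construction; the point is just that one linear self-attention update of the form above can reproduce one gradient descent step on in-context linear regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context linear regression data: N examples (x_i, y_i) with y_i = w_true . x_i
d, N = 4, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true

# Tokens e_i = (x_i, y_i); the query token carries x_q with its y-slot zeroed.
x_q = rng.normal(size=d)
tokens = np.hstack([X, y[:, None]])            # shape (N, d+1), context tokens
e_q = np.concatenate([x_q, [0.0]])             # query token

eta = 0.1  # learning rate of the simulated gradient descent step

# --- One explicit GD step on L(W) = 1/(2N) sum_i ||W x_i - y_i||^2, from W = 0 ---
W0 = np.zeros(d)
grad = (X.T @ (X @ W0 - y)) / N                # gradient at W = 0
W1 = W0 - eta * grad                           # weights after one GD step
pred_gd = W1 @ x_q                             # GD prediction for the query

# --- One linear self-attention update with hand-picked matrices (single head) ---
# W_K = W_Q project a token onto its x-part; W_V extracts its y-part; P scales by eta/N.
W_K = np.zeros((d + 1, d + 1)); W_K[:d, :d] = np.eye(d)
W_Q = W_K.copy()
W_V = np.zeros((d + 1, d + 1)); W_V[d, d] = 1.0
P = (eta / N) * np.eye(d + 1)

# Delta e_q = P W_V sum_i e_i (e_i^T W_K^T W_Q e_q)  -- the "sum of outer products" term
scores = tokens @ W_K.T @ W_Q @ e_q            # x_i . x_q for each context token
delta = P @ W_V @ (tokens.T @ scores)          # value vectors weighted by the scores
e_q_new = e_q + delta

pred_attn = e_q_new[d]                         # the y-slot now holds the prediction

print(pred_gd, pred_attn)                      # these agree up to floating point
assert np.allclose(pred_gd, pred_attn)
```

The y-slot of the query token ends up holding exactly the prediction of the weights after one gradient step from zero, which is the sense in which this attention update “is” a gradient descent step, at least on my reading.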
All of this reasoning is pretty vague, and without the experimental evidence it wouldn’t be nearly good enough. So there’s definitely more to understand here. But given the experimental evidence I think this is the right story about what’s going on.