I’ve had double descent as part of my talk version of “Risks from Learned Optimization” because I think it addresses a pretty important part of the story for mesa-optimization. That is, mesa-optimizers are simple, compressed policies—but as ML moves to larger and larger models, why should that matter? The answer, I think, is that larger models can generalize better not just by fitting the data better, but also by being simpler.
Optimization algorithms are simple in the sense that they loop a lot, repeating the same basic cycle to yield multiple candidate outputs. However, this isn’t necessarily a kind of simplicity that makes them more likely to emerge inside of neural networks. In basic MLPs, each iteration of an optimization loop has to be implemented individually, because each weight only gets used once per forward pass. In that setting, looping optimization algorithms aren’t compressible at all: every iteration has to be paid for with its own parameters.
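To make this concrete, here is a minimal sketch (all names and sizes are illustrative, not from the original): a true loop reuses one set of weights across steps, while an MLP emulating the same computation must spend a fresh weight matrix on every step, so its parameter count grows linearly with the number of iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
step_weights = rng.normal(size=(d, d)) * 0.1  # one "iteration" of the loop

def iterative_refine(x, n_steps):
    # A true loop: the SAME weights get applied n_steps times.
    for _ in range(n_steps):
        x = np.tanh(step_weights @ x)
    return x

# An MLP computing the same thing cannot reuse weights across layers;
# it needs n_steps separate copies of the iteration, one per layer.
mlp_layers = [step_weights.copy() for _ in range(3)]

def mlp_forward(x):
    for W in mlp_layers:  # each W is a distinct block of parameters
        x = np.tanh(W @ x)
    return x

x0 = rng.normal(size=d)
# The two agree numerically, but the MLP pays 3x the parameters.
assert np.allclose(iterative_refine(x0, 3), mlp_forward(x0))
```

The outputs match, but nothing in the MLP's parameterization reflects the fact that its three layers are "the same" iteration; that redundancy is exactly the incompressibility being described.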
Transformers have somewhat better access to looping algorithms, because all of their weights get applied to every token in context, effectively constituting a loop of weight application. However, these loops are fundamentally non-iterative. At no point in the repeated weight application process do you feed in the output of a previous weight application. Instead, the repeated weight application has to be 100% parallelizable. So you still don’t have an easy way to implement fundamentally sequential optimization algorithms, such as iteratively refining a neural network across multiple gradient steps. Each sequential operation has to be implemented by different weights inside the network.
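A toy sketch of this point (illustrative names, not a real transformer layer): the per-token weight applications in a feed-forward block form a loop over tokens, but since no application consumes another's output, the whole loop collapses into one parallel matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_tokens = 4, 5
W = rng.normal(size=(d, d)) * 0.1       # weights shared across all tokens
tokens = rng.normal(size=(n_tokens, d))

# "Loop" over tokens: one weight application per token...
looped = np.stack([np.tanh(W @ t) for t in tokens])

# ...but because no application depends on another's output, the loop
# is equivalent to a single fully parallel matrix multiply.
parallel = np.tanh(tokens @ W.T)

assert np.allclose(looped, parallel)
```

The equivalence is the crux: weight sharing across tokens buys reuse, but only for computations with no step-to-step data dependence, which rules out the sequential structure an optimization loop needs.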
RNNs are a fuller exception: their repeated weight applications are inherently sequential, in the way transformers’ repeated weight applications are not.
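A sketch of the contrast (again with illustrative names): in an RNN, each application of the recurrent weights consumes the output of the previous application, so the same parameters implement a genuinely sequential loop.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
W_h = rng.normal(size=(d, d)) * 0.1  # recurrent weights, reused every step
W_x = rng.normal(size=(d, d)) * 0.1  # input weights, reused every step

def rnn(inputs):
    h = np.zeros(d)
    for x in inputs:
        # h feeds back into the next weight application, so the steps
        # cannot be parallelized away as in the transformer case.
        h = np.tanh(W_h @ h + W_x @ x)
    return h

inputs = rng.normal(size=(6, d))
h_final = rnn(inputs)

# Sequential dependence: reversing the input order changes the result,
# whereas a purely parallel per-token map would just permute outputs.
assert not np.allclose(h_final, rnn(inputs[::-1]))
```

One shared set of weights, applied six times, with real step-to-step data flow: exactly the kind of compressed loop that MLPs and transformer forward passes lack.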
Anyway, in any case where you need to implement each iteration of your looping algorithm individually, that’s a strike against it appearing in the course of backpropagation-driven weight updates. This is because backprop updates weights locally, without regard for how any of the other weights are being altered by a given training example. For each location in the network where a loop has to appear, the network needs to have independently converged on implementing an iteration of the loop in that location. You can’t just set up one set of weights at one location in the network and get an algorithm that gets applied repeatedly.