[Question] Three questions about mesa-optimizers

I recently (re-)read Risks From Learned Optimization, which introduced mesa-optimizers and inner alignment. (I don’t aim to explain what these things mean in this post, but here’s Scott Alexander’s explanation from earlier today.) I have some questions about mesa-optimizers, all to the effect of “What would this actually look like?”

1. Would training a neural net on an optimization task produce a mesa-optimizer?

My understanding (correct me if I’m wrong) is that as of yet we do not have an example of mesa-optimization in practice. So, what would happen if we trained a neural net to do a task that clearly requires optimization?

For example, consider the following task: the neural net receives a polynomial P(x) (perhaps with its coefficients represented in binary as input), and the neural net is to output the value x in the interval [0, 1] such that P(x) is as large as possible. Perhaps the loss on output x is the squared difference between P(x) and the true largest value of P on [0, 1].

I’m not sure what I expect the neural net to learn to do. Maybe it would try lots of values of x and output the optimal x of the ones it tried. Maybe it would learn to take the derivative of P and do Newton’s method. But I can’t imagine it learning to do this well without learning to do some sort of search or optimization. It seems, then, that if a neural net were successfully trained on this task, it would necessarily have a mesa-optimizer.

So my question is: (a) does what I said seem right? And (b) what would happen if we actually tried this now? Are current neural network architectures (with current levels of compute) able to learn this task? Or maybe a simpler task in the same spirit?

2. What would a neural net doing search actually look like?

When explaining mesa-optimizers, I currently wave my hands and say something like “Neural networks are arbitrarily expressive, so it is possible to encode a for loop to do search as weights in a neural net.” But I’d love to actually see what this looks like, as honest-to-goodness weights. I’m not sure whether the following exercise is pretty easy or extremely hard, but consider the following program that, given n, searches for k < n^0.7 that maximizes k^2 mod n:

Input: integer n ⇐ 2^30.
k = 0
for i = 1 to n^0.7:
if i*i mod n > k*k mod n:
k = i
return k

How would one write this program as a neural net? (What do I mean by “write this program as a neural net”? I basically mean that if you spent enough time, you could explain to me what exactly the neural net is doing and I would come out of the conversation satisfied that the neural net’s weights act in such a way that it’s doing something that is equivalent to the above for loop.) How big would the neural net need to be?

(Note: as I was writing this, I noticed that the neural net would need to be really big—linear in n^0.7, exponential in the length of n encoded as a bit string. That’s because a forward pass in a neural net takes time linear in the number of connections, so a neural net isomorphic to the one above would need a number of connections linear in n^0.7. Does this mean that neural nets that do search need to be quite large, or am I missing something? What does this imply about the point at which mesa-optimization becomes more effective than memorizing heuristics?)

3. Do we need mechanistic interpretability to check whether a neural net contains a mesa-optimizer?

At first I thought that the answer is yes: how could we know the details of what a neural net is doing without looking inside? Now I’m not so sure: perhaps it’s possible to detect optimizers purely through input-output behavior. For example, suppose our neural net solves a problem for which some inputs are particularly challenging and require optimization in order to attain a low loss. If the neural net has learned an optimization algorithm, it would quite plausibly do as well on these inputs as on others. If it hasn’t, it would do worse. So maybe we can detect mesa-optimizers by comparing the neural net’s loss on “easy” inputs to its loss on “hard” inputs.

I don’t know to what extent this approach is feasible, and whether it’s pretty generally useful or only useful on some toy mathematical problems.

No comments.