Look at that! People have used interpretability to make a mesa layer! https://arxiv.org/pdf/2309.05858.pdf
This might do more for alignment. Better that we understand mesa-optimization and can engineer it than have it mysteriously emerge.
Good point! Overall I don’t anticipate these layers will give you much control over what the network ends up optimizing for, but I don’t fully understand them yet either, so maybe you’re right.
Do you have specific reason to think modding the layers will easily let you control the high-level behavior, or is it just a justified hunch?
Not in isolation, but that’s just because characterizing the ultimate goal / optimization target of a system is way too difficult for the field right now. I think the important question is whether interp brings us closer, such that, in conjunction with more theory and/or the ability to iterate, we can get some alignment and/or corrigibility properties.
I haven’t read the paper and I’m not claiming that this will be counterfactual to some huge breakthrough, but understanding in-context learning algorithms definitely seems like a piece of the puzzle. To give a fanciful story from my skim: the paper says the model constructs an internal training set. Say we have a technique to excise power-seeking behavior from models by removing the influence of certain training examples. If the model’s mesa-optimization algorithms operate differently, our technique might not work until we understand those algorithms and adapt it. Or we could edit the internal training set directly rather than trying to influence it indirectly.
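To make the "edit the internal training set directly" idea concrete, here is a minimal toy sketch. It is not the paper's architecture or code: just a stand-in mesa layer that fits a ridge-regression model to in-context (key, value) pairs in closed form, plus a hypothetical intervention that removes chosen examples by zeroing their weights. All names and parameters (mesa_layer, weights, ridge) are illustrative assumptions.

```python
import numpy as np

def mesa_layer(keys, values, query, weights=None, ridge=1e-3):
    """Predict a value for `query` by solving weighted ridge regression
    over the in-context (key, value) pairs -- the 'internal training set'."""
    n, d = keys.shape
    w = np.ones(n) if weights is None else weights
    A = keys.T @ (w[:, None] * keys) + ridge * np.eye(d)  # d x d normal equations
    b = keys.T @ (w[:, None] * values)                    # d x value_dim
    W = np.linalg.solve(A, b)                             # closed-form least-squares fit
    return query @ W

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))    # 8 in-context examples, 4-dim keys
values = rng.normal(size=(8, 2))  # 2-dim values
query = rng.normal(size=(4,))

baseline = mesa_layer(keys, values, query)

# Hypothetical intervention: excise examples 2 and 5 from the internal
# training set by zeroing their weights, without touching the outer
# training data at all.
weights = np.ones(8)
weights[[2, 5]] = 0.0
edited = mesa_layer(keys, values, query, weights=weights)

print("prediction before edit:", baseline)
print("prediction after edit: ", edited)
```

The point is just the shape of the intervention: if the mesa layer's "training set" lives in activations, an edit there takes effect at inference time, with no retraining of the outer model.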
Evan Hubinger: In my paper, I theorized about the mesa optimizer as a cautionary tale
Capabilities researchers: At long last, we have created the Mesa Layer from classic alignment paper Risks From Learned Optimization (Hubinger, 2019).
@TurnTrout @cfoster0 you two were skeptical. What do you make of this? They explicitly build upon the copying heads work Anthropic’s interp team has been doing.
As garrett says—not clear that this work is net negative. Skeptical that it’s strongly net negative. Haven’t read deeply, though.