Fascinating work, thanks for this post.
Using smaller generative models as initializations for larger ones.
(The equivalent ELK proposal goes into this strategy in more detail).
Do you have a link to the ELK proposal you’re referring to here? (I tried googling for “ELK” along with the bolded text above but nothing relevant seemed to come up.)
I think [non-myopic non-optimizer is a coherent concept] - as a simple example we could imagine GPT trained for its performance over the next few timesteps. Realistically this would result in a mesa-optimizer, but in theory it could just run a very expensive version of next-token generation, over the much larger space of multiple tokens.
“Realistically this would result in a mesa-optimizer” seems like an overly confident statement? It might result in a mesa-optimizer, but unless I’ve missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.
(This is a nitpick and I also don’t mean to trivialize the inner alignment problem which I am quite worried about! But I did want to make sure I’m not missing anything here and that I’m broadly on the same page as other reasonable folks about expectations/evidence for mesa-optimizers.)
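To make the multi-token training idea quoted above a bit more concrete, here is a minimal sketch (my own, not from the post) of what “trained for its performance over the next few timesteps” could look like: instead of a purely myopic next-token loss, the model rolls out a few of its own predictions and is scored on all of them. The tiny GRU model, the sizes, and the argmax feedback are illustrative assumptions, not anything the post specifies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMBED, HIDDEN, K = 100, 32, 64, 4   # K = how many future timesteps get scored

class TinyLM(nn.Module):
    """Toy stand-in for 'GPT': embed -> GRU -> vocab logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.rnn = nn.GRU(EMBED, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens, hidden=None):
        out, hidden = self.rnn(self.embed(tokens), hidden)
        return self.head(out), hidden

def k_step_loss(model, prefix, targets):
    """A myopic objective would stop after one step; here the model's own
    (detached) predictions are fed back in, so the loss depends on its
    behaviour over the next K timesteps rather than just the next token."""
    _, hidden = model(prefix[:, :-1])              # encode all but the last prefix token
    inp = prefix[:, -1:]                           # last observed token starts the rollout
    loss = 0.0
    for t in range(K):
        logits, hidden = model(inp, hidden)        # logits: (batch, 1, VOCAB)
        loss = loss + F.cross_entropy(logits[:, -1], targets[:, t])
        inp = logits[:, -1].argmax(-1, keepdim=True).detach()  # feed back own prediction
    return loss / K

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
prefix = torch.randint(0, VOCAB, (8, 16))          # fake batch of 16-token prefixes
targets = torch.randint(0, VOCAB, (8, K))          # the next K ground-truth tokens
opt.zero_grad()
k_step_loss(model, prefix, targets).backward()
opt.step()
```

The point of feeding back the model’s own (detached) predictions is that the loss now depends on how the model behaves over the whole K-step window, which is the non-myopic part; nothing in the objective itself forces the model to become an internal optimizer to do well on it, matching the quoted “in theory it could just run a very expensive version of next-token generation.”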
An acceptability predicate for non-agency. [...] If a model becomes agentic briefly, it could encode into its world model a deceptive super-intelligence that has its objective, before SGD guides it back into the safe zone.
That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.
This seems like a very broad predicate, however. What would it actually look like?
Do you have thoughts on how to achieve this predicate? I’ve written some about interpretability-based myopia verification which I think could be the key.
I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, then that might suffice for this.
I’m excited; I’ve explored before how having interpretability that can both reliably detect mesa-optimizers and read off their goals would have the potential to solve alignment. But I hadn’t considered how reliable mesa-optimization detection alone might be enough, because I wasn’t considering generative models in that post. (Even if I had, I wasn’t yet aware of some of the clever and powerful ways that generative models could be used for alignment that you describe in this post.)
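For what it’s worth, here is a toy sketch (mine, with a stubbed-out predicate, since the needed interpretability tools don’t exist yet) of the “near-continuous verification during training” shape this could take: an ordinary training loop that runs a hypothetical mesa-optimization detector every few steps and aborts if it ever fires. `detects_mesa_optimization` is a placeholder name, not a real library function.

```python
import torch
import torch.nn.functional as F

def detects_mesa_optimization(model) -> bool:
    """Hypothetical interpretability-based check for optimizer-like internal
    structure. Stubbed out here: always returns False."""
    return False

def train_with_acceptability_check(model, data, check_every=10):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step, (x, y) in enumerate(data):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
        # Near-continuous verification: run the acceptability predicate often
        # enough that a briefly-agentic model can't entrench anything first.
        if step % check_every == 0 and detects_mesa_optimization(model):
            raise RuntimeError(f"acceptability predicate failed at step {step}")

# Tiny illustrative usage with a linear classifier and random data.
model = torch.nn.Linear(4, 2)
data = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(50)]
train_with_acceptability_check(model, data)
```

How often the check would have to run, and whether failing it should mean halting, rolling back to an earlier checkpoint, or adding a penalty term, is exactly the kind of question the “brief agency” worry above raises.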
How would we mechanistically incentivize something like non-agency?
I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.
Sorry for the (very) late reply!
Do you have a link to the ELK proposal you’re referring to here?
Yep, here. I linked to it in a footnote, didn’t want redundancy in links, but probably should have anyway.
“Realistically this would result in a mesa-optimizer” seems like an overly confident statement? It might result in a mesa-optimizer, but unless I’ve missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.
Hmm, I was thinking of that in the frame of the future point where we’d actually start worrying about mesa-optimizers. In that situation, I think mesa-optimizers would be more likely than not, because the task is much harder to achieve good performance on (although on further thought I’m holding to this belief less strongly, because of ambiguity around the distance in model space between optimizers and generative models). I agree that trying to do this right now would probably just result in bad performance.
That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.
I agree that we’ll need a strong constant gradient to prevent this (and other things), but while I think this is definitely something to fix, I’m not very worried about this possibility, both because the model would have to be simultaneously deceptive in the brief period it’s agentic, and because this might not be a very good avenue of attack—it might be very hard to do this in a few timesteps, the world model might forget this, and simulations may operate in a way that only really “agentifies” whatever is being directly observed / amplified.
I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, then that might suffice for this.
I agree almost entirely—I was mainly trying to break down the exact capabilities we’d need the interpretability tools to have there. What would detecting mesa-optimizers entail mechanistically, etc.
But I hadn’t considered how reliable mesa-optimization detection alone might be enough, because I wasn’t considering generative models in that post.
I think this is very promising as a strategy, yeah, especially because of the tilt against optimization by default—I think my main worry is getting it to work before RL reaches AGI level.
I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.
I think they have a strong bias (in a very conceptual sense) against something like agency, but larger models could end up being optimizers, whether because that achieves greater performance past a certain scale like you said, or because of different training paths. And even if it’s just pretty easy to make one an optimizer if you push it hard enough (with RL or something), that could still reduce the time we have.