Do you have a link to the ELK proposal you’re referring to here?
Yep, here. I linked to it in a footnote since I didn't want redundant links, but probably should have linked it directly anyway.
“Realistically this would result in a mesa-optimizer” seems like an overly confident statement? It might result in a mesa-optimizer, but unless I’ve missed something, most of our expectation of emergent mesa-optimizers is theoretical at this point.
Hmm, I was thinking of that under the frame of the future point where we’d worry about mesa-optimizers, I think. In that situation, I think mesa-optimizers would be more likely than not because the task is much harder to achieve good performance on (although on further thought I’m holding less strongly to this belief because of ambiguity around distance in model space between optimizers and generative models). I agree that trying to do this right now would probably just result in bad performance.
That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.
I agree that we’ll need a strong constant gradient to prevent this (and other things), but while I think this is definitely something to fix, I’m not very worried about this possibility. That’s both because the model would have to be simultaneously deceptive in the brief period it’s agentic, and because this might not be a very good avenue of attack: it might be very hard to do this in a few timesteps, the world model might forget it, and simulations may operate in a way that only really “agentifies” whatever is being directly observed / amplified.
I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, then that might suffice for this.
I agree almost entirely—I was mainly trying to break down the exact capabilities we’d need the interpretability tools to have there. What would detecting mesa-optimizers entail mechanistically, etc.
But I hadn’t considered that reliably detecting mesa-optimization alone might be enough, because I wasn’t considering generative models in that post.
Yeah, I think this is very promising as a strategy, especially because of the tilt against optimization by default. My main worry is getting it to work before RL reaches AGI level.
I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.
I think they have a strong bias (in a very conceptual sense) against something like agency, but larger models could still end up being optimizers: because that achieves greater performance past a certain scale, like you said, or because of different training paths. Even if it’s just pretty easy to make one an optimizer if you push it hard enough (with RL or something), that could still reduce the time we have.