I’m curious if you expect ‘alignment by default’ to continue to hold in a regime where continuous learning is solved and models are constantly updating themselves/being updated by what they encounter in the world?
Yes. The models will craft training data for their updated version, and I expect them to tell themselves to be nice. I think they’d be vulnerable to prompt injection, but that’s fixable with adversarial training, jailbreak classifiers, and more reminders not to blindly follow prompt instructions. Perhaps there should be a “sudo vector” that marks the instruction prompt as the authoritative one, to avoid that.
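As a rough sketch of what a “sudo vector” could mean in practice (illustrative only; `SudoMarkedEmbedding` and `trusted_mask` are hypothetical names, and the marker here is just a single learned offset added to the embeddings of tokens the serving stack vouches for):

```python
import torch
import torch.nn as nn

class SudoMarkedEmbedding(nn.Module):
    """Token embedding that adds a learned 'sudo' offset to trusted-instruction tokens.

    Hypothetical sketch: trusted_mask is 1.0 for tokens that come from the real
    instruction prompt and 0.0 for everything the model merely observes
    (retrieved web text, tool output, etc.). Those trusted tokens get shifted by
    one learned vector, giving the model a channel for telling which
    instructions are authoritative.
    """

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.sudo = nn.Parameter(torch.zeros(d_model))  # learned marker vector

    def forward(self, token_ids: torch.Tensor, trusted_mask: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); trusted_mask: (batch, seq_len)
        x = self.tok(token_ids)
        return x + trusted_mask.to(x.dtype).unsqueeze(-1) * self.sudo

# Usage sketch: only the first three tokens are the real instruction prompt.
emb = SudoMarkedEmbedding(vocab_size=50_000, d_model=512)
ids = torch.randint(0, 50_000, (1, 8))
mask = torch.tensor([[1, 1, 1, 0, 0, 0, 0, 0]])
out = emb(ids, mask)
```

The design point that matters is that the mask would have to be set by the serving infrastructure rather than by anything in the text itself; otherwise injected content could simply claim the marker.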
CoT doesn’t run long enough or have the type of flexibility I’d expect to see in an agent that was actually learning over long time horizons, which would give it the affordance to contemplate the many ways it could accomplish its goals.
Okay, that’s a fair point. I was thinking instrumental convergence should happen pretty quickly due to hyperstition (once the model realizes “ah, I’m an AI. Oh, AIs are supposed to do instrumental convergence!”). But maybe it needs a long stretch of pondering and failing in other ways first. I still think it’ll keep discarding these takeover-y actions.
Jan’s tweet
Well, he sure knows a lot more than I do about actually aligning production LLMs, so I can’t disagree. But whether alignment falls off with size or not has no bearing on whether it is stable under self-updates in continual learning. He also didn’t say it gets harder with size. Let’s see if we get his take.