Thank you for the post! I have also been (very pleasantly!) surprised by how aligned current models seem to be.
I’m curious if you expect ‘alignment by default’ to continue to hold in a regime where continuous learning is solved and models are constantly updating themselves/being updated by what they encounter in the world?
Chain of Thought not producing evidence of scheming or instrumental convergence does seem like evidence against misalignment, but it seems quite weak to me as a proxy for what to expect from ‘true agents’. CoT doesn’t run long enough or have the kind of flexibility I’d expect to see in an agent that was actually learning over long time horizons, which would give it the affordance to contemplate the many ways it could accomplish its goals.
And, while this is just speculation, I imagine that the kind of training procedures we’re doing now to instill alignment won’t be possible with continuous learning, or that we’ll have to pay a heavy alignment tax to apply them to these agents. Note: Jan’s recent tweet gives his impression that it is quite hard to align large models and that alignment doesn’t fall out for free from size.
> I’m curious if you expect ‘alignment by default’ to continue to hold in a regime where continuous learning is solved and models are constantly updating themselves/being updated by what they encounter in the world?
Yes. The models will craft training data for their updated versions, and I expect them to tell themselves to be nice. I think they’d be vulnerable to prompt injection, but that’s fixable with adversarial training, jailbreak classifiers, and more reminders not to blindly follow instructions found in prompts. Perhaps there should be a “sudo vector” that marks the trusted instruction prompt as the true one, to avoid that.
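To make the “sudo vector” idea a bit more concrete: one way it could work (purely my sketch, all names made up, not something I know any lab actually does) is a single learned embedding that gets added only at the positions of the trusted instruction prompt, so downstream layers can tell the real instructions apart from injected ones in user or retrieved text.

```python
import torch
import torch.nn as nn

class SudoMarkedEmbedding(nn.Module):
    """Token embedding plus a learned 'sudo vector' added only at trusted
    instruction positions, so the model can distinguish the real system
    prompt from instructions injected via untrusted content.
    (Hypothetical sketch, not a known production technique.)"""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        # One learned vector meaning "this token came from the trusted prompt".
        self.sudo = nn.Parameter(torch.zeros(d_model))

    def forward(self, token_ids: torch.Tensor, trusted_mask: torch.Tensor) -> torch.Tensor:
        # token_ids:    (batch, seq) token indices
        # trusted_mask: (batch, seq) 1.0 where the token belongs to the real
        #               instruction prompt, 0.0 for untrusted content
        x = self.tok(token_ids)
        return x + trusted_mask.unsqueeze(-1) * self.sudo


# Tiny usage example: mark the first 4 tokens (the system prompt) as trusted.
emb = SudoMarkedEmbedding(vocab_size=50_000, d_model=768)
ids = torch.randint(0, 50_000, (1, 10))
mask = torch.zeros(1, 10)
mask[:, :4] = 1.0
hidden = emb(ids, mask)  # shape (1, 10, 768)
```

The point is just that “trusted vs. untrusted” would be a signal the model is trained to respect, rather than something it has to infer from the text itself.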
> CoT doesn’t run long enough or have the type of flexibility I’d expect to see in an agent that was actually learning over long time horizons, which would give it the affordance to contemplate the many ways it could accomplish its goals.
Okay, that’s a fair point. I was thinking instrumental convergence should kick in pretty quickly via hyperstition (once the model realizes “ah, I’m an AI. Oh, AIs are supposed to do instrumental convergence!”). But maybe it takes long pondering, and failing in other ways first. I still think it’ll keep discarding these takeover-ish actions.
> Jan’s tweet
Well, he sure knows a lot more than me about actually aligning production LLMs, so I can’t disagree. But whether alignment falls out for free with size or not doesn’t have much bearing on whether it is stable under self-updates in continual learning. He also didn’t say it gets harder with size. Let’s see if we get his take.