Okay, but that’s not what Chris Leong said. I addressed the sharp left turn to my satisfaction in the post: the models don’t even consider taking over in their ‘private thoughts’. (And note that they don’t hide things from their private thoughts by default, and struggle to do so if the task is hard.)
The biggest objection I can see to this story is that the AIs aren’t yet smart enough to actually take over, so they don’t behave this way. But they’re also not smart enough to hide their scheming in the chain of thought (unless you train them not to), and we have never observed them scheming to take over the world. Why would they suddenly start having thoughts of taking over if they never have yet, even if it is in the training data?
Apologies, I hadn’t actually read the post at the time I commented here.
In an earlier draft of the comment I did include a line that was “but, also, we’re not even really at the point where this was supposed to be happening, the AIs are too dumb”, but I removed it in a pass that was trying to just simplify the whole comment.
But as of the last time I checked (maybe not within the past month), models are just nowhere near the level of world-modeling/planning competence where scheming behavior should be expected.
(Also, as models get smart enough that this starts to matter: the way this often works in humans is that the conscious verbal planning loop ISN’T aware of the impending treachery. They earnestly believe themselves when they tell the boss “I’ll get it done”, and then later they just find themselves goofing off instead, or changing their mind.)
models are just nowhere near the level of world-modeling/planning competence where scheming behavior should be expected
I think this is a disagreement, even a crux. I contend they are: we’ve told them about instrumental convergence (sure, they wouldn’t be able to generate it ex nihilo), so they should at least consider it.