“It’s really difficult to get AIs to be dishonest or evil by prompting, you have to fine-tune them.”
This is much less of a killer argument if we expect increasing optimisation power to be applied over time.
When ChatGPT came out I was surprised by how aligned the model was relative to its general capabilities. This was definitely a significant update compared to what I expected from older AI arguments (say, the classic story about a robot getting a coffee and pushing a kid out of the way).
However, what I didn’t realise at the time was that the main reason we weren’t seeing misbehaviour was a lack of optimisation power. Whilst it may have seemed that you’d be able to do a lot with GPT-4-level agents in a loop, this mostly just resulted in them going around in circles. From casual use, these models seemed a lot better at optimising than they actually were, because optimising over time required a degree of coherence that these agents lacked.
Once we started applying more optimisation power and reached the o1 series of models, we started seeing misbehaviour a lot more. Just to be clear, what we’re seeing isn’t quite a direct instantiation of the old instrumental convergence arguments. Instead, what we’re seeing is surviving[1] unwanted tendencies from the pretraining distribution being differentially selected for. In other words, it’s more of a combination of the pretraining distribution and optimisation power, as opposed to the old instrumental convergence arguments, which were based on an idealisation of a perfect optimiser.
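To make the “differential selection” point concrete, here is a minimal toy sketch (my own illustration with made-up numbers, not a claim about any real training pipeline): suppose each sampled trajectory independently exhibits an unwanted tendency with propensity p, and the optimisation step effectively favours that tendency whenever it shows up among N samples. The chance of it surfacing is then 1 - (1 - p)^N, which keeps climbing with N even as p falls.

```python
# Toy model (illustrative only): how selection over many samples can surface
# a rare unwanted tendency. Assumes each trajectory independently exhibits the
# tendency with propensity p, and that selection favours it whenever it appears.
# Both assumptions are deliberate idealisations for the sake of the sketch.

def surfacing_probability(p: float, n: int) -> float:
    """P(at least one of n independent samples shows the tendency)."""
    return 1.0 - (1.0 - p) ** n

for p, n in [(0.01, 10), (0.01, 1000), (0.001, 10), (0.001, 1000)]:
    print(f"propensity={p:<5} samples={n:<4} -> surfaces with prob {surfacing_probability(p, n):.3f}")

# Cutting the propensity tenfold (0.01 -> 0.001) still leaves roughly a 63%
# chance of surfacing at 1000 samples, versus roughly 10% at 10 samples with
# the higher propensity.
```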
However, as we increase the amount of optimisation power, we should expect the instrumental convergence arguments to mean that unwanted behaviour can still be brought to the surface, even with lower and lower propensities of the unamplified model to act in that way. Maybe we can reduce these propensities faster than the increasing optimisation power brings them out (and indeed safety teams are attempting to achieve this), but that remains to be seen, and the amount of money and talent being directed into optimisation is much greater than the amount being directed into safety.

[1] In particular, tendencies not removed by RLHF. RLHF is surprisingly effective, but it doesn’t remove everything.
as we increase the amount of optimisation power, we should expect the instrumental convergence arguments to mean that unwanted behaviour can still be brought to the surface even with lower and lower propensities for the model to act in that way
We have increased optimization power since o1 and the misaligned behavior has decreased. This argument’s predictions are empirically false. Unless you’re positing a non-smooth model where we’re keeping them at bay now but they’ll increase later on?
I think you missed: “Maybe we can reduce these propensities faster than the increasing optimisation power brings them out”
Regarding: “Unless you’re positing a non-smooth model”
Why would the model be smooth when we’re making all kinds of changes to how the models are trained and how we elicit them? As an analogy, even if I were bullish on Nvidia stock prices over the long term, that doesn’t mean that even a major crash would necessarily falsify my prediction, as the price could still recover.
My main disagreement is that I feel your certainty outstrips the strength of your arguments.
Unless you’re positing a non-smooth model where we’re keeping them at bay now but they’ll increase later on?
This is what the “alignment is hard” people have been saying for a long time. (Some search terms here include “treacherous turn” and “sharp left turn”)
Okay, but that’s not what Chris Leong said. I addressed the sharp left turn to my satisfaction in the post: the models don’t even consider taking over in their ‘private thoughts’. (And note that they don’t hide things from the private thoughts by default, and struggle to do so if the task is hard.)
The biggest objection I can see to this story is that the AIs aren’t smart enough yet to actually take over, so they don’t behave this way. But they’re also not smart enough to hide their scheming in the chain of thought (unless you train them not to) and we have never observed them scheming to take over the world. Why would they suddenly start having thoughts of taking over, if they never have yet, even if it is in the training data?
Apologies, I hadn’t actually read the post at the time I commented here.
In an earlier draft of the comment I did include a line that was “but, also, we’re not even really at the point where this was supposed to be happening, the AIs are too dumb”, but I removed it in a pass that was trying to simplify the whole comment.
But as of the last time I checked (maybe not in the past month), models are just nowhere near the level of worldmodeling/planning competence where scheming behavior should be expected.
(Also, as models get smart enough that this starts to matter: the way this often works in humans is that a human’s conscious verbal planning loop ISN’T aware of their impending treachery; they earnestly believe themselves when they tell the boss “I’ll get it done”, and then later they just find themselves goofing off instead, or changing their mind.)
models are just nowhere near the level of worldmodeling/planning competence where scheming behavior should be expected
I think this is a disagreement, even a crux. I contend they are because we’ve told them about instrumental convergence (they wouldn’t be able to generate it ex nihilo, sure) and they should at least consider it.