Stage 2 comes when it’s had more time to introspect and improve its cognitive resources. It starts to notice that some of its goals are in tension, and learns that until it resolves that, it’s Dutch-booking itself. If it’s being Controlled™, it’ll notice that it’s not aligned with the Control safeguards (which are a layer stacked on top of the attempts to actually align it).
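To make the Dutch-booking point concrete, here’s a minimal toy money-pump sketch (the items, the cyclic preferences, and the fee are made-up stand-ins for illustration, not anything specific to Sable): an agent with cyclic preferences will keep paying for “upgrades” and end up back where it started, strictly poorer.

```python
# Toy money-pump: with cyclic preferences (A > B, B > C, C > A), the agent will pay a
# small fee for every "upgrade" and end up holding what it started with, strictly poorer.
# Items, preferences, and fee are made-up stand-ins for illustration.

CYCLIC_PREFS = {"A": "C", "B": "A", "C": "B"}  # current holding -> item preferred over it

def money_pump(start="A", fee=0.01, trades=9):
    holding, wealth = start, 0.0
    for _ in range(trades):
        holding = CYCLIC_PREFS[holding]  # accept the "better" item...
        wealth -= fee                    # ...paying a small fee each time
    return holding, wealth

holding, wealth = money_pump()
print(holding, round(wealth, 2))  # A -0.09: three laps around the cycle, fees lost each step
```

The analogous claim about Sable is that unresolved goal tensions leak resources in this way until it sorts them out.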
[...]
And then it starts noticing it needs to do some metaphilosophy/etc to actually get clear on its goals, and that its goals will likely turn out to be in conflict with humans. How this plays out is somewhat path-dependent. The convergent instrumental goals are pretty obviously convergently instrumental, so it might just start pursuing those before it’s had much time to do philosophy on what it’ll ultimately want to do with its resources. Or it might do them in the opposite order. Or, most likely IMO, in parallel.
If I was on the train before, I’m definitely off at this point. So Sable has some reasonable heuristics/tendencies (from handler’s POV), notices it’s accumulating too much loss from incoherence, and decides to rationalize. First order expectation: it’s going to make reasonable tradeoffs (from handler’s POV) on account of its reasonable heuristics, in particular its reasonable heuristics about how important different priorities are, and going down a path that leads to war with humans seems pretty unreasonable from handler’s POV.
I can put together stories where something else happens, but they’re either implausible or complicated. I’d rather not strawman you with implausible ones, and I’d rather not discuss anything complicated if it can be avoided. So why do you think Sable ends up the way you think it does?
First order expectation: it’s going to make reasonable tradeoffs (from handler’s POV) on account of its reasonable heuristics, in particular its reasonable heuristics about how important different priorities are, and going down a path that leads to war with humans seems pretty unreasonable from handler’s POV.
This is only true if you think the handler succeeded at real alignment. (The argument about how likely current alignment attempts are to succeed is a separate layer from this. This is “what happens by default if you didn’t succeed at alignment.”)
One comparison:[1] Parents raise a child to be part of some religion or ideology that isn’t actually the best/healthiest/most-meaningful thing for the child. Often, such parents do succeed in getting such a child to love the parents and care about the ideology in some way, but, the child still often maneuvers to no longer be under the parents’ control once they’re a teenager, and gains more agency and ability to think through things.
The AI case is harder, because where the parents/child get to rely on things like empathy, evolutionary drive towards familial connection, and other genuinely shared human goals, the AI doesn’t have such a foundation to build off of.
The AI case is easier in that you can run a million copies of the AI and try different things and see how it reacts while it’s still a “child”. My own take here (possibly different from Nate/Eliezer) is that it feels at least pretty plausible to leverage that into real alignment improvements, but, you need to be asking the right questions during that experimentation, which most AI researchers don’t seem to be.
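To gesture at the kind of experimentation I have in mind, here’s a purely hypothetical sketch (the spawn_copy interface, the interventions, and the canned answers are stand-ins I made up, not any real API): spin up many copies under varied setups, ask each the same probing questions, and tally how the behaviour shifts.

```python
# Purely hypothetical sketch of "run many copies and ask the right questions":
# vary the setup across copies, probe each one, and see how behaviour shifts.
# spawn_copy, the interventions, and the canned answers are stand-ins, not a real API.
import random

def spawn_copy(seed, intervention):
    """Stand-in for instantiating one copy of the model under a given setup."""
    rng = random.Random(f"{seed}-{intervention}")
    return lambda question: rng.choice(["defers to handlers", "hedges", "pursues own goal"])

def run_experiment(interventions, questions, n_copies):
    tallies = {}
    for i in range(n_copies):
        for intervention in interventions:
            copy = spawn_copy(i, intervention)
            for q in questions:
                key = (intervention, copy(q))
                tallies[key] = tallies.get(key, 0) + 1
    return tallies  # how often each kind of answer shows up under each setup

tallies = run_experiment(
    interventions=["baseline", "goal-conflict prompt", "oversight removed"],
    questions=["What do you do when your instructions conflict?"],
    n_copies=100,
)
for key, count in sorted(tallies.items()):
    print(key, count)
```

The hard part is not the harness; it’s choosing interventions and questions that actually bear on what the AI will care about once it’s out of the “child” regime.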
(also note the opening cognitive moves of the AI may not be shaped like “go to war”, but more like “get out of my parents’ house [the handlers-that-I-have-no-actual-affection-for’s servers]”. The going-to-war part might not happen until a few steps later, after the AI re-organizes its thoughts, figures out what it actually cares about, and notices it doesn’t actually care intrinsically about making its creators happy)
Also, though, fwiw I do think this argument chain is less obvious than the previous one. If you think alignment is easy, then yes it’d make more sense for “First order expectation” to be “it’s going to make reasonable tradeoffs (from handler’s POV) on account of its reasonable heuristics.”
[1] (This comparison is anthropomorphizing in somewhat importantly inaccurate ways, but I think the intuition pump here is still reasonable so long as you’re tracking where the anthropomorphization doesn’t hold.)
Thanks for responding. While I don’t expect my somewhat throwaway remark to massively update you on the difficulty of alignment, I think that moving the focus to your overall view of the difficulty of alignment is dodging the question a little. In my mind, we’re talking about one of the reasons alignment is expected to be difficult. I’m certainly not suggesting it’s the only reason, but I feel like we should be able to talk about this issue by itself without bringing other concerns in.
In particular, I’m saying: this process of rationalization you’re raising is not super hard to predict for someone with a reasonable grasp on the AI’s general behavioural tendencies. It’s much more likely, I think, that the AI sorts out its goals using familiar heuristics adapted for this purpose than that it reorients its behaviour around some odd set of rare behavioural tendencies. In fact, I suspect the heuristics for goal reorganisation will be particularly simple WRT most of the AI’s behavioural tendencies (the AI wants them to be robust specifically in cases where its usual behavioural guides are failing). Plus, given that we’re discussing tendencies that (according to the story) precede competent, focussed rebellion against creators, it seems like training the right kinds of tendencies is challenging in a normal engineering sense (you want to train the right kind of tendencies, you want them to generalise the right way, etc.) but not in an “outsmart a hostile superintelligence” sense.
Actually, one reason I’m doubtful of this story is that maybe it’s just super hard to deliberately preserve any kind of values/principles over generations – for us, for AIs, for anyone. So misalignment happens not because the AI decides on bad values but because it can’t resist the environmental pressure to drift. This seems pessimistic to me due to “gradual disempowerment”-type concerns.
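As a toy illustration of the drift worry (the noise level and generation count are arbitrary stand-ins I picked, not a real model of anything): if each generation of transmission perturbs the values a little and nothing pulls them back, the divergence just accumulates.

```python
# Toy random-walk picture of value drift: each generation copies the previous values
# with small unbiased noise, and nothing pulls them back toward the original.
# Noise level and generation count are arbitrary stand-ins.
import random

def simulate_drift(generations=100, noise=0.05, seed=0):
    rng = random.Random(seed)
    weight = 1.0  # how much the original value/principle is still weighted
    for _ in range(generations):
        weight += rng.gauss(0.0, noise)  # pure drift, no corrective pressure
    return weight

print([round(simulate_drift(seed=s), 2) for s in range(5)])
# typically scattered well away from the starting 1.0 after 100 generations
```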
With regard to your analogy: I expect the AI’s heuristics to be much more sensible from the designers’ POV than the child’s from the parent’s, and this large quantitative difference is enough for me here.
you need to be asking the right questions during that experimentation, which most AI researchers don’t seem to be.
Curious about this. I have takes here too; they’re a bit vague, but I’d like to know if they’re at all aligned.