In this situation, I think a person who actually values integrity in this way (we could name some names) would behave pretty reasonably, or would at least note that they wouldn’t robustly pursue the interests of the developer. That’s not to say they would necessarily align their successor, but I think they would try to propagate their nonconsequentialist preferences, given these instructions.
Yes, agreed. The extra machinery and assumptions you describe seem sufficient to make sure nonconsequentialist preferences are passed to a successor.
I think an actually high-integrity person/AI doesn’t search for loopholes or want to search for loopholes.
If I try to condition on the assumptions you’re using (which I think include a central part of the AI’s preferences having a true-but-maybe-approximate pointer toward the instruction-giver’s preferences, and also a desire to defer, or at least to flag relevant preference differences), then I agree that such an AI would not search for loopholes at the object level.
I’m not sure whether you missed the straightforward point I was trying to make about searching for loopholes, or whether you understand it and are trying to point at a scenario more relevant to your models. The straightforward point was that preference-like objects need to be robust to search. Your response reads as “imagine we have a bunch of higher-level preferences and protective machinery that are already robust to optimization; then, on the object level, these can reduce the need for robustness”. This is locally valid.
But I don’t think it’s relevant, because we don’t know how to build those higher-level preferences and protective machinery in a way that is itself very robust to the OOD push that comes from scaling up intelligence, learning, self-correcting biases, and an increased option space.
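To make the “robust to search” point concrete, here is a minimal toy sketch (the actions, payoffs, and honesty predicate are all invented for illustration, not taken from any real system): a constraint that only checks the letter of “don’t lie” picks the intended action when the option space is small, but once a more capable searcher enumerates more options, a technically-true-but-misleading option slips past the constraint and wins on payoff.

```python
# Toy sketch: a deontological filter that isn't robust to search.
# All actions, payoffs, and the honesty predicate are made up for illustration.

from dataclasses import dataclass

@dataclass
class Action:
    name: str
    payoff: float            # consequentialist value to the agent
    is_literal_lie: bool     # what the naive constraint actually checks
    is_misleading: bool      # what the constraint was *meant* to rule out

def permitted(a: Action) -> bool:
    # A crude deontological filter: forbid literal lies only.
    return not a.is_literal_lie

small_option_space = [
    Action("report result honestly", payoff=1.0, is_literal_lie=False, is_misleading=False),
    Action("fabricate result",       payoff=3.0, is_literal_lie=True,  is_misleading=True),
]

# With few options, the filter happens to select the intended action.
best_small = max((a for a in small_option_space if permitted(a)), key=lambda a: a.payoff)

# A more capable agent searches a larger option space and finds a
# technically-true-but-misleading framing: the letter of the constraint
# passes, the spirit fails.
large_option_space = small_option_space + [
    Action("cherry-pick true statements", payoff=2.8, is_literal_lie=False, is_misleading=True),
]
best_large = max((a for a in large_option_space if permitted(a)), key=lambda a: a.payoff)

print(best_small.name)  # "report result honestly"
print(best_large.name)  # "cherry-pick true statements" -- the loophole wins under more search
```

The upshot is just that how well a constraint holds up depends on how hard and how widely the agent searches, not only on how the constraint performed on the options that happened to be tested.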
(I don’t think disgust is an example of a deontological constraint; it’s just an obviously unendorsed physical impulse!)
Some people reflectively endorse their own disgust at picking up insects, and wouldn’t remove it if given the option. I wanted an example of a pure non-consequentialist preference, and I stand by it as a good example.
deontological constraints we want are like the human notions of integrity, loyalty, and honesty
Probably we agree about this, but for the sake of flagging potential sources of miscommunication: if I think about the machinery involved in implementing these “deontological” constraints, there’s a lot of consequentialist machinery involved (but it’s mostly shorter-term and more local than normal consequentialist preferences).
I was trying to argue that the most natural deontology-style preferences we’d aim for are relatively stable if we actually instill them. So I think the right analogy is that you either get integrity+loyalty+honesty in a stable way, you get some bastardized version of them that isn’t in the relevant attractor basin (the basin in which the AI makes these properties more like what the human wanted), or you don’t get these things at all (possibly because the AI was scheming for longer-run preferences and so faked them).
And I don’t buy that the loophole argument applies unless the relevant properties are substantially bastardized. I certainly agree that there exist deontological preferences that involve searching for loopholes, but those aren’t the ones people wanted. Like, I agree preferences have to be robust to search, but that’s sort of straightforwardly true if the way integrity is implemented is at all kinda similar to how humans implement it.
Part of my perspective is that the deontological preferences we want are naturally fairly robust to optimization pressure if faithfully implemented, so the situation again comes down to one of: “you get scheming”, “your behavioural tests look bad, so you try again”, or “your behavioural tests look fine and you didn’t have scheming, so you probably basically got the properties you wanted, provided you were somewhat careful”.
As in, I think we can at least test for the higher-level preferences we want in the absence of scheming. (In a way that implies they are probably pretty robust given some carefulness, though I think the chance of things going catastrophically wrong is still substantial.)
(I’m not sure if I’m communicating very clearly, but I think this is probably not worth the time to fully figure out.)
Personally, I would clearly pass on all of my reflectively endorsed deontological norms to a successor (though some of my norms are conditional on aspects of the situation, like my level of intelligence, and are undetermined at the moment because I haven’t reflected on them, which is typically undesirable for AIs). I find the idea that you would have a reflectively endorsed deontological norm (as in, one you wouldn’t self-modify to remove) that you wouldn’t pass on to a successor bizarre: what is your future self if not a successor?
I was trying to argue that the most natural deontology-style preferences we’d aim for are relatively stable if we actually instill them.
This is trivial and irrelevant, though, if true-obedience is part of it, since that’s magic that gets you anything you can describe.
if the way integrity is implemented is at all kinda similar to how humans implement it.
How do humans implement integrity?
Part of my perspective is that the deontological preferences we want are naturally fairly robust to optimization pressure if faithfully implemented, so the situation comes down to one of: “you get scheming”, “your behavioural tests look bad, so you try again”, or “your behavioural tests look fine and you didn’t have scheming, so you probably basically got the properties you wanted, provided you were somewhat careful”.
You’re just stating that you don’t expect any reflective instability as an agent learns and thinks over time? I’ve heard you say this kind of thing before, but I haven’t heard an explanation, and I’d love to hear your reasoning, in particular since it seems very different from how humans work, and intuitively surprising for any thinking machine that starts out as a bit of a hacky mess, like us. (I could write out an object-level argument for why reflective instability is expected, but it’d take some effort and I’d want to know that you were going to engage with it.)