I was trying to argue that the most natural deontology-style preferences we’d aim for are relatively stable if we actually instill them. So, I think the right analogy is that you either get integrity+loyalty+honesty in a stable way, get some bastardized version of them that isn’t in the relevant attractor basin (the basin where the AI makes these properties more like what the human wanted), or you don’t get these things at all (possibly because the AI was scheming for longer-run preferences and so it faked these properties).
And I don’t buy that the loophole argument applies unless the relevant properties are substantially bastardized. I certainly agree that there exist deontological preferences that involve searching for loopholes, but those aren’t the ones people wanted. Like, I agree preferences have to be robust to search, but this is sort of straightforwardly true if the way integrity is implemented is at all kinda similar to how humans implement it.
Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented. So, from my perspective, the situation again comes down to: “you get scheming”, “your behavioural tests look bad, so you try again”, or “your behavioural tests look fine, and you didn’t have scheming, so you probably basically got the properties you wanted if you were somewhat careful”.
As in, I think we can at least test for the higher-level preferences we want in the absence of scheming. (In a way that implies they are probably pretty robust given some carefulness, though I think the chance of things going catastrophically wrong is still substantial.)
(I’m not sure if I’m communicating very clearly, but I think this is probably not worth the time to fully figure out.)
Personally, I would clearly pass on all of my reflectively endorsed deontological norms to a successor (though some of my norms are conditional on aspects of the situation, like my level of intelligence, and some are currently undetermined because I haven’t reflected on them, which is typically undesirable for AIs). I find the idea that you would have a reflectively endorsed deontological norm (as in, one you wouldn’t self-modify to remove) that you wouldn’t pass on to a successor bizarre: what is your future self if not a successor?
I was trying to argue that the most natural deontology-style preferences we’d aim for are relatively stable if we actually instill them.
Trivial and irrelevant, though, if true-obedience is part of it, since that’s magic that gets you anything you can describe.
if the way integrity is implemented is at all kinda similar to how humans implement it.
How do humans implement integrity?
Part of my perspective is that the deontological preferences we want are relatively naturally robust to optimization pressure if faithfully implemented. So, from my perspective, the situation again comes down to: “you get scheming”, “your behavioural tests look bad, so you try again”, or “your behavioural tests look fine, and you didn’t have scheming, so you probably basically got the properties you wanted if you were somewhat careful”.
You’re just stating that you don’t expect any reflective instability as an agent learns and thinks over time? I’ve heard you say this kind of thing before, but I haven’t heard an explanation. I’d love to hear your reasoning, in particular since it seems very different from how humans work, and intuitively surprising for any thinking machine that starts out a bit of a hacky mess like us. (I could write out an object-level argument for why reflective instability is expected, but it’d take some effort and I’d want to know that you were going to engage with it.)