These all seem like reasonable reasons to doubt the hypothesized mechanism, yup. I think you’re underestimating how much can happen in a single forward pass, though. It has to be somewhat shallow, so it can’t involve too many variables, but the whole point of making the networks as large as we do these days is that it turns out an awful lot can happen in parallel. I also think there would be no reason for deception to occur if it’s never a good weight pattern to use to predict the data; this could only occur if the data contains a pattern that the gradient will put into a deceptive forward mechanism. For example, it could happen if the model is trained on a bunch of humans being deceptive about their political intentions, and then RLHF is attempted.
In any case, I don’t think the old Yudkowsky model of deceptive alignment is relevant, in that I think the level of deception to expect from AI should be calibrated to around the amount you’d expect from a young human, not some super schemer god. The concern arises only when the data actually contains patterns well modeled by deception, and I’d expect that to be more common in something like an engagement-maximizing, online-learning RL system.
And to be clear, I don’t expect the things that can destroy humanity to arise because of deception directly. It seems much more likely to me that they’ll arise because competing people ask their models to do things that put those models in competition in a way that puts humanity at risk, e.g. several different powerful, competing, model-based engagement/sales-optimizing reinforcement learners, or more speculatively something military. That is, something where the core problem is that alignment tech is effectively not used, and where solving this deception problem wouldn’t have saved us anyway.
Regarding the details of your descriptions: I mainly think this sort of deception would arise in the wild when there’s a reward model passing gradients to multiple steps of a sequential model, or possibly through the “imitating humans locally” thing. Without a reward model, nothing pushes the different steps of the sequential model towards trying to achieve the “same thing” across steps in any significant sense. But of course, almost all the really useful models involve a reward model somehow.
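
To make the gradient point a bit more concrete, here is a toy sketch, not a model of any real training setup. Everything in it is an assumption I'm making up for illustration: the additive dynamics (the state is just the running sum of per-step actions), the squared-error losses, and the function names. It compares a per-step imitation loss, where the gradient on each step depends only on that step, with a reward computed on the final state, where every step gets the same gradient and that gradient depends on what all the other steps did.

```python
# Toy illustration only: the dynamics, losses, and names here are
# assumptions made up for this sketch, not any real training setup.

def imitation_grads(actions, targets):
    """Gradient of sum_t (a_t - y_t)^2 with respect to each action a_t.

    Each entry depends only on its own step, so steps are not coupled.
    """
    return [2.0 * (a - y) for a, y in zip(actions, targets)]


def terminal_reward_grads(actions, goal, start=0.0):
    """Gradient of R = -(s_T - goal)^2 with respect to each action a_t,
    where s_T = start + sum(actions).

    Every step receives the same signal, and it depends on all steps.
    """
    final_state = start + sum(actions)
    shared = -2.0 * (final_state - goal)
    return [shared for _ in actions]


def demo():
    targets = [1.0, 1.0, 1.0]   # hypothetical per-step imitation targets
    goal = 3.0                  # hypothetical target for the final state
    before = [1.0, 0.0, 0.0]
    after = [2.0, 0.0, 0.0]     # only the action at step 0 changes
    return (
        imitation_grads(before, targets),
        imitation_grads(after, targets),
        terminal_reward_grads(before, goal),
        terminal_reward_grads(after, goal),
    )


if __name__ == "__main__":
    im_before, im_after, rw_before, rw_after = demo()
    print("imitation, step 2 gradient:", im_before[2], "->", im_after[2])
    print("terminal reward, step 2 gradient:", rw_before[2], "->", rw_after[2])
```

In this toy, changing what step 0 does leaves the imitation gradient at step 2 untouched, while the terminal-reward gradient at step 2 shifts. That is the narrow sense in which a reward signal ties the steps to the “same thing”; whether that carries over to real sequential models is the part I'm less sure of.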