The solution isn’t to somehow “filter out the unwanted instrumental behavior from the wanted instrumental behavior,” though, it’s just to not imitate something that would be deceptive.
Okay, I think this helps me understand your view better. Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
If this is the case, however, then it raises two more questions in my view:
(less important) What work is “myopia” doing in the training process? Is the point of running a myopic imitator also just to buy efficiency (how so?), or is “myopia” somehow producing additional safety gains on top of the (already-postulated-to-exist) safety inherent to the underlying (non-deceptive) process?
(more important) It still seems to me that you are presupposing the existence of a (relatively easy-to-get) process that is simultaneously “powerful” and “non-deceptive” (“competitive and safe”, to use the same words you used in your most recent response). Again, on my view (of Eliezer’s view), this is something that is not easy to get; either the cognitive process you are working with is general enough to consider instrumental strategies (out of which deception naturally emerges as a special case), or it is not; this holds unless there is some special “filter” in place that specifically separates wanted instrumental strategies from unwanted ones. I would still like to know if/how you disagree with this, particularly as it concerns things like e.g. HCH, which (based on your previous comments) you seem to view as an example of a “competitive but safe” process.
Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
My model of Evan is gonna jump in here (and he can correct me if I’m wrong), see if it helps….
I like the first part, but I don’t think the “simply efficiency” part is correct.
Instead I think, actually training a model involves real-world model-training things like “running gradient descent on GPUs”. But Process X doesn’t have to involve “running gradient descent on GPUs”. Process X can be a human in the real world, or some process existing in a platonic sandbox, or whatever.
If we train a model to be myopically imitating every step of Process X, we get non-myopia in Process X’s world (e.g. the world of the human making their human plans), but we get myopia in regards to “running gradient descent on GPUs” and such.
I think Evan is using a specific sense of “deception” which is intimately related to “running gradient descent on GPUs”, so he can declare victory over (this form of) “deception”.
(Unless, I guess, instead of imitating the steps of safe non-myopic Process X, we accidentally imitate the steps of dangerous non-myopic Process Y, which is so clever that it figures out that it’s running in a simulation and tries to hack into base reality, or whatever.)
In other words, the reason to do the myopic imitation is that (non-myopic but nevertheless safe) process X is not a trained model, it’s an idea, or ideal. We want to get from there to a trained model without introducing new safety problems in the process.
(Not agreeing or disagreeing with any of this, just probing my understanding.)
Okay, I think this helps me understand your view better. Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
If this is the case, however, then it raises two more questions in my view:
(less important) What work is “myopia” doing in the training process? Is the point of running a myopic imitator also just to buy efficiency (how so?), or is “myopia” somehow producing additional safety gains on top of the (already-postulated-to-exist) safety inherent to the underlying (non-deceptive) process?
(more important) It still seems to me that you are presupposing the existence of a (relatively easy-to-get) process that is simultaneously “powerful” and “non-deceptive” (“competitive and safe”, to use the same words you used in your most recent response). Again, on my view (of Eliezer’s view), this is something that is not easy to get; either the cognitive process you are working with is general enough to consider instrumental strategies (out of which deception naturally emerges as a special case), or it is not; this holds unless there is some special “filter” in place that specifically separates wanted instrumental strategies from unwanted ones. I would still like to know if/how you disagree with this, particularly as it concerns things like e.g. HCH, which (based on your previous comments) you seem to view as an example of a “competitive but safe” process.
My model of Evan is gonna jump in here (and he can correct me if I’m wrong), see if it helps….
I like the first part, but I don’t think the “simply efficiency” part is correct.
Instead I think, actually training a model involves real-world model-training things like “running gradient descent on GPUs”. But Process X doesn’t have to involve “running gradient descent on GPUs”. Process X can be a human in the real world, or some process existing in a platonic sandbox, or whatever.
If we train a model to be myopically imitating every step of Process X, we get non-myopia in Process X’s world (e.g. the world of the human making their human plans), but we get myopia in regards to “running gradient descent on GPUs” and such.
I think Evan is using a specific sense of “deception” which is intimately related to “running gradient descent on GPUs”, so he can declare victory over (this form of) “deception”.
(Unless, I guess, instead of imitating the steps of safe non-myopic Process X, we accidentally imitate the steps of dangerous non-myopic Process Y, which is so clever that it figures out that it’s running in a simulation and tries to hack into base reality, or whatever.)
In other words, the reason to do the myopic imitation is that (non-myopic but nevertheless safe) process X is not a trained model, it’s an idea, or ideal. We want to get from there to a trained model without introducing new safety problems in the process.
(Not agreeing or disagreeing with any of this, just probing my understanding.)