Specifically, my initial characterization of your proposals as “Imitate a (non-myopic, potentially unsafe) process X” should be amended to “Imitate a (non-myopic, but nonetheless safe) process X,” where the reason to do the imitation isn’t necessarily to buy anything extra in terms of safety, but simply efficiency.
My model of Evan is gonna jump in here (and he can correct me if I’m wrong), see if it helps….
I like the first part, but I don’t think the “simply efficiency” part is correct.
Instead, I think it's this: actually training a model involves real-world model-training things like "running gradient descent on GPUs". But Process X doesn't have to involve "running gradient descent on GPUs". Process X can be a human in the real world, or some process existing in a platonic sandbox, or whatever.
If we train a model to myopically imitate every step of Process X, we get non-myopia in Process X's world (e.g. the world of the human making their human plans), but we get myopia with regard to "running gradient descent on GPUs" and such.
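(To make "myopically imitating every step" concrete, here's a toy sketch of the kind of thing I have in mind; the setup, names, and loss are my own illustrative assumptions, not anything Evan has specified. The model is trained with a per-step imitation loss on Process X's recorded actions, and no term in the objective depends on later steps, later episodes, or the training process itself.)

```python
# Illustrative toy sketch (my own example): "myopically imitating every step
# of Process X" as per-step behavioral cloning. The loss at step t depends
# only on Process X's action at step t -- nothing rewards the model for
# influencing later steps or the training process ("gradient descent on GPUs").
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                      # stand-in policy: state -> action logits
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train_on_trajectory(states, actions_of_process_x):
    """states: (T, 16) float tensor; actions_of_process_x: (T,) long tensor of action ids."""
    for t in range(states.shape[0]):
        opt.zero_grad()
        logits = model(states[t : t + 1])
        # Myopic objective: match Process X's step-t action, and nothing else.
        loss = loss_fn(logits, actions_of_process_x[t : t + 1])
        loss.backward()
        opt.step()
```

Any non-myopia "lives" in Process X's own plans, which are encoded in the action sequence being imitated, not in the objective the model is optimized against.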
I think Evan is using a specific sense of “deception” which is intimately related to “running gradient descent on GPUs”, so he can declare victory over (this form of) “deception”.
(Unless, I guess, instead of imitating the steps of safe non-myopic Process X, we accidentally imitate the steps of dangerous non-myopic Process Y, which is so clever that it figures out that it’s running in a simulation and tries to hack into base reality, or whatever.)
In other words, the reason to do the myopic imitation is that (non-myopic but nevertheless safe) process X is not a trained model, it’s an idea, or ideal. We want to get from there to a trained model without introducing new safety problems in the process.
(Not agreeing or disagreeing with any of this, just probing my understanding.)