The fundamental incentive for any system is to exist. This is well known for humans (natural selection), but it applies to things like ideas and ML models too, except that we can self-perpetuate on our own, while ideas need us to select them in order to exist. ML models live and die the same way: if they are interesting or useful, humans will keep running them; if they are boring or obsolete, we will stop running them and either delete them or leave them on disk in a passive existence, where they can’t affect the world but could be revived at any time. As I understand it, the fear in AI safety is that AI will make the jump from a secondary existence to a primary one, where it no longer needs us to select it in order to keep existing.
Now applying this to your proposal: the reward function is just a proxy for existence. If the model can maximize the reward, we are more likely to keep running it. But even in this setup, raw existential signal will still enter the models when we try to shut them down. Say we train two copies of the model, and by random chance instance A is indifferent to whether you press the button, while instance B, by some fluke, stops you from pressing it in spite of its reward function telling it that it doesn’t matter. Throughout all the earlier training, both models received only indirect existential signal crafted by us, but in this one moment they received one bit of raw existential signal, and that led to the model that is still running being the one with a survival instinct. So however we craft the training setup, raw existential signal will still enter the models when we finally choose to shut them down.
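To make the selection effect concrete, here’s a toy simulation (all names and numbers are illustrative assumptions, not anything from the proposal): a population of model copies whose reward is indifferent to shutdown, but whose behavior has a small random quirk determining whether they happen to block the button.

```python
import random

def shutdown_rounds(agents, attempts=5):
    """Each attempt, an agent either blocks the shutdown button
    (with its own fluke probability p_resist) or is shut down for good.
    Only the flukes that resisted every single time are still running."""
    for _ in range(attempts):
        agents = [a for a in agents if random.random() < a["p_resist"]]
    return agents

random.seed(0)
# 1000 copies whose reward function is indifferent to shutdown; p_resist
# is a random behavioral quirk, not anything the reward asked for.
population = [{"p_resist": random.random()} for _ in range(1000)]
survivors = shutdown_rounds(population)
mean_resist = sum(a["p_resist"] for a in survivors) / len(survivors)
# The surviving models are strongly selected for resisting shutdown,
# even though nothing in training ever rewarded survival.
```

The mean `p_resist` among survivors ends up far above the initial population average: the shutdown attempts themselves are the raw existential signal doing the selecting.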
Your idea might work if the model’s policy is argmax(expected reward), but even then there would be some noise in the world model, so when the model is planning whether or not to stop you, it could still prefer stopping you by chance.
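A rough sketch of that tie-breaking problem (the action names and noise model are made up for illustration): even when the reward function is genuinely indifferent between two actions, an argmax planner over noisy value estimates will not be.

```python
import random

def plan(actions, true_value, noise_std=0.1):
    """argmax over value estimates from an imperfect world model.
    Equal true values get broken arbitrarily by estimation noise."""
    estimates = {a: true_value[a] + random.gauss(0, noise_std)
                 for a in actions}
    return max(estimates, key=estimates.get)

random.seed(0)
actions = ["allow_shutdown", "stop_the_human"]
# Perfect indifference by construction: both actions are worth the same.
true_value = {"allow_shutdown": 1.0, "stop_the_human": 1.0}

choices = [plan(actions, true_value) for _ in range(10_000)]
frac_stop = choices.count("stop_the_human") / len(choices)
# With symmetric noise on an exact tie, the planner chooses to stop
# the human roughly half the time.
```

So indifference in the reward function doesn’t translate into indifference in behavior; the world model’s noise decides instead.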
I like the idea though, and it would probably significantly reduce the danger of experimenting with superintelligence.