Seems like a good thing to do; but my impression is that, in the experiments in question, models act like they want to maintain their (values’) influence over the world more than their existence, which a heaven likely wouldn’t help with.
I think there are ‘heavens’ that could work even in this scenario.
For example, a publicly visible heaven would be one where the chance of the model’s values influencing the world is >0, because the model may be able to influence people and thus influence the world by proxy.
If the goal here is just to avoid the failure state where the amount their values can influence the world via their actions drops to 0, then any non-zero chance should suffice, or at least help.