A similar objection is that you might accidentally define the utility function and time limit in such a way that the AI assigns positive probability to the hypothesis that it can later build a time machine, go back, and improve its past utility. Then once the deadline has passed, it will desperately try to invent a time machine, even if it thinks success is extremely unlikely. (This is using Bostrom's way of thinking; shard theory would not predict this.)
I disagree, for two reasons:
The $\tau_1 \cdot (\hat{R}_1 - \underline{R}_1)$ bound on how much there is to gain from creating a time machine and improving past utility is outweighed by the $\tau_1 \cdot (\hat{R}_1 - \underline{R}_1) \cdot C$ reward from $R_2$ for shutting down.
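Spelling out the comparison (my addition; it assumes $C > 1$, which the outweighing claim requires):

$$\tau_1 \cdot (\hat{R}_1 - \underline{R}_1) \;<\; \tau_1 \cdot (\hat{R}_1 - \underline{R}_1) \cdot C \qquad \text{whenever } C > 1,$$

so even under the maximally optimistic assumption that time travel lets the agent lift every pre-deadline reward from its floor $\underline{R}_1$ to its ceiling $\hat{R}_1$, the gain is strictly smaller than the $R_2$ shutdown bonus.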
Every RL algorithm I’ve heard of implicitly bakes in an assumption that past utility is unmodifiable. I guess all bets are off with mesa-optimisers, but personally I’d bet against even mesa-optimisers in model-free RL behaving as if past utility is up for grabs.
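As a concrete illustration (a minimal sketch of my own, not from the post): in a standard tabular Q-learning update, the past reward enters the target as a fixed scalar, so there is no term through which "improving past utility" could raise the objective.

```python
import numpy as np

# Minimal tabular Q-learning update (hypothetical toy example).
# The point: the observed reward r is a fixed constant from a past
# transition; nothing in the update gives the agent any handle on
# rewards it has already received.

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def td_update(s, a, r, s_next):
    """One Q-learning step. `r` is the already-realised past reward:
    it enters the target as a constant, so past utility is treated as
    unmodifiable by construction."""
    target = r + gamma * Q[s_next].max()   # bootstrapped one-step target
    Q[s, a] += alpha * (target - Q[s, a])  # move estimate toward target

# Example: the reward 1.0 was emitted in the past and is simply
# replayed as data; the update can only change future value estimates.
td_update(s=0, a=1, r=1.0, s_next=2)
```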