I disagree, for two reasons:
1. The $\tau_1 \cdot (\hat{R}_1 - \underline{R}_1)$ bound on how much there is to gain from creating a time machine and improving past utility is outweighed by the $\tau_1 \cdot (\hat{R}_1 - \underline{R}_1) \cdot C$ reward from $R_2$ for shutting down.
2. Every RL algorithm I’ve heard of implicitly bakes in an assumption that past utility is unmodifiable. I guess all bets are off with mesa-optimisers, but personally I’d bet against even mesa-optimisers in model-free RL behaving as if past utility is up for grabs.
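
To spell out the comparison in the first point, here is a rough sketch. It assumes $\tau_1 > 0$, $\hat{R}_1 > \underline{R}_1$, and $C > 1$; none of these is restated in this comment, so treat them as my reading of "outweighed" rather than as given.

$$
\underbrace{\tau_1 \cdot (\hat{R}_1 - \underline{R}_1) \cdot C}_{\text{reward from } R_2 \text{ for shutting down}}
\;-\;
\underbrace{\tau_1 \cdot (\hat{R}_1 - \underline{R}_1)}_{\text{bound on gain from improving past utility}}
\;=\;
\tau_1 \cdot (\hat{R}_1 - \underline{R}_1) \cdot (C - 1) \;>\; 0
$$

Under those assumptions, shutting down beats even the best-case gain from a time machine, and the margin grows linearly in $C$.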