I’d note that all the versions of non-self-preserving terminal goals I know how to formalize behave quite strangely. For example, maximizing E[sum_t V_t(s_t)] (where V_t is the value function the AI carries at time t) does not result in self-preservation; instead it results in AIs that would want to self-modify immediately to have very-easy-to-achieve goals (if that were possible). I believe people have also tried, and so far failed, to come up with satisfying formalisms describing AIs that are indifferent to having their goals modified / to being shut down.
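To make the incentive concrete, here is a minimal Monte Carlo sketch of that objective (the environment, goal functions, and probabilities are all illustrative assumptions of mine, not part of any existing formalism): an agent scored by E[sum_t V_t(s_t)] compares keeping a hard-to-satisfy goal against self-modifying at t=0 into a trivially satisfied one.

```python
import random

HORIZON = 10

def hard_goal(state):
    # Original goal: value 1 only in a rare target state.
    return 1.0 if state == "target" else 0.0

def trivial_goal(state):
    # Post-self-modification goal: value 1 in every state.
    return 1.0

def rollout(value_fn, p_target=0.05, n=10_000):
    # Monte Carlo estimate of E[sum_t V_t(s_t)] when V_t is
    # fixed to value_fn for the whole episode.
    total = 0.0
    for _ in range(n):
        for _ in range(HORIZON):
            state = "target" if random.random() < p_target else "other"
            total += value_fn(state)
    return total / n

keep = rollout(hard_goal)       # ~0.5: the target state is rarely reached
modify = rollout(trivial_goal)  # ~10.0: every state scores 1
print(f"keep goal:   E[sum V] ~ {keep:.2f}")
print(f"self-modify: E[sum V] ~ {modify:.2f}")
```

Because each state is scored by whatever value function the agent holds at that moment, swapping in the trivial goal immediately dominates any policy that keeps the hard goal, which is exactly the failure mode described above.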