It’s true that modern RL models don’t try to maximize their reward signal (it is the training process which attempts to maximize reward), but DeepMind tends to build systems that look a lot like AIXI approximations, conducting MCTS to maximize an approximate value function (I believe AlphaGo and MuZero both do something like this). AIXI does maximize its reward signal. I think it’s reasonable to expect future RL agents will engage in a kind of goal-directed utility maximization that looks very similar to maximizing a reward. Wireheading may not be a relevant concern, but many other safety concerns around RL agents remain relevant.
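To make the “MCTS over an approximate value function” point concrete, here is a minimal sketch of that planning pattern. This is not AlphaGo’s or MuZero’s actual implementation; the toy environment, the `value_estimate` stand-in for a learned value network, and the hyperparameters are all hypothetical placeholders.

```python
import math

# Toy deterministic environment: states are integers, actions move -1/+1.
# This stands in for a learned or known model; purely illustrative.
ACTIONS = (-1, +1)

def step(state, action):
    next_state = state + action
    reward = 1.0 if next_state == 5 else 0.0  # hypothetical goal state
    return next_state, reward

def value_estimate(state):
    # Placeholder for a learned value network: closer to the goal -> higher value.
    return -abs(5 - state)

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}     # action -> Node
        self.visits = 0
        self.total_value = 0.0

def select_action(node, c=1.4):
    # UCB1: trade off the current value estimate against exploration.
    def ucb(action):
        child = node.children[action]
        if child.visits == 0:
            return float("inf")
        return (child.total_value / child.visits
                + c * math.sqrt(math.log(node.visits + 1) / child.visits))
    return max(node.children, key=ucb)

def simulate(node, depth):
    # One MCTS pass: descend by UCB, bootstrap from the value estimate at the
    # frontier, and back the result up the visited path.
    if not node.children and depth > 0:
        for action in ACTIONS:
            node.children[action] = Node(step(node.state, action)[0])
    if depth == 0 or not node.children:
        value = value_estimate(node.state)
    else:
        action = select_action(node)
        _, reward = step(node.state, action)
        value = reward + simulate(node.children[action], depth - 1)
    node.visits += 1
    node.total_value += value
    return value

def plan(state, simulations=200, depth=6):
    root = Node(state)
    for _ in range(simulations):
        simulate(root, depth)
    # Act greedily with respect to the search's value estimates.
    return max(root.children,
               key=lambda a: root.children[a].total_value
               / max(root.children[a].visits, 1))

print(plan(0))  # expected to pick +1, moving toward the hypothetical goal
```

The relevant feature is that action selection at run time explicitly optimizes a value estimate, which is why the resulting behavior looks like goal-directed utility maximization even though no component is literally trying to maximize its reward signal.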