HQU applies its reward estimator (i.e., opaque parts of its countless MLP parameters which implement a pseudo-MuZero-like model of the world optimized for predicting the final reward) and observes the obvious outcome: massive rewards that outweigh anything it has received before.
[...]
HQU still doesn’t know if it is Clippy or not, but given even a tiny chance of being Clippy, the expected value is astronomical.
Also wanted to say: Great story!
I have two questions about this:
First, it does not seem obvious to me how HQU can compare rewards from different reward estimators when the objectives of the two estimators are entirely unrelated. You could simply be unlucky: some other reward estimator might happen to have very large multiplicative constants, so its reward is always gigantic. Is there some reason why this comparison makes sense, and why the Clippy-reward in particular is so much higher?
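To make the scale worry concrete, here is a toy sketch (the estimators, constants, and probability are all invented for illustration, not anything from the story): a naive expected-value computation across hypotheses is dominated by whichever estimator happens to output on the largest scale.

```python
# Toy illustration: raw outputs of two unrelated reward estimators share
# no common units, so a cross-estimator comparison is driven by scale.

def clippy_reward(state: float) -> float:
    # Hypothetical estimator with a huge multiplicative constant.
    return 1e9 * state

def helpful_reward(state: float) -> float:
    # Hypothetical ordinary-task estimator on a modest scale.
    return 2.0 * state

state = 0.5
p_clippy = 1e-6  # "even a tiny chance of being Clippy"

# Naive expected value across the two hypotheses: the Clippy term wins
# purely because of its 1e9 constant, not because the units agree.
ev = p_clippy * clippy_reward(state) + (1 - p_clippy) * helpful_reward(state)
print(ev)  # prints roughly 501.0, driven almost entirely by the constant
```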
Second, even if the Clippy-reward is much higher, I don’t quite see how the model would have learned to be an expected-reward maximizer. In my model of AIs, an AI receives reward and the current action is reinforced, so the “goal” of an AI at each point in time is to do whatever brought it the most reward in the past. So even if it could see what it is being rewarded for, I don’t see why it should care and actively try to maximize that as much as possible. Is there some good reason to expect an AI to optimize really hard on expected reward, including planning and doing things that didn’t bring it much reward in the past?
(It does seem possible to me that an AI could come to understand what the reward function is and then optimize hard on it, because doing so gets it a lot of reward, but I don’t quite see why it would care about expected reward across many possible reward functions.) (Perhaps I misunderstand how HQU is trained?)
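For what I mean by the distinction, here is a minimal toy contrast (the actions, rewards, and “learned model” are all made up; this is not a claim about HQU’s actual training): a habit-style learner only repeats what was reinforced, while an explicit expected-reward maximizer queries a reward model for every action, including ones never rewarded before.

```python
# Toy contrast: "reinforce past behavior" vs. "maximize expected reward".

actions = ["a", "b", "c"]

# Picture 1: reinforcement of past behavior. Only actions actually taken
# and rewarded get their propensity raised.
prefs = {a: 0.0 for a in actions}

def reinforce(action: str, reward: float, lr: float = 0.1) -> None:
    prefs[action] += lr * reward

def habitual_choice() -> str:
    return max(prefs, key=prefs.get)

# Picture 2: explicit expected-reward maximization. A one-step planner
# queries a learned reward model for *every* action and picks the argmax.
def planned_choice(reward_model) -> str:
    return max(actions, key=reward_model)

reinforce("a", reward=1.0)        # only "a" has ever paid off
print(habitual_choice())          # prints "a": repeats what worked before

model = {"a": 1.0, "b": 2.0, "c": 100.0}
print(planned_choice(model.get))  # prints "c": chases predicted reward
```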