Whether it’s horrible or not depends on the X, right? Which I didn’t really define, and which is a whole can of worms (reward misgeneralization, use/mention distinction, etc.). (Like, if we send RL reward based on stock market price, is claim 2 talking about “trained model explicitly wants high stock price”, or “trained model explicitly wants high reward function output”, or “we get to define X to be either of those and many more things besides”? I didn’t really specify.)
But yeah, for X’s that we can plausibly operationalize into a traditional RL reward function†, I’m inclined to say that we probably don’t want an AI that explicitly wants X. I think we’re in agreement on that.
† (as opposed to weirder things like “the reward depends in part on the AI’s thought process, and not just its outputs”)