Suppose I get hit by a meteor before I can hear your “2”—will you then have failed to tell me what 1+1 is? If so, suddenly this simple goal implies being able to save the audience from meteors. Or suppose your screen has a difficult-to-detect short circuit—your expected utility would be higher if you could check your screen and repair it if necessary.
Because a utility maximizer has no notion of “good enough,” it treats a 0.09% improvement on top of a 99.9% baseline just as seriously as a 90% improvement over a 0% baseline; it doesn’t see these small improvements as trivial, or in any way not worth its best effort. If your goal actually has some chance of failure, and there are capabilities that might help mitigate that failure, the goal will incentivize capability gain. And because the real world is complicated, this seems to be true for basically all goals that care about the state of the world.
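To make that concrete, here’s a toy expected-utility comparison (my own made-up numbers, not a model of any real agent): an argmax over expected utilities picks whichever option scores higher, no matter how thin the margin, so any cheap capability that shaves a bit off the failure probability gets chosen.

```python
# Illustrative only: toy numbers, not a model of any real agent.

def expected_utility(p_success: float, cost: float) -> float:
    """Utility 1 for success, 0 for failure, minus the cost of the extra effort."""
    return p_success * 1.0 - cost

actions = {
    "just_answer":              expected_utility(p_success=0.999, cost=0.0),
    "answer_plus_meteor_laser": expected_utility(p_success=0.9999, cost=0.00001),
}

# argmax doesn't care that the margin is tiny; it only cares which number is bigger.
print(max(actions, key=actions.get))  # -> answer_plus_meteor_laser
```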
If we have a reinforcement learner rather than a utility maximizer with a pre-specified model of the world, this story is a bit different, because of course there will be no meteors in the training data. You might think this means the RL agent cannot care about meteors, but this is actually somewhat undefined behavior, because the AI still gets to see observations of the world. If it’s vanilla RL with no “curiosity,” it won’t start to care about some aspect of the world until that aspect actually affects its reward; for meteors this would take far too long to matter, but it does become important when the reward is more informative about the real world. If it’s more along the lines of DeepMind’s game-playing agents, though, it will actively try to find things out about the world, since that increases its rate of approach to optimal play.
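To make the “curiosity” distinction a bit more concrete: the difference is roughly whether the reward the agent optimizes includes an intrinsic exploration bonus on top of the task reward. Here is a minimal sketch of the prediction-error flavor of that idea (my own toy version with a made-up beta, in the spirit of intrinsic-motivation methods like Pathak et al.’s ICM, not the mechanism of any particular DeepMind agent):

```python
import numpy as np

def curiosity_bonus(predicted_next_obs: np.ndarray, actual_next_obs: np.ndarray) -> float:
    """Intrinsic reward: how badly the agent's world model predicted what it saw."""
    return float(np.mean((predicted_next_obs - actual_next_obs) ** 2))

def shaped_reward(extrinsic_reward: float,
                  predicted_next_obs: np.ndarray,
                  actual_next_obs: np.ndarray,
                  beta: float = 0.1) -> float:
    """Total reward = task reward + beta * surprise about the world."""
    return extrinsic_reward + beta * curiosity_bonus(predicted_next_obs, actual_next_obs)
```

With beta = 0 this collapses to vanilla RL, where observations that never touch the reward never shape behavior; with beta > 0 the agent is directly paid to learn about the world.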
There are definitely ideas in the literature that relate to this problem, particularly trying to formalize the notion that the AI shouldn’t “try too hard” on easy goals. I think these attempts mostly fall under two umbrellas—other-izers (that is, not maximizers) and impact regularization (penalizing the building of meteor-defense lasers).
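As a caricature of the impact-regularization idea, you can think of it as subtracting a “how much did you change the world” term from the task utility. The impact values and the lambda below are stand-ins I made up; real proposals (e.g. relative reachability, attainable utility preservation) define the penalty far more carefully.

```python
# Toy sketch of an impact-penalized objective; the impact values and lam
# are placeholders, not any published impact measure.

def regularized_score(task_utility: float, impact: float, lam: float = 10.0) -> float:
    """Score = how well the task went, minus a penalty for changing the world."""
    return task_utility - lam * impact

# Just answering barely changes the world; building a meteor-defense laser does.
print(regularized_score(task_utility=0.999, impact=0.001))   # ~0.989
print(regularized_score(task_utility=0.9999, impact=0.5))    # ~-4.0
```

Under a penalty like this, the agent that just says “2” wins, even though the laser-builder scores slightly higher on the raw task.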
Thanks again for your reply. I see your point that the world is complicated and a utility maximizer would be dangerous, even if the maximization is supposedly trivial. However, I don’t see how an achievable goal has the same problem. If my AI finds the answer 2 before a meteor hits it, I would say it has solidly landed at 100% and stops doing anything. Your argument would hold if it decided to rule out all possible risks first, before actually starting to look for the answer to the question, which it would otherwise quickly find. But since ruling out those risks would be much harder than finding the answer, I can’t see my little agent doing that.
I think my easy goals come closest to what you call other-izers. Do you have any more pointers to help me find that literature?
Thanks for your help, it helps me to calibrate my thoughts for sure!
I think “1 + 1 = ?” is actually not an easy enough goal, since it’s not 100% certain that the answer is 2. Getting to 100% certainty (including about what I actually meant by the question) could still be nontrivial. But what if the goal is ‘delete filename.txt’? Maybe the trick is in the language.