I read it a little differently. I thought he had in mind the possibility that a cosmic ray flips the sign of the utility function, or something like that. That would cause the agent to try to create the absolute worst possible future (according to the original utility function U).
W is 0 almost always, but negative ten grillion jillion if some very specific piece of paper is present in the universe
The behavior from optimizing U=V+W is the same as the behavior from optimizing V by itself (at least, so it seems at first glance), because it wasn’t going to make that piece of paper anyway.
But if the sign of U gets flipped, the -W term dominates over the -V term in determining behavior, and the AGI “only” kills everyone and tiles the universe with pieces of paper, and doesn’t create hell.
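Here's a toy sketch of that argument, with made-up numbers and outcome labels of my own (none of this is from the original setup, just an illustration of why the -W term dominates after a sign flip):

```python
# Toy illustration: outcomes are (V, W) pairs; U = V + W.
# W is 0 almost everywhere, but hugely negative for the "special paper" outcome.
outcomes = {
    "normal good future":      (100, 0),
    "hell (worst per V)":      (-100, 0),
    "make the special paper":  (0, -1e12),   # W's enormous penalty
}

def U(name):
    v, w = outcomes[name]
    return v + w

# Intact sign: the agent maximizes U, picks the good future,
# and never makes the paper (so adding W changed nothing).
print(max(outcomes, key=U))                  # -> "normal good future"

# Sign flipped: the agent now maximizes -U, i.e. minimizes U.
# The -W term (+1e12) dwarfs the -V term (+100), so it "only"
# tiles the universe with paper instead of building hell.
print(max(outcomes, key=lambda o: -U(o)))    # -> "make the special paper"
```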
Does that help?