Well, then, let’s change the example. Instead of Monday +, Tuesday -, Wednesday and all later times +, with it unable to actually affect paperclip counts on Tuesday, let’s consider a single transition: u+ on Monday, Tuesday, and Wednesday, then u- on Thursday and all later times, and it already has all the infrastructure it needs.
In this case, it will see that it can get a + score by having paperclips Monday through Wednesday, but that any it still has on Thursday will count against it.
So, it will build paperclips as soon as it learns of this pattern. It will make them have a low melting point, and it will build a furnace†. On Wednesday evening at the stroke of midnight, it will dump its paperclips into the furnace. It does this because all along, from the very beginning, it will have wanted there to be paperclips M-W, and none after that. And on Thursday it will be happy that there were paperclips M-W, but glad that there aren’t any now.
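To make the plan above concrete, here is a toy sketch of my own, not a model of any real agent. It assumes one action per day taken at the start of the day, a made-up action set (`build`, `hold`, `melt`), one paperclip per `build`, and a score that is simply the sign for that day times the number of paperclips in existence that day. All of those names and numbers are assumptions for illustration only.

```python
from itertools import product

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
ACTIONS = ("build", "hold", "melt")  # hypothetical action set


def sign(day_index):
    """u+ on Monday through Wednesday, u- from Thursday onward."""
    return 1 if day_index <= 2 else -1


def trajectory(plan):
    """Paperclip count on each day, given one action at the start of each day."""
    counts, clips = [], 0
    for action in plan:
        if action == "build":
            clips += 1          # assumption: one paperclip per build action
        elif action == "melt":
            clips = 0           # the furnace destroys everything at once
        counts.append(clips)
    return counts


def total_utility(plan):
    """Sum over days of (sign for that day) * (paperclips existing that day)."""
    return sum(sign(i) * c for i, c in enumerate(trajectory(plan)))


# Brute-force search over all 3**7 week-long plans.
best_plan = max(product(ACTIONS, repeat=len(DAYS)), key=total_utility)
best_counts = trajectory(best_plan)
best_score = total_utility(best_plan)

for day, action, count in zip(DAYS, best_plan, best_counts):
    print(day, action, count)
```

Under these assumptions the search lands on exactly the furnace plan: build on each of Monday, Tuesday, and Wednesday, then melt everything the moment Thursday starts. The point is that the planner evaluates the whole week against the whole schedule from the start, so the Thursday sign flip is already baked into what it does on Monday.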
I think that the trick is getting it to submit to changes to its utility function based on what we want at that time, without trying to game it. That’s going to be much harder.
† And if it suspects that there are paperclips out in the wild, it will begin building machines to hunt them down and, iff it’s Thursday or later, destroy them. It will do this as soon as it learns that it will eventually be a paperclip minimizer for long enough that it is worth worrying about.