goodharting to change its definition of what constitutes a paperclip to one that is easier for it to maximize
The same thing applies: “Does that fulfill the current goal-definition?” (Note that this is not a single question; we can ask it about each possible goal-definition.)
Why is it most likely that it [...]
This was about an abstract definition of an agent (not itself a prediction, but it does say something about a space of math that we might end up in). There are surely possible programs that would exhibit any given behavior, although some look harder to program (or ‘less natural’). For example, “an entity that is a paperclip maximizer for 100 years, then suddenly switches to maximizing stamps” looks harder to program (if it is an embedded agent), because you’d need to find a method by which it won’t just self-modify to never turn into a stamp maximizer (as turning into one would prevent it from maximizing paperclips), or, if you rule out just self-modification, by which it won’t unleash a true paperclip maximizer and shut itself down (and so on, if you were to additionally rule out just that).[1]
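To make the "does that fulfill the current goal-definition?" check concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than a model of any real system: the three policies, the one-object-per-year world model in `outcome_of`, the 200-year horizon, and the names `GOALS` and `choose` are all hypothetical. The only point it tries to show is that each candidate self-modification is scored by the goal the agent holds now, not by the goal it would hold afterwards.

```python
from typing import Callable, Dict

# An outcome is a count of each kind of object produced over the horizon.
Outcome = Dict[str, int]

HORIZON_YEARS = 200  # hypothetical planning horizon
SWITCH_YEAR = 100    # year at which the "switching" design would change goals

POLICIES = [
    "stay_paperclip_maximizer",
    "switch_to_stamps_at_year_100",
    "loosen_paperclip_definition",
]


def outcome_of(policy: str) -> Outcome:
    """Toy world model (an assumption): one object per year of whatever is pursued."""
    if policy == "stay_paperclip_maximizer":
        return {"paperclips": HORIZON_YEARS}
    if policy == "switch_to_stamps_at_year_100":
        return {"paperclips": SWITCH_YEAR, "stamps": HORIZON_YEARS - SWITCH_YEAR}
    if policy == "loosen_paperclip_definition":
        # Goodharting: easy-to-make objects that only a looser definition counts.
        return {"pseudo_clips": 10 * HORIZON_YEARS}
    raise ValueError(f"unknown policy: {policy}")


# Each goal-definition is a function from outcomes to a score.
GOALS: Dict[str, Callable[[Outcome], int]] = {
    "paperclips": lambda o: o.get("paperclips", 0),
    "stamps": lambda o: o.get("stamps", 0),
    "loose_paperclips": lambda o: o.get("paperclips", 0) + o.get("pseudo_clips", 0),
}


def choose(current_goal: str) -> str:
    """Pick the policy that scores highest under the goal held right now."""
    utility = GOALS[current_goal]
    return max(POLICIES, key=lambda p: utility(outcome_of(p)))


for goal in GOALS:
    print(goal, "->", choose(goal))
```

In this toy setup, an agent whose current goal is `paperclips` picks neither the switch to stamps nor the loosened definition, because both score lower under the definition it currently holds. Asking the same question under `stamps` or `loose_paperclips` gives a different answer, which is the sense in which it is not a single question.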
[1] Though, very tangentially, there is a simple way to do that.