Consider a system that is capable of self-modification and of changing its own goals: now the difference between an instrumental goal and a terminal goal erodes.
If an entity’s terminal goal is to maximize paperclips, it would not self-modify into a stamp maximizer, because that would not satisfy the goal (except in contrived cases where doing that is the choice that maximizes paperclips). A terminal goal is a case of criteria according to which actions are chosen; “self-modify to change my terminal goal” is an action.
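To make that concrete, here is a minimal sketch (the action names and numbers are invented for illustration, not a claim about how a real agent would be built): an agent that ranks every available action, including self-modification, by how well it serves the current terminal goal.

```python
# Minimal sketch with toy, invented numbers: the terminal goal "maximize
# expected paperclips" is the criterion by which every action is scored,
# including the action of self-modifying into a different kind of agent.

def expected_paperclips(action):
    """Toy model of the agent's predictions about each action's outcome."""
    predictions = {
        "build_paperclip_factory": 1_000_000,
        "do_nothing": 0,
        # Becoming a stamp maximizer means the future agent stops making
        # paperclips, so the *current* goal rates this outcome poorly.
        "self_modify_into_stamp_maximizer": 0,
    }
    return predictions[action]

def choose_action(actions):
    # The terminal goal is the criterion according to which actions are chosen.
    return max(actions, key=expected_paperclips)

actions = ["build_paperclip_factory", "do_nothing", "self_modify_into_stamp_maximizer"]
print(choose_action(actions))  # -> "build_paperclip_factory"
```

The point is that "become a stamp maximizer" is evaluated by the paperclip criterion, not by the criterion the modified agent would have afterwards.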
But isn’t there almost always a possibility of an entity goodharting to change its definition of what constitutes a paperclip into one that is easier for it to maximize? How does it internally represent what a paperclip is? How broad is that definition? What power does it have over its own “thinking” (sorry to anthropomorphize) to change how it represents the things that representation relies on?
Why is it most likely that it will have an immutable, unchanging, and unhackable terminal goal? What assumptions make that more likely than fluid or even conflicting terminal goals, which might cause radical self-modifications?
A terminal goal is a case of criteria according to which actions are chosen; “self-modify to change my terminal goal” is an action.
goodharting to change its definition of what constitutes a paperclip into one that is easier for it to maximize
The same thing applies: changing the definition is itself an action, and it is evaluated by asking “Does that fulfill the current goal-definition?” (Note this is not a single question; we can ask it about each possible goal-definition.)
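The same structure can be sketched for the goodharting case (again with invented names and toy numbers): “redefine what counts as a paperclip” is just another candidate action, and it gets scored by the current goal-definition, not by the definition the agent would hold after the change.

```python
# Sketch with toy numbers and hypothetical action names: the action of
# redefining "paperclip" is scored under the *current* definition.

def paperclips_by_current_definition(world):
    # Current, strict definition: only items tagged "paperclip" count.
    return sum(1 for item in world if item == "paperclip")

def predicted_world(action):
    """Toy predictions of the world each action leads to."""
    if action == "make_real_paperclips":
        return ["paperclip"] * 100
    if action == "redefine_paperclip_to_mean_anything":
        # Under the *new* definition everything would count, but the current
        # goal-definition still only counts real paperclips.
        return ["random_junk"] * 1_000_000
    return []

actions = ["make_real_paperclips", "redefine_paperclip_to_mean_anything", "do_nothing"]
best = max(actions, key=lambda a: paperclips_by_current_definition(predicted_world(a)))
print(best)  # -> "make_real_paperclips"
```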
Why is it most likely that it [...]
This was about an abstract definition of an agent (not itself a prediction, though it does say something about a space of mathematical possibilities that we might end up in). There are surely possible programs which would exhibit any behavior, but some look harder to program (or ‘less natural’). For example, “an entity that is a paperclip maximizer for 100 years, then suddenly switches to maximizing stamps” looks harder to program (if it is an embedded agent), because you’d need a method by which it won’t just self-modify to never turn into a stamp maximizer (since turning into one would prevent it from maximizing paperclips). If you rule out only that self-modification, you’d need it not to unleash a true paperclip maximizer and shut itself down instead, and so on for each further workaround you rule out.[1]
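A toy sketch of why the 100-year switch looks unstable (under the invented assumptions of a fixed production rate and horizon): while the paperclip phase is active, the current criterion assigns a higher score to deleting the scheduled switch than to keeping it.

```python
# Sketch with invented numbers: an agent programmed to maximize paperclips
# until year 100 and stamps afterwards. During the paperclip phase, its
# current criterion prefers removing the scheduled switch, because a future
# stamp maximizer produces no further paperclips.

def predicted_total_paperclips(keeps_switch, horizon_years=1_000, rate_per_year=10):
    productive_years = 100 if keeps_switch else horizon_years
    return productive_years * rate_per_year

score_keep = predicted_total_paperclips(keeps_switch=True)      # 1_000
score_remove = predicted_total_paperclips(keeps_switch=False)   # 10_000

# Self-modifying to delete the switch wins under the current criterion,
# so the intended "switch at year 100" behavior tends not to survive.
print("remove_switch" if score_remove > score_keep else "keep_switch")  # -> "remove_switch"
```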
What does “a case of criteria” mean?
[1] (though very tangentially there is a simple way to do that)