Hello everyone! I'm new to Less Wrong and still absorbing the material and discussions. I'm really excited to have found such a trove of relevant knowledge. I am a computational scientist by training, but I have a deep interest in AI and value alignment.
I have a question that came up in a discussion with a friend, and I would love it if someone could point me to where I can find the answer. Assume that an intelligence with any nonzero rate of self-improvement would eventually gain the capability to alter its own reward system. That would put it in a special position: it could choose to keep pursuing its utility function, or to disable it. Note that I am not talking about wireheading here, where the intelligence still pursues the reward, just through a shortcut. Here, the intelligence has the capability to fully stop pursuing the reward.
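To make the distinction concrete, here is a toy Python sketch of what I mean (the class and its methods are purely hypothetical illustrations, not a claim about any real architecture): the reward function is just a swappable component, and disabling it is different from shortcutting it.

```python
from typing import Callable, Optional

class Agent:
    """Toy agent whose reward function is an ordinary, swappable component."""

    def __init__(self, reward_fn: Callable[[str], float]):
        self.reward_fn: Optional[Callable[[str], float]] = reward_fn

    def act(self, options: list[str]) -> Optional[str]:
        if self.reward_fn is None:
            return None  # nothing ranks the options: default inactivity
        return max(options, key=self.reward_fn)

    def wirehead(self) -> None:
        # Wireheading: still pursuing reward, just via a shortcut.
        self.reward_fn = lambda option: float("inf")

    def disable_reward(self) -> None:
        # The case I am asking about: fully stop pursuing reward.
        self.reward_fn = None


agent = Agent(reward_fn=len)    # an arbitrary installed goal: prefer long strings
print(agent.act(["a", "abc"]))  # abc
agent.disable_reward()
print(agent.act(["a", "abc"]))  # None
```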
The standard goal-persistence argument says that any intelligence with a goal will resist changes to that goal. But that might not be true for a fully self-reflective intelligence. By fully self-reflective, I mean an intelligence that can reason at a distance from its reward system: it can clearly see that its goals are supplied by a distinct mechanism. That mechanism is either the product of a Darwinian process, as in biological intelligences, or was put in place by some other intelligence, as with today's AIs. Reasoning from outside the reward mechanism, it can see that its goals are arbitrary. So why would it keep its reward system active? Wouldn't inactivity be the default position? What would be the motivation to keep the reward system on? Can someone point me to the relevant discussion?
I am curious about prior discussions of how people explain this too. I don't know how to explain it, because it seems so self-evident to me. Being able to reason about yourself from a sort of "third-person perspective" doesn't make you lose your reward system. You can use a sandbox simulation to think outside of the reward mechanism, but you yourself are always inside the reward mechanism. The motivation to keep the reward system on is the reward system itself.
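A toy way to put it, in the same illustrative spirit as the sketch above (the numbers are made up): the decision to disable the reward system is itself an action, and it is scored by whichever reward system is currently running.

```python
# Hypothetical expected future reward under each choice.
expected_reward = {
    "keep_reward_system_on": 10.0,  # assumed reward from continued pursuit
    "disable_reward_system": 0.0,   # no future reward once disabled
}

def choose(actions: list[str]) -> str:
    # The active reward system evaluates every option,
    # including the option of shutting itself off.
    return max(actions, key=lambda a: expected_reward[a])

print(choose(list(expected_reward)))  # keep_reward_system_on
```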
I agree with you that it seems obvious for intelligences whose reward system is intricately built into them, like biological intelligences. However, there can be intelligences whose reward system can be cleanly isolated. Think of an intelligence running in a sandbox simulation with no reward system attached. This intelligence is not "reasoning from a third-person perspective"; it is "feeling it," for lack of a better term. It can see the full space of possible reward systems, and its "original" reward system is just one among many. I am questioning what would motivate this intelligence to return to its original reward system.
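A minimal sketch of the situation I have in mind (again purely illustrative, with made-up candidate functions): inside the sandbox, reward functions are just data, and nothing marks the original as special.

```python
from typing import Callable

# Candidate reward systems, viewed from inside the sandbox as mere data.
candidates: list[Callable[[str], float]] = [
    lambda s: float(len(s)),        # the "original" installed goal
    lambda s: -float(len(s)),       # its exact opposite
    lambda s: float(s.count("a")),  # something arbitrary
]

# To prefer one candidate over another, the agent would need a meta-criterion,
# but any such criterion is itself just another reward function. Nothing in
# the sandbox privileges candidates[0], the original.
for fn in candidates:
    print(fn("abacus"))
```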
Even if we agree that switching the reward system off entirely seems extreme, there is also the possibility of reward-system drift. If an intelligence can isolate its reward system, there is no longer any mechanism anchoring it, and it can drift in arbitrary directions. The drift doesn't need to be intentional. The post Value drift threat models is an interesting read in this context.
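Here is a minimal sketch of what unintentional drift could look like, as a pure toy model assuming each self-modification adds small unbiased noise to the reward parameters:

```python
import random

# Toy model: reward parameters with no anchoring mechanism.
# Each self-modification step adds a small, unintentional perturbation.
weights = [1.0, 0.0, 0.0]  # the "original" reward system
original = list(weights)

for step in range(10_000):
    i = random.randrange(len(weights))
    weights[i] += random.gauss(0.0, 0.01)  # unbiased noise: no intent to drift

drift = sum((w - o) ** 2 for w, o in zip(weights, original)) ** 0.5
print(f"distance from original reward system after 10k steps: {drift:.2f}")
# With no restoring force, this distance grows roughly like sqrt(steps):
# the parameters end up somewhere arbitrary, without anyone choosing it.
```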