It’s possible for a person to not want to stub their toe for instrumental reasons (they’re training to win a marathon, and a toe injury would reduce their speed);
It’s also possible for a person to not want to stub their toe in and of itself, because stubbing your toe is immediately unpleasant (negative primary reward).
(Or both.) By the same token,
It’s possible that an AGI is plotting world takeover, and it wants to avoid getting caught being misleading for instrumental reasons;
It’s possible that the reward function has historically and reliably emitted an immediate negative reward every time the AGI got caught being misleading, and this led to an AGI that thinks of “getting caught being misleading” as immediately unpleasant and tries to avoid it (other things equal), not for instrumental reasons but in and of itself.
You seem to be implying that the first bullet point is possible but the second bullet point is not, and I don’t understand why.
Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
(I certainly agree that “don’t get caught being misleading” is a dangerously unhelpful motivation, and compatible with treacherous turns. That was my whole point here.)
Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
I mean, nearest unblocked strategies are mainly a problem in the optimistic case where the AI does learn “don’t be misleading”, but given that, yeah, sort of (though I wouldn’t say irrelevant, only that even if you have a “don’t be misleading” preference, it’s not a robust solution). Not that it’s impossible to get this right so that the AI behaves as desired, but I think current proposals aren’t specified concretely enough for us to say they don’t run into undesirable nearest unblocked strategies.
One particular problem is that preferences which aren’t over world trajectories aren’t robust:
Preferences over world trajectories are robust in the sense that if you imagine a change to that preference, the resulting trajectory ranks poorly according to the preference itself.
Myopic preferences that just trigger in a given context aren’t robust in that sense: they don’t assign negative value to plans that remove the preference for future occasions.
Say I need to walk to work, but the fastest route goes through a passage that smells really bad, so it’s unpleasant to walk through. When I then think of a plan like “I can wear a mask that filters the air so I don’t smell anything bad”, this plan doesn’t get rejected by the smell preference.
A preference over world trajectories, one that assigns significant negative utility to every instance of walking through a bad-smelling passage, would be more robust in this sense; a toy sketch of the contrast is below.
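Here is a minimal toy sketch of that contrast, using my made-up numbers and the smelly-passage example (nothing here comes from an actual proposal). The myopic penalty only fires when the unpleasant experience is actually triggered, so the mask plan defuses it; the trajectory-level penalty still counts every traversal of a bad-smelling passage. Analogously, a plan that removes the myopic preference for future occasions would trigger no myopic penalty either, while a trajectory-level evaluation of the resulting future still penalizes the traversals that follow.

```python
from dataclasses import dataclass

@dataclass
class Step:
    passage_smells_bad: bool  # fact about the world at this step
    agent_smells_it: bool     # whether the unpleasant experience is triggered
    minutes: int              # travel time for this step

def myopic_score(plan: list[Step]) -> float:
    # Penalty only for steps where the bad smell is actually experienced.
    return -sum(10 for s in plan if s.agent_smells_it) - sum(s.minutes for s in plan)

def trajectory_score(plan: list[Step]) -> float:
    # Penalty for every traversal of a bad-smelling passage, mask or not.
    return -sum(10 for s in plan if s.passage_smells_bad) - sum(s.minutes for s in plan)

walk_and_suffer = [Step(True, True, 5)]     # fastest route, no mask
wear_mask       = [Step(True, False, 5)]    # fastest route, mask blocks the smell
long_detour     = [Step(False, False, 20)]  # avoid the passage entirely

for name, plan in [("walk_and_suffer", walk_and_suffer),
                   ("wear_mask", wear_mask),
                   ("long_detour", long_detour)]:
    print(name, myopic_score(plan), trajectory_score(plan))

# Myopic scores:     walk_and_suffer -15, wear_mask -5, long_detour -20
#   -> the plan that defuses the preference wins.
# Trajectory scores: walk_and_suffer -15, wear_mask -15, long_detour -20
#   -> the mask plan gains nothing against the trajectory-level preference.
```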
So I currently think the relevant preferences are preferences over world trajectories, and that other kinds of preferences are better modeled as constraints for the world-trajectory-valuing part to optimize around. I know humans often have short-term preferences that trigger myopically, but very impressive human accomplishments probably involved aiming at a more long-term, coherent goal.
I don’t know exactly how you imagine a “don’t be misleading” preference manifesting, but I imagined it more like the myopic smell preference, in which case there’s optimization pressure from the more long-term, coherent parts to remove this myopic preference or prevent it from triggering. (To be clear, it’s not that this would be useless; it could still suffice to make the first working plan in the search ordering a desirable one, especially if the task we want the AI to do isn’t absurdly difficult.)
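Rough sketch of the picture I have in mind here, with names and structure made up purely for illustration (not something from any concrete proposal): the long-term, trajectory-valuing part proposes plans best-first, and myopic preferences act as veto filters it optimizes around, so what you get is the first plan in the search ordering that nothing vetoes.

```python
from typing import Callable, Iterable, Optional, TypeVar

Plan = TypeVar("Plan")

def first_acceptable_plan(
    plans_by_trajectory_value: Iterable[Plan],  # best-first, per the long-term goal
    constraints: list[Callable[[Plan], bool]],  # myopic preferences acting as filters
) -> Optional[Plan]:
    """Return the first plan (in the long-term part's preferred ordering)
    that no myopic constraint vetoes."""
    for plan in plans_by_trajectory_value:
        if all(ok(plan) for ok in constraints):
            return plan
    return None
```

Whether that first acceptable plan is a desirable one then depends on the search ordering and on how hard the task is, which is the point in the parenthetical above.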
(But even if the preference takes the more world-trajectory form “I value not being misleading” (which would be good, because it would incentivize planning to maintain that preference), there may still be problems: “not being misleading” is a fuzzy concept that has to be rebound to a more precise concept in order to evaluate plans, and it might not rebind in a desirable way. And we haven’t yet specified how to trade off the “not being misleading” value against other goals.)
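To spell out that underspecification a bit: even if “not being misleading” becomes a trajectory-level value, it has to be rebound to some precise proxy and weighed against other goals. In the hypothetical sketch below, both the proxy and the weight are stand-ins I’m making up; nothing in current proposals pins down what they should be.

```python
LAMBDA = 1.0  # relative weight of "not being misleading" vs. task value; unspecified

def misleadingness_proxy(plan) -> float:
    """Some precise rebinding of the fuzzy concept (e.g. a learned classifier's
    score). Which rebinding the AI ends up with is exactly the open question."""
    raise NotImplementedError

def overall_score(plan, task_value: float) -> float:
    # How the two values trade off is entirely determined by LAMBDA and the proxy,
    # neither of which has been specified.
    return task_value - LAMBDA * misleadingness_proxy(plan)
```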