I like this question.
I have an in-practice answer. I don’t have a universal theoretical answer though. I’ll offer what I see, but not to override your question. Just putting this forward for consideration.
In practice, every time I’ve identified a subagent that wants something “actually bad” for me, it’s because of a kind of communication gap (which I’m partly maintaining with my judgment). It’s not like the subagent has a terminal value that’s intrinsically bad. It’s more that the goal is the only way it can see to achieve something it cares about, but I can see how pursuing and even achieving that goal would actually damage quite a lot.
The phrase “Cancer is bad for cancer” pops to mind for me here. It’s a mental tag I use for how caring about anything would, all else being equal, result in wanting to care about everything. If cancer cells could understand the impact of what they’re doing on their environment, and how it would eventually lead to the death of the very context that allows them to “be immortal”, they wouldn’t want to continue doing what they’re doing. So in a quirky and kind of anthropomorphized way, cancer is a disease of context obliviousness.
Less abstractly, sometimes toddlers want things that don’t seem to make physical sense. Or they want to do something that’s dangerous or destructive. But they don’t have the capacity to recognize the problem with what they’re pursuing even if it’s explained to them.
So what do?
Well, the easiest answer is to overpower the toddler. Default parenting. It also incurs adaptive entropy.
But there’s a more subtle thing that I think is healthier. It just takes longer, is trickier, and requires a kind of flip in priorities. If the toddler can feel that they’re being sincerely listened to, and that what they want is something the adult is truly valuing too, and they have a lot of experience that the adult is accounting for things the kid can’t yet but is still taking the kid’s desires seriously in contact with those unseen things, then the toddler can come to trust the adult’s “no” even when the kid doesn’t and can’t know why there’s a “no”. It’s not felt as arbitrary thwarting anymore.
This requires a lot of skill on the part of the parent. Sometimes in practice the skill-and-difficulty combo means it’s not doable, in which case “We’re not doing that because I’m bigger and I say so” is the fallback.
But as a template, I find this basically just works in terms of navigating subagents. It’s an extension of “If this agent could see what I see, they’d recognize that part of the context that relates to them getting what they want would be harmed by what they’re trying to do.” So I’m opposing the subagent not because I need to stop its stupid influence, but because that’s how I care for what it’s caring about.
If that’s really where I’m coming from, then it’s pretty easy to pass its trust tests. I just have to learn how to speak its language, so to speak.