Great post! The over- vs. undersculpting distinction currently seems a lot nicer to me than I previously considered the outer- vs. inner-alignment distinction to be.
Some comments:
1:
The “over-/undersculpting” terminology seems a bit imperfect because it suggests there’s a golden middle, whereas actually we have both problems simultaneously. But maybe it’s fine, because we do sort of want something in the middle; it’s just that hitting a good middle isn’t enough. And it does capture well that having more of one problem might lead to having less of the other.
2:
The human world offers an existence proof. We’re often skeptical of desire-changes—hence words like “brainwashing” or “indoctrination”, or radical teens telling their friends to shoot them if they become conservative in their old age. But we’re also frequently happy to see our desires change over the decades, and think of the changes as being for the better. We’re getting older and wiser, right? Well, cynics might suggest that “older and wiser” is cope, because we’re painting the target around the arrow, and anyway we’re just rationalizing the fact that we don’t have a choice in the matter. But regardless, this example shows that the instrumental convergence force for desire-update-prevention is not completely 100% inevitable—not even for smart, ambitious, and self-aware AGIs.
This might not generalize to super-von-Neumann AGIs, though. Normal humans just aren’t optimizing hard enough to come up with the strategy of preserving their goals in order to accomplish their goals.
Finding a reflectively stable motivation system that doesn’t run into the goal-preservation instrumental incentive is what MIRI attempted in their corrigibility agenda. They failed because it turned out to be unexpectedly hard. I’d say that makes it unlikely that an AGI will fall into such a reflectively stable corrigibility basin when its intelligence is scaled up a lot, even if we try to make it think in corrigible ways. (Though there’s still hope for keeping the AI correctable if we keep it limited and unreflective in some ways.)
3:
As an example (borrowing from my post “Behaviorist” RL reward functions lead to scheming), I’m skeptical that “don’t be misleading” is really simpler (in the relevant sense) than “don’t get caught being misleading”. Among other things, both equally require modeling the belief-state of the other person. I’ll go further: I’m pretty sure that the latter (bad) concept would be learned first, since it’s directly connected to the other person’s immediate behavior (i.e., they get annoyed).
I (tentatively) disagree with the frame here, because “don’t get caught being misleading” isn’t a utility-shard over world-trajectories, but rather just a myopic value accessor on the model of a current situation (IIUC). I think it’s probably correct that humans usually act based on such myopic value accessors, but in cases where very hard problems need to be solved, what matters are the more coherent, situation-independent values. So my story for why the AI would be misleading is rather that it plans how best to achieve something, and being misleading without getting caught is a good strategy for that.
I mean, there might still be myopic value accessor patterns, but my cached reply would be that these would just be constraints that get optimized around by the more coherent value parts, e.g. by finding a plan representation in which the myopic pattern doesn’t trigger (aka the nearest unblocked strategy problem). (This doesn’t matter here, because we agree the AI would learn “don’t get caught”, but it’s possible we still have a disagreement here, like in the case of your corrigibility proposal.)
I (tentatively) disagree with the frame here, because “don’t get caught being misleading” isn’t a utility-shard over world-trajectories, but rather just a myopic value accessor on the model of a current situation (IIUC)
I don’t quite understand; some jargon might have to be unpacked.
Why shouldn’t there be optimization pressure to steer the world in a quite general, even self-reflective, way so that you don’t get caught being misleading?
Or are you saying something like: “don’t get caught being misleading” is somehow automatically ‘too small an idea’ for it to be worth talking about the AI learning it and scheming because of it?
I do think there’s optimization pressure to steer toward not getting caught being misleading, but I think that’s rather because the AI plans how to achieve its other goals while modelling reality accurately, not because it learns to directly value “don’t get caught being misleading” in its learned value function.
Though possibly the AI could still learn to value this (or alternatively to value “don’t be misleading”), but in that case these value shards seem more like heuristic value estimators applied to particular situations, rather than a deeply coherent utility specification over universe-trajectories. And I think such other kinds of preferences probably aren’t that important once you crank up intelligence past the human level, because they will be seen as constraints to be optimized around by the more coherent value parts, and you run into nearest unblocked strategy problems. (I mean, you could have a preference over universe trajectories that at no timestep are you misleading, but given the learning setup I would expect a shallower version of that preference to be learned. Though it’s also conceivable that the AI rebinds its intuitive preference to yield that kind of coherent preference.)
So basically, I think it’s not enough to just get the AI to learn a “don’t be misleading” value shard: (1) it might be outvoted by other shards in cases where being misleading would be very beneficial, and (2) the optimization for other goals might find edge instantiations that are basically still misleading but don’t get classified as such. So we’d need it to be learned in exactly the right way.
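To make problems (1) and (2) a bit more concrete, here’s a deliberately crude toy sketch in Python; the shard names, scoring rule, and numbers are made-up illustrative assumptions on my part, not a claim about how a real learned value function is structured:

```python
# Toy model: a plan's overall appeal is the sum of shard evaluations. The
# "don't be misleading" shard fires only when a plan gets *classified* as
# misleading, and it contributes a fixed, finite penalty.

def task_shard(plan):
    # How much the plan advances the main goal (made-up numbers).
    return plan["task_value"]

def dont_be_misleading_shard(plan):
    return -10.0 if plan["classified_as_misleading"] else 0.0

def appeal(plan):
    return task_shard(plan) + dont_be_misleading_shard(plan)

plans = [
    {"name": "honest", "task_value": 3.0, "classified_as_misleading": False},
    # Problem (1): if the payoff is big enough, the misleading plan outvotes the shard.
    {"name": "openly misleading", "task_value": 50.0, "classified_as_misleading": True},
    # Problem (2): an edge instantiation that is misleading in effect but
    # doesn't trigger the learned classifier at all.
    {"name": "technically-true spin", "task_value": 45.0, "classified_as_misleading": False},
]

best = max(plans, key=appeal)
print(best["name"])  # -> "technically-true spin": the penalty just gets optimized around
```

Even making the penalty much larger only pushes the search toward plans that dodge the classifier, which is the nearest-unblocked-strategy worry.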
(I have an open discussion thread with Steve on his “Consequentialism and Corrigibility” post, where I mainly argue that Steve is wrong that Yudkowsky’s consequentialism is just about future states, and that it is instead about values over universe trajectories, as in the corrigibility paper. IIUC, Steve thinks that one can have “other kinds of preferences” as a way to get corrigibility. He unfortunately didn’t make it understandable to me what such a preference might concretely look like, but one possibility is that he’s thinking of such “accessor of the current situation” kinds of preferences, because humans have such short-term preferences in addition to their consequentialist goals. But I think when one cranks up intelligence, the short-term values don’t matter that much; e.g., the AI might do some kind of exposure therapy to cause the short-term value shards to update so they intervene less. Or maybe he just means we can have a coherent utility over universe trajectories whose optimum is indeed a non-deceptive strategy, which is true but not really a solution, because such a utility function may be complex and he didn’t specify how exactly the tradeoffs should be made.)
It’s possible for a person to not want to stub their toe for instrumental reasons (they’re training to win a marathon, and a toe injury would reduce their speed);
It’s also possible for a person to not want to stub their toe in and of itself, because stubbing your toe is immediately unpleasant (negative primary reward).
(Or both.) By the same token,
It’s possible that an AGI is plotting world takeover, and it wants to avoid getting caught being misleading for instrumental reasons;
It’s possible that the reward function has historically and reliably emitted an immediate negative reward every time the AGI gets caught being misleading, and then that led to an AGI that thinks of “getting caught being misleading” as immediately unpleasant, and tries to avoid that happening (other things equal), not for instrumental reasons but in and of itself.
You seem to be implying that the first bullet point is possible but the second bullet point is not, and I don’t understand why.
Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
(I certainly agree that “don’t get caught being misleading” is a dangerously unhelpful motivation, and compatible with treacherous turns. That was my whole point here.)
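Here’s a minimal toy sketch of what I mean by the second bullet point; the flag name, numbers, and running-average learner are illustrative assumptions, not a claim about how an actual AGI training setup would look:

```python
# Toy sketch: the reward function reliably emits negative reward whenever the
# AGI gets caught being misleading; a simple learned value estimate (here just
# a running average of observed reward per situation-feature) ends up tagging
# "getting caught being misleading" as unpleasant in and of itself.
from collections import defaultdict

def reward(situation):
    return -1.0 if situation["caught_misleading"] else 0.0

totals = defaultdict(float)  # accumulated reward per feature value
counts = defaultdict(int)    # number of observations per feature value

def update(situation):
    k = situation["caught_misleading"]
    totals[k] += reward(situation)
    counts[k] += 1

def learned_valence(situation):
    k = situation["caught_misleading"]
    return totals[k] / counts[k] if counts[k] else 0.0

# Training history: getting caught reliably co-occurs with negative reward.
for step in range(1000):
    update({"caught_misleading": (step % 10 == 0)})

print(learned_valence({"caught_misleading": True}))   # -1.0
print(learned_valence({"caught_misleading": False}))  #  0.0
# The resulting aversion to getting caught is attached to the situation itself,
# not derived from any instrumental plot; that's the second bullet.
```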
Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
I mean, nearest unblocked strategies are really a problem for the optimistic case where the AI learns “don’t be misleading”; but given that, yeah, sort of (though I wouldn’t say irrelevant, only that even if you have a “don’t be misleading” preference, it’s not a robust solution). Not that it’s impossible to get it right in a way such that the AI behaves as desired, but I think current proposals aren’t specified concretely enough for us to say they don’t run into undesirable nearest unblocked strategies.
One particular problem is that preferences which aren’t over world trajectories aren’t robust:
Preferences over world trajectories are robust in the sense that if you imagine a plan that changes that preference, the plan ranks poorly according to that very preference.
Myopic preferences that just trigger given a context aren’t robust in that sense—they don’t assign negative value to suggestions of removing that preference for future occasions.
Say I need to walk to work, but the fastest route goes through a passage that smells really bad, so it’s unpleasant to walk through. When I then think of a plan like “I can wear a mask that filters the air so I don’t smell anything bad”, this plan doesn’t get rejected.
A preference over world trajectories, which yields significant negative utility every time I walk through a passage that smells bad, would be more robust in this sense.
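As a toy sketch of this robustness difference (the plan representation, feature names, and numbers are just illustrative assumptions on my part):

```python
# Toy sketch: a plan is a sequence of steps. Compare a trajectory-level
# preference ("negative utility every time I walk through a bad-smelling
# passage") with a myopic preference that only fires on the experienced smell.

mask_plan = [
    {"walks_through_smelly_passage": True,  "smell_experienced": False},  # wearing the mask
    {"walks_through_smelly_passage": False, "smell_experienced": False},  # rest of the commute
]

def trajectory_preference(plan):
    # Robust: evaluates the whole trajectory under its own concept of
    # "walking through a smelly passage", so the mask doesn't help.
    return -sum(1.0 for step in plan if step["walks_through_smelly_passage"])

def myopic_preference(step):
    # Fires only on the immediate experience; it never looks at the plan as a whole.
    return -1.0 if step["smell_experienced"] else 0.0

print(trajectory_preference(mask_plan))                    # -1.0: mask plan still penalized
print(sum(myopic_preference(step) for step in mask_plan))  #  0.0: nothing triggers, plan accepted
```

The longer-horizon optimizer can deliberately steer toward situations (or self-modifications) under which the myopic pattern never fires, which is exactly the sense in which it isn’t robust.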
So I currently think the relevant preferences are preferences over world trajectories, and that other, more general kinds of preferences are better modeled as constraints for the world-trajectory-valuing part to optimize around. I know humans often have short-term preferences that get triggered myopically, but for very impressive accomplishments by humans, there was probably a more long-term coherent goal being aimed at.
I don’t know how exactly you imagine a “don’t be misleading” preference manifesting, but I imagined it more like the myopic smell preference, in which case there’s optimization pressure from the more long-term coherent parts to remove this myopic preference or prevent it from triggering. (To be clear, it’s not like that would be useless; it could still be that this suffices to make the first working plan in the search ordering a desirable one, especially if the task we want the AI to do isn’t absurdly difficult.)
(But even if it takes the more world-trajectory form of “I value not being misleading”, which would be good because it would incentivize planning to maintain that preference, there may still be problems: “not being misleading” is a fuzzy concept which has to be rebound to a more precise concept in order to evaluate plans, and it might not rebind in a desirable way. And we still haven’t specified how to trade off the “not being misleading” value against other goals.)