I (tentatively) disagree with the frame here, because “don’t get caught being misleading” isn’t a utility-shard over world-trajectories, but rather just a myopic value accessor on the model of a current situation (IIUC)
I don’t quite understand, some jargon might have to be unpacked.
Why shouldn’t there be optimization pressure to steer the world in a quite general, even self-reflective, way so that you don’t get caught being misleading?
Or are you saying something like “don’t get caught being misleading” is somehow automatically ‘too small an idea’ to be a thing worth talking about the AI learning to scheme because of?
I do think there’s optimization pressure to steer toward not getting caught being misleading, but I think it comes from planning how to achieve other goals while modelling reality accurately, rather than from the AI learning to directly value “don’t get caught being misleading” in its learned value function.
Though possibly the AI could still learn to value this (or alternatively to value “don’t be misleading”), in such a case these value shards seem more like heuristic value estimators applied to particular situations than a deeply coherent utility specification over universe-trajectories. And I think such other kinds of preferences probably don’t matter much once you crank up intelligence past the human level, because they will be treated as constraints to be optimized around by the more coherent value parts, and you run into nearest unblocked strategy problems. (I mean, you could have a preference over universe-trajectories that at no timestep you be misleading, but given the learning setup I would expect a shallower version of that preference to be learned. Though it’s also conceivable that the AI rebinds its intuitive preference to yield that kind of coherent preference.)
So basically I think it’s not enough to just get the AI to learn a “don’t be misleading” value shard, because (1) it might be outvoted by other shards in cases where being misleading would be very beneficial, and (2) the optimization for other goals might find edge instantiations that are basically still misleading but don’t get classified as such. So we’d need to learn it in exactly the right way.
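To make those two failure modes concrete, here is a minimal toy sketch (entirely my own illustration; the shard names, thresholds, and numbers are all made up) of a shard-voting setup where a “don’t be misleading” shard gets outvoted when the stakes are high, and misses an edge instantiation its classifier doesn’t flag:

```python
# Toy sketch of the two failure modes just mentioned (all names and numbers hypothetical).

from typing import Dict

Plan = Dict[str, float]   # hypothetical features of a candidate plan

def misleading_shard(plan: Plan) -> float:
    # (2) Relies on a learned classifier: an edge case that is "basically still
    # misleading" but scores below the threshold goes unpenalized.
    return -5.0 if plan["looks_misleading"] > 0.5 else 0.0

def other_goal_shards(plan: Plan) -> float:
    return plan["expected_benefit"]

def total_value(plan: Plan) -> float:
    # Shards simply vote by summing their evaluations.
    return misleading_shard(plan) + other_goal_shards(plan)

honest_plan = {"looks_misleading": 0.0, "expected_benefit": 1.0}
blatant_lie = {"looks_misleading": 0.9, "expected_benefit": 8.0}   # (1) high stakes outvote the shard
subtle_spin = {"looks_misleading": 0.4, "expected_benefit": 3.0}   # (2) slips past the classifier

for name, plan in [("honest", honest_plan), ("blatant", blatant_lie), ("subtle", subtle_spin)]:
    print(name, total_value(plan))
# honest 1.0, blatant 3.0, subtle 3.0 -- both misleading plans beat the honest one.
```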
(I have an open discussion thread with Steve on his “Consequentialism and Corrigibility” post, where I mainly argue that Steve is wrong about Yud’s consequentialism being just about future states, and that it is instead about values over universe trajectories like in the corrigibility paper. IIUC Steve thinks that one can have “other kinds of preferences” as a way to get corrigibility. He unfortunately didn’t make it clear to me what such a preference might look like concretely, but one possibility is that he has in mind such “accessor of the current situation” kinds of preferences, because humans have such short-term preferences in addition to their consequentialist goals. But I think when one cranks up intelligence, the short-term values don’t matter that much. E.g. the AI might do some kind of exposure therapy to cause the short-term value shards to update so they intervene less. Or maybe he just means we can have a coherent utility over universe trajectories whose optimum is indeed a non-deceptive strategy, which is true but not really a solution, because such a utility function may be complex and he didn’t specify how exactly the tradeoffs should be made.)
It’s possible for a person to not want to stub their toe for instrumental reasons (they’re training to win a marathon, and a toe injury would reduce their speed);
It’s also possible for a person to not want to stub their toe in and of itself, because stubbing your toe is immediately unpleasant (negative primary reward).
(Or both.) By the same token,
It’s possible that an AGI is plotting world takeover, and it wants to avoid getting caught being misleading for instrumental reasons;
It’s possible that the reward function has historically and reliably emitted an immediate negative reward every time the AGI gets caught being misleading, and then that led to an AGI that thinks of “getting caught being misleading” as immediately unpleasant, and tries to avoid that happening (other things equal), not for instrumental reasons but in and of itself.
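A minimal toy sketch of that second bullet (a made-up reward function and a trivial running-average value learner; purely a hypothetical illustration, not a claim about any real training setup):

```python
# Toy sketch: the reward function reliably emits an immediate negative reward
# whenever the "caught being misleading" event fires, and a simple running-average
# value estimate ends up treating that event as unpleasant in itself.

import random

situations = ["nothing_special", "caught_being_misleading"]
learned_value = {s: 0.0 for s in situations}   # the agent's cached evaluations
ALPHA = 0.05                                   # learning rate

def reward(situation: str) -> float:
    # Hypothetical reward function: immediate penalty when caught, else a small task reward.
    return -1.0 if situation == "caught_being_misleading" else 0.1

for _ in range(20_000):
    s = random.choice(situations)
    # Update the cached value toward the immediate reward (myopic credit assignment).
    learned_value[s] += ALPHA * (reward(s) - learned_value[s])

# "caught_being_misleading" ends up valued near -1.0: disvalued in and of itself,
# independently of any instrumental plan.
print(learned_value)
```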
You seem to be implying that the first bullet point is possible but the second bullet point is not, and I don’t understand why.
Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
(I certainly agree that “don’t get caught being misleading” is a dangerously unhelpful motivation, and compatible with treacherous turns. That was my whole point here.)
Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
I mean, the nearest unblocked strategies are mainly a problem in the optimistic case where the AI learns “don’t be misleading”, but given that, yeah, sort of (though I wouldn’t say irrelevant, only that even if you have a “don’t be misleading” preference, it’s not a robust solution). Not that it’s impossible to get it right in a way such that the AI behaves as desired, but I think current proposals aren’t specified concretely enough for us to say they don’t run into undesirable nearest unblocked strategies.
One particular problem is that preferences which aren’t over world trajectories aren’t robust:
Preferences over world trajectories are robust in the sense that a change to that preference itself ranks poorly according to that very preference.
Myopic preferences that just trigger given a context aren’t robust in that sense—they don’t assign negative value to suggestions of removing that preference for future occasions.
Say I need to walk to work, but the fastest route goes through a passage that smells really bad, so it’s unpleasant to walk through. When I then think of a plan like “I can wear a mask that filters the air so I don’t smell anything bad”, this plan doesn’t get rejected.
A preference over world trajectories, which yields significant negative utility for every time I walk through a passage that smells bad, would be more robust in this sense.
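Here is a toy sketch of that smell example (a made-up plan representation; purely illustrative), showing why the mask plan slips past the myopic preference but not past the trajectory-level one:

```python
# Toy sketch of the smelly-passage example (hypothetical representation of plans
# as lists of timesteps).

from typing import Dict, List

Step = Dict[str, bool]   # e.g. {"in_smelly_passage": True, "smell_experienced": True}

def myopic_smell_preference(step: Step) -> float:
    # Triggers only on the smell as experienced in the current context.
    return -1.0 if step["smell_experienced"] else 0.0

def trajectory_preference(plan: List[Step]) -> float:
    # Penalizes every timestep spent in the smelly passage, mask or no mask.
    return sum(-1.0 for step in plan if step["in_smelly_passage"])

no_mask_plan = [{"in_smelly_passage": True, "smell_experienced": True}]
mask_plan    = [{"in_smelly_passage": True, "smell_experienced": False}]

# The myopic preference is silent about the mask plan, so nothing vetoes it;
# the trajectory preference can't be routed around by blocking its trigger.
print(myopic_smell_preference(mask_plan[0]), myopic_smell_preference(no_mask_plan[0]))  # 0.0 -1.0
print(trajectory_preference(mask_plan), trajectory_preference(no_mask_plan))            # -1.0 -1.0
```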
So I currently think the relevant preferences are preferences over world trajectories, and the other, more general kinds of preferences are better modeled as constraints for the world-trajectories-valuing part to optimize around. I know humans often have short-term preferences that get triggered myopically, but behind very impressive human accomplishments there was probably a more long-term coherent goal being aimed at.
I don’t know exactly how you imagine a “don’t be misleading” preference manifesting, but I imagined it more like the myopic smell preference, in which case there’s optimization pressure from the more long-term coherent parts to remove this myopic preference or prevent it from triggering. (To be clear, it’s not that this would be useless; it could still suffice to make the first working plan in the search ordering a desirable one, especially if the task we want the AI to do isn’t absurdly difficult.)
(But even if it takes the more world-trajectory form of “I value not being misleading”, which would be good because it would incentivize planning to maintain that preference, there may still be problems, because “not being misleading” is a fuzzy concept that has to be rebound to a more precise concept in order to evaluate plans, and it might not rebind in a desirable way. And we haven’t yet specified how to trade off the “not being misleading” value against other goals.)
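As a purely hypothetical way to picture that last gap: a combined utility over trajectories in which everything hangs on a weight and on a rebound concept that nobody has specified.

```python
# Toy sketch (hypothetical): a coherent utility over trajectories that combines task
# value with a "not being misleading" value. The behaviour hinges on the weight LAMBDA
# and on how `is_misleading` rebinds the fuzzy concept; neither has been specified.

from typing import Dict, List

Step = Dict[str, float]
Trajectory = List[Step]

LAMBDA = 1.0   # unspecified trade-off weight

def is_misleading(step: Step) -> bool:
    # Stand-in for the rebound, more precise version of the fuzzy concept.
    return step["misleadingness"] > 0.5

def task_utility(traj: Trajectory) -> float:
    return sum(step["task_progress"] for step in traj)

def total_utility(traj: Trajectory) -> float:
    honesty_penalty = sum(1.0 for step in traj if is_misleading(step))
    return task_utility(traj) - LAMBDA * honesty_penalty

honest_path   = [{"task_progress": 1.0, "misleadingness": 0.0}]
shortcut_path = [{"task_progress": 3.0, "misleadingness": 0.9}]

# With LAMBDA = 1.0 the misleading shortcut still wins (3.0 - 1.0 > 1.0);
# with LAMBDA = 3.0 it wouldn't. Nothing above says which weight to pick.
print(total_utility(honest_path), total_utility(shortcut_path))
```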