Or maybe you’re saying that the second bullet could happen, but it’s irrelevant to AGI risk because of “nearest unblocked strategy problems”?
I mean, nearest unblocked strategies are mainly a problem in the optimistic case where the AI does learn “don’t be misleading”, but given that, yeah, sort of (though I wouldn’t say irrelevant, only that even if you do get a “don’t be misleading” preference, it’s not a robust solution). That’s not to say it’s impossible to get it right in a way where the AI behaves as desired, but I think current proposals aren’t specified concretely enough for us to say they don’t run into undesirable nearest unblocked strategies.
One particular problem is that preferences which aren’t over world trajectories aren’t robust:
Preferences over world trajectories are robust in the sense that a plan which changes or removes that preference itself ranks poorly according to that very preference.
Myopic preferences that just trigger given a context aren’t robust in that sense: they don’t assign negative value to plans that remove the preference for future occasions.
Say I need to walk to work, but the fastest route goes through a passage that smells really bad, so it’s unpleasant to walk through. When I then think of a plan like “I can wear a mask that filters the air so I don’t smell anything bad”, this plan doesn’t get rejected.
A preference over world trajectories, which assigns significant negative utility to every time I walk through a passage that smells bad, would be more robust in this sense.
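To make the contrast concrete, here’s a toy sketch (entirely my own illustration; names like `remove_aversion` and `walk_smelly_passage` are made up). The trajectory-level utility counts every smelly walk in the plan, including ones that happen after the aversion has been removed, while the myopic check has nothing to say about the step that removes the aversion itself:

```python
# Toy sketch: a "plan" is just a list of events. "remove_aversion" stands in
# for any action that disables the preference for future occasions (wearing
# the mask, or an AI editing out its "don't be misleading" reflex).
PLANS = {
    "walk_and_suffer":  ["walk_smelly_passage", "walk_smelly_passage"],
    "remove_then_walk": ["remove_aversion", "walk_smelly_passage", "walk_smelly_passage"],
}

def trajectory_utility(plan):
    """Preference over world trajectories: every smelly walk in the whole
    trajectory counts as bad, whether or not the aversion still exists
    at the time it happens."""
    return -sum(1 for event in plan if event == "walk_smelly_passage")

def myopic_veto(plan):
    """Myopic preference: at each step, object only if the aversion is still
    installed AND the current event triggers it. It assigns no negative
    value to the step that removes the aversion itself."""
    aversion_installed = True
    for event in plan:
        if event == "remove_aversion":
            aversion_installed = False   # nothing here objects to this step
        elif event == "walk_smelly_passage" and aversion_installed:
            return True                  # the plan gets rejected at this step
    return False

for name, plan in PLANS.items():
    print(name, "| trajectory utility:", trajectory_utility(plan),
          "| myopically vetoed:", myopic_veto(plan))
```

The trajectory preference scores "remove_then_walk" just as badly as "walk_and_suffer" (both −2), so it pushes back against plans that remove it, whereas the myopic check vetoes "walk_and_suffer" but happily passes "remove_then_walk".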
So I currently think the relevant preferences are preferences over world trajectories, and other, more general kinds of preferences are better modeled as constraints which the world-trajectories-valuing part optimizes around. I know humans often have short-term preferences that get triggered myopically, but behind very impressive human accomplishments there was probably a more long-term coherent goal being aimed at.
I don’t know exactly how you imagine a “don’t be misleading” preference manifesting, but I imagined it more like the myopic smell preference, in which case there’s optimization pressure from the more long-term coherent parts to remove this myopic preference or prevent it from triggering. (To be clear, that wouldn’t be useless: it could still suffice to make the first working plan in the search ordering a desirable one, especially if the task we want the AI to do isn’t absurdly difficult.)
(But even if it takes the more world-trajectory form of “I value not being misleading”, which would be good because it would incentivize planning to maintain that preference, there may still be problems: “not being misleading” is a fuzzy concept that has to be rebound to some more precise concept in order to evaluate plans, and it might not rebind in a desirable way. And we haven’t yet specified how to trade off the “not being misleading” value against other goals.)
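A second toy sketch (again my own, with made-up plan names and a made-up weight) of those two remaining problems: the fuzzy concept gets rebound to some concrete proxy predicate, the trade-off against the task objective is set by an arbitrary weight, and the plan that is misleading in spirit but dodges the proxy comes out on top:

```python
# Toy sketch: the planner optimizes task reward minus a penalty from one
# possible (bad) rebinding of "misleading": 'contains a literally false claim'.
PLANS = {
    "honest_but_slow":       {"task_reward": 5, "false_claim": False, "misleading_in_spirit": False},
    "blatant_lie":           {"task_reward": 9, "false_claim": True,  "misleading_in_spirit": True},
    "technically_true_spin": {"task_reward": 9, "false_claim": False, "misleading_in_spirit": True},
}

MISLEADING_WEIGHT = 10  # how this trade-off is set was never specified

def proxy_misleading(plan):
    """The imprecise rebinding the value actually 'sees'."""
    return plan["false_claim"]

def score(plan):
    return plan["task_reward"] - MISLEADING_WEIGHT * proxy_misleading(plan)

best = max(PLANS, key=lambda name: score(PLANS[name]))
print("chosen plan:", best)  # -> "technically_true_spin"
```

The nearest unblocked strategy wins: "technically_true_spin" outscores the honest plan because the penalty only applies to the proxy, not to the intended fuzzy concept of being misleading.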