I might be missing something that’s written on this page, including the comments, but if not, here is my rough understanding of what people might fear regarding money pumps. I’m going to diverge from your model a bit and use the concept of a sub-world-state, denoted A’, B’, and C’, which includes everything about the world that can be preferred except for how much money you have; I handle the money separately in this comment.
A’ → B’ → C’ → A’ preferences hold in a cycle.
M_less → M_more preference also holds for the amount of money held.
I think agents, either intrinsically or instrumentally, (have to) simplify their decisions by factoring them at each timestep.
So they ask themselves:
Do I prefer going from A’ → B’ more than having M_less → M_more, more concretely M_Δ−1 → M_Δ0?
In this example, the non-money preference is strong, so the answer is clearly yes.
Even if the agent plans ahead a bit, and considers:
Do I prefer A’ → B’ → C’ more than having M_Δ−2 → M_Δ0?
The answer will still be a clear yes.
The interesting question is what someone who fears money pumps would say an agent does if it occurs to the agent to plan far enough ahead and consider:
Do I prefer A’ → B’ → C’ → A’ more than having M_Δ−3 → M_Δ0?
According to both my framing here and your formalisms, this agent should clearly realize that they much prefer M_Δ−3 → M_Δ0 and stay put at A’. And I think you are correct to ask whether planning is allowed by these different formalisms, and how it fits in.
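To make the planning-horizon point concrete, here is a toy sketch. The numbers and the chaining rule are my own assumptions for illustration, not anything from your post: the agent chains per-step preference strengths along its planned path, but notices when the plan ends in the same sub-world-state it started in, in which case only the lost money remains.

```python
# Toy model of the planning-horizon point above. All numbers are made up.

# Cyclic non-money preferences: each arrow is strongly preferred on its own.
STEP_PREFERENCE = {("A'", "B'"): 10, ("B'", "C'"): 10, ("C'", "A'"): 10}
NEXT = {"A'": "B'", "B'": "C'", "C'": "A'"}
MONEY_COST_PER_TRADE = 1  # each trade loses one unit of money

def evaluate_plan(start, horizon):
    """Net desirability of following the cycle for `horizon` steps vs. staying put.

    Assumption: the agent chains per-step preference strengths along the path,
    but if the plan ends in the same sub-world-state it started in, the
    non-money part nets out to zero and only the lost money remains.
    """
    state, non_money_gain = start, 0
    for _ in range(horizon):
        nxt = NEXT[state]
        non_money_gain += STEP_PREFERENCE[(state, nxt)]
        state = nxt
    if state == start:  # full cycle: back to the same sub-world-state
        non_money_gain = 0
    return non_money_gain - MONEY_COST_PER_TRADE * horizon

for horizon in (1, 2, 3):
    net = evaluate_plan("A'", horizon)
    verdict = "go" if net > 0 else "stay put"
    print(f"lookahead {horizon}: net {net:+d} -> {verdict}")
```

With a lookahead of 1 or 2 the plan looks net positive, but at a lookahead of 3 the non-money part cancels and only the lost money remains, so the agent stays put.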
I think concerns come in two flavors:
One is how you put it: if the agent is stupid (or, more charitably, computationally bounded, as we all are), they might not realize that they are going in circles and trading away value in the process for no (comparable(?)) benefit to themselves. Maybe agents are more prone to notice repetition and stop after a few cycles than to predict it in advance, since prediction and planning are famously hard.
The other concern is what we seem to notice in other humans, and might notice in ourselves as well (and therefore might in practice diverge from idealized formalisms): sometimes we know or strongly suspect that something is likely not a good choice, and yet we do it anyway. How come? One simple answer is how preference evaluation works in humans: if A’ → B’ is strongly enough preferred in itself, knowing or suspecting the answer to the A’ → B’ → C’ → A’ versus M_Δ−3 → M_Δ0 comparison might not be strong enough to override it.
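A toy sketch of that failure mode, again with made-up numbers and an assumed form of near-term bias, just to illustrate: the immediate pull of A’ → B’ is felt at full strength, while the known net loss of the full loop only enters with a small weight.

```python
# Toy sketch of the 'knowing but doing it anyway' failure mode above.
# Assumption (mine, for illustration): the decision is dominated by the
# immediate A' -> B' preference, and the long-horizon conclusion only
# enters with a small weight, as a kind of near-term bias.

IMMEDIATE_PULL = 10      # strength of the A' -> B' preference, felt now
FULL_CYCLE_NET = -3      # known (or suspected) net value of the whole loop
LONG_VIEW_WEIGHT = 0.2   # how much the agent actually weighs that knowledge

def biased_choice():
    felt_value = IMMEDIATE_PULL + LONG_VIEW_WEIGHT * FULL_CYCLE_NET
    return "take the step anyway" if felt_value > 0 else "stay put"

print(biased_choice())  # -> "take the step anyway", despite the known net loss
```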
It might be important that, if we can, we construct agents that do not exhibit this ‘flaw’. Although one might need to be careful with such wishes, since such an agent might monomaniacally pursue a peak and then statically stick to it once reached, which humans might dis-prefer, which might itself be incoherent. (This has interested me for a while, and I am not yet convinced that human values do not contain some (fundamental(?)) incoherence, e.g. in the form of such loops. For better or for worse, I expanded a bit on this below, though not at all formally and, I fear, less than clearly.)
So in summary, I think that if an agent
has static preferences over complete world states
is computationally unbounded (enough), plans, and
does not ‘suffer’ from the kind of near-term bias that humans seem to
then it cannot be made worse off by money pumps around things it cares about.
I think it is very important to get as clear on your questions as we can, and I have only responded to a small part of what you wrote. I might respond to more, hopefully in a more targeted and clearer way, if I have more time later. And I also really hope that others provide answers to your questions as well.
Some bonus pondering is below, much less connected to your post; it just felt nice to think through this a little and perhaps invite others’ thoughts on it as well.
Let’s imagine the terminus of human preference satisfaction. Let’s assume that all preferences are fulfilled and, importantly, in a non-wire-headed fashion. What would that look like, at least abstractly?
a) Rejecting the premise: all (human) preferences can never be fulfilled. If one has n Dyson spheres, one can always wish for n+1. There will always be an orderable list of world states that we can inch ever higher on. And even if it’s hard to imagine what someone might desire if they could have everything we can currently imagine, by definition new desires will always spring up. In a sense, dissatisfaction might be a constant companion.
b) We find a static peak of human preferences. Hard to imagine what this might be, especially if we ruled out wireheading. Hard to imagine not dis-preferring it at least a little.
c) A (small or large) cycle is found and occupied at the top. This might also fulfill the (meta-)preference against boringness. But it’s hard to escape that this might have to be a cycle. And if nothing else we are spending negentropy anyway, so maybe this is a helical money-pump spiral to the bottom still?
d) Something more chaotic is happening at the top of the preferences, with no true cycles, but maybe dynamically changing fads and fashions, never deviating much from the peak. It is hard to see how or why states would be transitioned to and from if one believes in cardinal utility. This still spends negentropy, but if we never truly return to a prior world-state even apart from that, maybe it’s not a cycle in the formal sense?
I welcome thoughts and votes on the above possibilities.