a confusion about preference orderings

Here’s a confusion I have about preference orderings in decision theory.

Caveat: the observations I make below feel weirdly trivial to me, to the point that I feel wary of making a post about them at all; the specter of readers rolling their eyes and thinking “oh he’s just talking about X in a really weird way” looms large in my mind as I type. It feels like I may just be unaware of some standard term or concept in the literature, which would make everything “snap into place” if I knew about it. If so, let me know.

diagrams

Let’s say I draw something like this:

[Diagram: three “world state” nodes A, B, and C, with arrows B → A, B → C, and A → C.]

Here, the letters represent something like “world states,” and an arrow like A → C means “C is preferred over A (by the ‘agent’ whose preferences this graph describes).”

For now I’m being hand-wavey about exactly what’s being expressed here, but I trust that this sort of thing will be familiar to an LW audience – as will the types of discussions and debate (about coherent preference, EU maximization, etc.) in which this sort of thing gets used.

I will make one clarification about the meaning here at the outset: everything that the agent’s preferences care about is captured by the “world states,” the capital letters. The only things in the situation capable of being “good/bad” or “better/worse” are the states.

(That’s why we can confidently draw arrows like A → C, without having to specify anything else about what’s going on. The identities of the letters appearing on the left and right sides of an arrow are always sufficient to determine the direction of the arrow.)

So, for instance: the agent doesn’t have extra, not-shown-on-the-diagram preferences about taking particular trajectories through particular sequences of states in order. Or about being in particular states at particular “times” (in any given sense of “time”). Or about anything like that.

(If we wanted to model these kinds of things, we’d need to “bring them inside the diagram” by devising a new, larger set of states that mean things like “the agent is in A after following trajectory Blah,” or whatever.)

preferred vs. accessible

Now, there’s a frequently made inference I see in those discussions, which goes from one way of interpreting the “arrows” to another.

The definition of an arrow like A → C is the one I gave above: “the agent ‘prefers’ the state C to the state A.”

This doesn’t, in itself, say that the agent will do any particular thing. Instead, it only means something like: “all else being equal, if the agent had their choice of the two states, they’d pick C.” (What “all else being equal” means is not entirely clear at this point, but we’ll figure it out as we go along.)

The inference goes beyond the definition and claims, additionally, that in a case like this diagram, the agent will only take actions that “follow the arrows,” changing the world state from less-preferred to more-preferred at each step.

In other words, the agent will act like a “greedy optimization algorithm.”
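
Concretely, here’s a minimal sketch of that greedy behavior (hypothetical names, not anyone’s canonical algorithm): the agent repeatedly follows some arrow out of its current state, and halts only when no arrow points away from where it stands.

```python
# Hypothetical sketch of "the inference": an agent that only follows the arrows.
def run_greedy(state, arrows):
    """arrows: set of (tail, head) pairs, where head is preferred over tail.
    For now, assume every arrow is also an available transition."""
    while True:
        upgrades = [head for (tail, head) in arrows if tail == state]
        if not upgrades:
            return state  # no arrow out of the current state: the agent stays put
        state = upgrades[0]  # follow any outgoing arrow (ties broken arbitrarily)
        # note: if the arrows contain a cycle, this loop never returns
```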

Now, obviously, greedy optimization isn’t always best. Sometimes you have to make things worse in the short term, as a stepping stone to making them better.

When might greedy optimization fail? Consider the diagram again:

[Same diagram as before: arrows B → A, B → C, and A → C.]

Each arrow, again, represents a pairwise preference. What the arrows don’t necessarily mean is that the agent can, in reality, make the transition directly from the state at the “tail” of the arrow to the state at the “head.”

To express this distinction, I’ll use a dashed arrow to mean “the same preference as a solid arrow, but ‘making the transition directly’ is not an available action.” (Thus, solid arrows now express both a preference, and the availability of the associated transition.)

Below, the A-to-C transition is desirable, but not available:

[Diagram: as before, but with the A → C arrow dashed; B → A and B → C remain solid.]

Now, suppose the agent starts out in state A. To convey that, I’ll put a box around A (this is the last piece of graphical notation, I swear):

[Diagram: the same as above, with a box drawn around A.]

What should the agent do, at A?

The best action, if it were available, would be to jump directly to C. But it’s not available. So, should the agent jump to B, or just stay at A forever?

Obviously, it should jump to B, because the B-to-C transition is available from there. So the agent can make it to C starting from A; it just has to pass through B first.[1]
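
In code, the “obvious” answer falls out of the most basic search imaginable. Here is a minimal sketch, with hypothetical names, that searches only over transitions that are actually available:

```python
from collections import deque

# Hypothetical encoding of the diagram above. The solid arrows B -> A and
# B -> C are preferred AND available; the dashed A -> C is preferred but
# unavailable. Following the text, I also assume the agent can simply move
# from A to B ("against" the B -> A arrow), so A -> B is an available move.
available = {"A": ["B"], "B": ["A", "C"], "C": []}

def plan(start, goal):
    """Breadth-first search over available transitions only."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in available[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # goal unreachable

print(plan("A", "C"))  # ['A', 'B', 'C']: the first step crosses B -> A the "wrong way"
```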

(Again, I worry this all sounds extremely trivial...)

But note, now, that the course of action I just said was “obviously” correct involves going “the wrong way” across the arrow connecting A to B.

That is, the agent “prefers A to B,” and yet we’re saying the agent (according to its own preferences) ought to move away from A and into B.

It’s perfectly straightforward what’s going on here. If the agent is capable of just the tiniest bit of “planning,” and isn’t constrained to only take “greedy steps,” then it can reach its favorite state, C.

However, in conversations about preference orderings, I sometimes see an implicit assumption that an arrow means both “the agent prefers one to the other” and “the agent will only choose to ‘transition’ along the arrow, not against the arrow.” (That’s the “inference” I mentioned above. In a moment I’ll present a concrete example of it.)

two readings

It seems to me that there are two ways you could interpret a directed graph representing a preference ordering, in light of the above (and given the assumption of an agent that plans ahead):

  1. Give up on “the inference.”

    1. Don’t assume that the agent will never move from a state it likes more to one it likes less. (Because it might be doing so as part of a larger plan.)

    2. This lets you retain the ability to specify a preference ordering before you say anything about which transitions are “available.” You can draw the arrows, and they mean something unambiguous – roughly “the agent will make this transition if (a) it can do so, (b) there’s no other transition it prefers even more, and (c) this is a single-step ‘bandit’ situation rather than a part of a longer sequential decision problem.”

  2. Give up on drawing the arrows before you know which transitions are available from which states.

    1. This lets you preserve the inference: the agent will always follow the arrows (but the arrow directions aren’t well defined until you know which ones are dashed and which ones are solid).

    2. For instance, in the example above, this would mean flipping the direction of the arrow between A and B in light of the fact that the A-to-C transition is unavailable.

I include 2 for completeness, but if I understand things correctly, 1 is “the right answer” if anything is.

That’s because 1 is compatible with the usual theoretical formulations of sequential decision-making (MDPs, time-discounted return, and all that stuff).

In this theoretical apparatus, you’re supposed to specify the agent’s preferences first (via something like a reward function[2]), irrespective of facts about what states are reachable from what other states. (The truth value of a statement like “the reward function returns Y when in state X” does not depend on how X might be reached, or whether it’s even reachable at all in a given episode).

Then you let the agent plan, and see where it goes.
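
In code, this separation is just two independent objects. A toy sketch (made-up numbers, hypothetical names):

```python
# Preferences come first, as a reward over states, with no reference to
# reachability. (Toy values: C is best, then A, then B.)
reward = {"A": 1.0, "B": 0.0, "C": 2.0}

# Accessibility is a separate object. Deleting an entry from a state's list
# changes what the planner outputs, but never touches `reward` itself.
available = {"A": ["B"], "B": ["A", "C"], "C": []}
```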

2 would instead introduce a new, different notion of “the agent’s preferences,” which is not the one used as an input to planning, but which instead recapitulates the outputs of planning – effectively, “what the agent’s revealed preferences would be under the assumption that it’s a greedy optimizer, even if it isn’t one.”

This construct matches neither the theoretical formalizations of “preference” just mentioned, nor the common-sense meaning of the word “preference”; relatedly, it tangles up the “preferences” with external-world facts about what’s available, rather than factoring things apart cleanly. (Consider: if I suddenly come up with a new clever scheme that requires multiple steps before getting anywhere, 2 would interpret this as “my preferences” changing so that I “prefer” the outcome of each intermediate step in the plan to the outcome just before it, thereby muddling the distinction between instrumental values and terminal values.)

preference cycles

What’s a concrete example of “the inference” I’m complaining about?

Consider non-transitive preference orderings.

These can have cycles in them, which (supposedly) mean you can get money pumped, which is (supposedly) bad.

Here’s the simplest possible example of such a cycle:

[Diagram: a three-state cycle, A → B → C → A, with a box around A.]

Now, as you well know, this is bad, because the agent could potentially go around the A → B → C loop over and over again, and that means it can get money pum...

Wait. Is this bad?

Remember, everything that matters to the agent’s preferences is included in the states.

So, according to the agent’s preferences, there can’t possibly be something undesirable about “starting at A, going to B and then C, and finally ending up at A again.”

Why? Because once you know that “the current state is A,” you know everything that matters. It doesn’t matter how many times you’ve gone around the loop; the agent’s preferences can’t see how many times it’s gone around the loop. The box around “A” above could well mean “we’ve done 10000 loop iterations, and now we’re in A”; the agent can’t tell and doesn’t care.
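
The same point in code (hypothetical names, mirroring the diagram above): a pairwise preference over states has no argument where a loop count could even be passed in.

```python
# The cyclic preferences above, as a function of states alone.
arrows = {("A", "B"), ("B", "C"), ("C", "A")}  # (tail, head): head is preferred

def prefers(x, y):
    """True iff x is preferred over y, i.e. an arrow points from y to x."""
    return (y, x) in arrows

# "A, reached after 10000 loop iterations" arrives here as plain "A"; the
# preferences cannot rank it differently from the A we started in.
print(prefers("A", "A"))  # False: a state is never dispreferred to itself
```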

Indeed, I claim this simply isn’t bad at all, conditional on the “state tells you everything” interpretation.

Okay, but why is it supposed to be bad?

Well, because the agent might have some number attached to it called “the amount of money it owns,” a number which decreases on every transition, and which thus decreases every time it goes around the loop.

If this is true – and the agent cares about this “money” number – then the graph above is wrong. Since everything the agent cares about is in the state, we’d need to bundle the money number into the state, which the graph above doesn’t do: it treats “A” as just “A,” the same identical state even after you go around and come back to it again.
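
In code, “bundling the money number into the state” just means enlarging the state type. A minimal sketch:

```python
from typing import NamedTuple

# Once the agent cares about money, the world state has to carry it:
# "A" alone is no longer a complete description.
class State(NamedTuple):
    label: str  # "A", "B", or "C" from the earlier graph
    money: int  # dollars owned

start = State("A", 10)
after_one_loop = State("A", 7)
print(start == after_one_loop)  # False: these are now genuinely different states
```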

So really, for a money pump, we need something like this:

[Diagram: A ($10) → B ($9) → C ($8) → A ($7), plus a long solid vertical arrow from A ($7) back up to A ($10).]

where “A ($X)” means “in ‘state A’ (from the previous graph), while owning $X.”

So, is this bad?

Not yet – at least if we’re keeping the notation consistent! Remember, a solid arrow means “the agent holds this preference, and also this transition is available.”

So the long vertical arrow here means “if you’re in A and you have $7, one of the things you can do is ‘magically acquire $3, without changing anything else about the world except that you now have 3 more dollars.’”

That’s not the kind of thing you can do in reality, but if it were, these preferences would be fine. They’d be equivalent to the earlier cycle diagram, just with an extra step. The same logic applies.

So a real money pump looks, instead, like this:

[Diagram: the same as above, except the vertical A ($7) → A ($10) arrow is now dashed: preferred, but unavailable.]

...well, sort of. Technically, this still isn’t quite a money pump: to actually get pumped over and over, the agent would need to be able to go from A ($7) to a state not shown here, namely B ($6), and from there to C ($5), and so on.

Well, we could draw all those states. (I’m picturing them as a helix/corkscrew in 3D, extending along the “more/less money” axis.)

But we don’t need to, because the diagram just shown already contains the essential reason why a money pump is supposed to be bad. (It’s missing the aspect where the agent can be pumped more than once, but one pass is enough to make the point.)

If the agent starts out in A ($10) and then “follows the arrows,” it will end up in A ($7) and then stay there indefinitely. (Because there are no available arrows pointing out of A ($7).)

But A ($7) is dispreferred relative to A ($10), the starting point; if you follow the arrows, you’ll end up wishing you could have just stayed put. So – if the agent is capable of planning ahead – it will realize that whatever it does, it definitely shouldn’t “follow the arrows.”[3]

What should it do instead, then?

Honestly, I’m not sure! It’s unclear to me what this diagram actually expresses about the agent’s preferences when planning is involved. Planning rules out “following the arrows all the way to ,” but the agent could still stop in one of several earlier states, and the diagram doesn’t make it clear which choice of stopping state is “most consistent with the preferences.”

My point is simply this: the same logic that applied to my first, acyclic diagram also applies to this one.

Even with the acyclic diagram, once we imagined that the agent might be able to plan ahead, we were forced to make a choice between

  1. Giving up on “the inference,” i.e. no longer assuming that the agent always follows the arrows

  2. Refusing to pick arrow orientations until we know all the facts about transition availability, and then orienting each arrow so that – by construction – the arrows always “get followed” by whatever actions the agent picks once it has done all its planning

Either one of these choices would “save” the agent with the cyclic preferences, blocking the conclusion that it’s going to do something obviously bad (moving to A ($7)).

The agent won’t do that; it can plan ahead and see that’s a bad idea, just like we can. The choice of 1 vs. 2 is just a choice about how to express that fact in the diagram.

“1” keeps the diagram as I drew it but allows for non-arrow-following actions. Here, the diagram doesn’t mean that the agent’s preferences are intrinsically “bad,” because that inference depends on the assumption of arrow-following actions.

“2” redefines the arrows so they always point in the direction of selected transitions. So, “2” would involve re-drawing the diagram so it’s no longer cyclic, again removing the “badness” from the picture.

so what?

It may sound like I’m mounting a defense of cyclic preferences, above.

That’s not what I’m doing. I don’t really care about cyclic preferences one way or the other. (In part because I don’t feel I understand what it would mean to “have” or “not have” them in practice.) Cyclic preferences do seem kind of silly, but I have a suspicion that injunctions against them are more vacuous than they initially appear.

No, what I’m saying is this:

  • If there’s something wrong with cyclic preferences in a particular case, it must involve contingent properties of the world “outside the agent” in that case.

    • If you really can go all the way around the cycle, then there’s nothing wrong with that. All the relevant information is in the state, which means the agent is indifferent to “going around the cycle” – which means that if you say “going around the cycle is bad, somehow!”, you’ve simply lost track of what the agent actually prefers.

    • In a money pump, the problem is that you can’t really go all the way around the cycle, since (it is assumed) you can’t make money magically appear.

    • Because “you can’t make money magically appear” is so obviously true in the real world, it is easy to forget that it’s an extra assumption we have to make about the structure of the environment in which our agent is acting, rather than something that comes for free whenever we draw a directed graph with a cycle in it.

  • If you assume that the world is structured so that an agent “going around a money-pump cycle” cannot just magically conjure money and thus return to its exact starting state, that is now a fact about the world which the agent could notice and plan around.

  • Permitting the agent to plan means it can avoid getting money pumped.

  • If you instead assume the agent takes greedy steps, then it does get money pumped. But the same assumption also blocks even the simplest kinds of beneficial planning under acyclic preferences, as we saw in my first diagram.

    • In other words, the money pump requires “the agent is kind of stupid (i.e. it can’t plan)” as a side assumption, and the same assumption will make the agent stupid in other cases too. The “stupid” behavior is caused by the assumption of agent stupidity, not by the cycle.

Cyclic preferences and money pumps are just an example. The more general point is that I’m confused about what people mean when they say “the agent has such-and-such preferences” and point to a directed graph or something equivalent.

I’m confused because these graphs seem to get interpreted in multiple inconsistent ways. The meanings that could be attached to an arrow include

  • “The agent would choose to transition from A to B if given the option to either move to B or stay at A, and there are no other options[4], and this is a 1-step episode where nothing happens afterwards”

    • This matches “reading 1” above, in which the agent may not follow the arrows in practice when we’re rolling out a multi-step episode

    • The existence of “bad” paths that follow the arrows is fine, because the agent does not have to follow these paths; that’s not what the arrows mean

  • “The agent would choose to transition from A to B if given the option to either move to B or stay at A, and there are no other options, but there might be further actions and transitions in the episode”

    • This matches “reading 2” above, in which the agent will never cross an arrow “the wrong way”

    • The existence of “bad” paths that follow the arrows would be a problem if it were to occur, but under optimal planning it wouldn’t occur, because the planning would notice the pattern and re-orient the arrows accordingly

  • If all state transitions were available from all states, and the agent were at A, then it would not stay at A; instead, it would transition either to B or to some other state that is in turn preferred over B

    • This is (I think?) equivalent to the first one I listed, since they’re both ways of characterizing which transitions we’d pick if we were doing greedy optimization

    • This characterization matches the familiar intuition that the preferences capture what the agent “wishes for”: which states would be better destinations than others if it were possible to “magically” reach any state you want, setting aside what is and isn’t feasible in reality

    • (As above) The existence of “bad” paths that follow the arrows is fine, because the agent does not have to follow these paths; that’s not what the arrows mean

None of this ambiguity is present in formal mathematical treatments of sequential decision-making, such as those commonly used in RL. These treatments draw all the necessary distinctions (a toy sketch follows the list):

  • The reward function expresses “what one would prefer to have if one could somehow obtain it,” not necessarily what one will do in practice

  • The transition distribution function of the MDP (or the like) expresses facts about “accessibility” between states

  • A value function, Q function, or the like expresses the desirability of a state in light of planning, where the planning (unlike the reward function) is aware of facts about accessibility and may produce different outputs if those facts are modified

  • The policy expresses what the agent actually does, typically some approximation of taking “greedy steps” with respect to the value function (not with respect to the reward function!)
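
Here is that toy sketch: a hypothetical value-iteration pass over my first (acyclic) diagram, with made-up reward numbers, purely to show the four roles living in four separate objects. Note that the resulting policy steps from A to B, crossing the B → A preference arrow backwards, just as in the planning discussion earlier.

```python
GAMMA = 0.9  # discount rate; any value in (0, 1) makes the same point

# 1. Reward: what one would prefer, specified with no reference to reachability.
reward = {"A": 1.0, "B": 0.0, "C": 2.0}

# 2. Transitions: which states are reachable from which (deterministic toy case;
#    staying put is allowed, and the dashed A -> C transition is simply absent).
available = {"A": ["A", "B"], "B": ["A", "B", "C"], "C": ["C"]}

# 3. Value: desirability in light of planning, computed here by value iteration.
V = {s: 0.0 for s in reward}
for _ in range(200):
    V = {s: max(reward[t] + GAMMA * V[t] for t in available[s]) for s in V}

# 4. Policy: greedy with respect to VALUE, not with respect to reward.
policy = {s: max(available[s], key=lambda t: reward[t] + GAMMA * V[t]) for s in V}

print(policy)  # {'A': 'B', 'B': 'C', 'C': 'C'}: from A, step "down" to B en route to C
```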

However, these formalisms tend to bake in an assumption of consistent preferences. (A scalar reward function can’t express rock-paper-scissors cycles, because it outputs reals or rationals or whatever, and the ordering relation defined over those numbers is transitive.) They proceed as if all the basics about EU maximization and preference consistency have already been hashed out, and now we’re building off of that foundation.
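
To spell out the parenthetical, a toy illustration: whatever numbers you pick, the induced relation is transitive, so a rock-paper-scissors cycle is unrepresentable.

```python
# Any scalar reward induces a transitive strict preference for free.
reward = {"rock": 1.0, "paper": 2.0, "scissors": 0.5}  # arbitrary numbers

def prefers(x, y):
    return reward[x] > reward[y]

# However the numbers are chosen, prefers() can never form a cycle, because
# ">" on the reals is transitive: no rock-paper-scissors here.
print(prefers("paper", "rock"), prefers("rock", "scissors"), prefers("scissors", "paper"))
# True True False: the would-be third leg of the cycle is ruled out
```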

But if we keep the clear distinctions that are routine in RL, and return (with those distinctions in hand) to the “earlier” line of questioning about what sorts of preferences can be considered rational, a lot of the existing debate about those “earlier” topics feels weird and confusing.

Is the agent allowed to plan, or not? Are the preferences about “reward” or about “value”? Which transitions are available – and does it matter? What are we even talking about?

  1. ^

    I’m being hand-wavey about how the agent’s preferences are aggregated across time (e.g. whether there’s a discount rate). Let’s just suppose that, even if the agent cares more about “the state 1 transition-step from now” than “the state 2 transition-steps from now,” this effect is not so strong as to override the conclusion drawn in the main text.

  2. ^

    Let’s ignore “reward is not the optimization target”-type issues here, as I don’t think anything here depends on the way that debate gets resolved.

  3. ^

    I am being a bit too quick here: I haven’t specified what “the planning algorithm” actually is, I’m just assuming it does intuitive-seeming things. Perhaps the cycle would prevent us from defining anything worthy of the name “planning algorithm”; the arrows don’t allow us to globally order the states (and hence we can’t define a utility function reflecting that order), which leaves it unclear what the “planning algorithm” should be trying to accomplish.

    Indeed, perhaps this is the real reason we should avoid cyclic preferences – the fact that we can’t meaningfully define planning over them (if it’s in fact true that we can’t). But that objection would be quite different from “they leave you open to money pumps,” which – as explained in the main text – still strikes me as either vacuous or wrong.

  4. ^

    The “no other options” provision is there to avoid thinking about other transitions that might be even-more-preferred than this one.