When I was a kid I used to love playing RuneScape. One day I had what seemed like a deep insight. Why did I want to kill enemies and complete quests? In order to level up and get better equipment. Why did I want to level up and get better equipment? In order to kill enemies and complete quests. It all seemed a bit empty and circular. I don’t think I stopped playing RuneScape after that, but I would think about it every now and again and it would give me pause. In hindsight, my motivations weren’t really circular — I was playing RuneScape in order to have fun, and the rest was just instrumental to that.
But my question now is — is a true circular consequentialist a stable agent that can exist in the world? An agent that wants X only because it leads to Y, and wants Y only because it leads to X?
Note, I don’t think this is the same as a circular preferences situation. The agent isn’t swapping X for Y and then Y for X over and over again, ready to get money pumped by some clever observer. It’s getting more and more X, and more and more Y over time.
Obviously if it terminally cares about both X and Y, or cares about them both instrumentally for some other purpose, a normal consequentialist could display this behaviour. But do you even need terminal goals here? Can you have an agent that only cares about X instrumentally for its effect on Y, and only cares about Y instrumentally for its effect on X? In order for this to be different from just caring about X and Y terminally, I think a necessary property is that the only path through which the agent is trying to increase Y is X, and the only path through which it’s trying to increase X is Y.
Related thought: Having a circular preference may be preferable in terms of energy expenditure/fulfillability, because it can be implemented on a reversible computer and fulfilled infinitely without deleting any bits. (Not sure if this works with instrumental goals.)
I’m an expert on Nietzsche (I’ve read some of his books), but not a world-leading expert (I didn’t understand them). And one of the parts I didn’t understand was the psychological appeal of all this. So you’re Caesar, you’re an amazing general, and you totally wipe the floor with the Gauls. You’re a glorious military genius and will be celebrated forever in song. So . . . what? Is beating other people an end in itself? I don’t know, I guess this is how it works in sports. But I’ve never found sports too interesting either. Also, if you defeat the Gallic armies enough times, you might find yourself ruling Gaul and making decisions about its future. Don’t you need some kind of lodestar beyond “I really like beating people”? Doesn’t that have to be something about leaving the world a better place than you found it?
Admittedly altruism also has some of this same problem. Auden said that “God put us on Earth to help others; what the others are here for, I don’t know.” At some point altruism has to bottom out in something other than altruism. Otherwise it’s all a Ponzi scheme, just people saving meaningless lives for no reason until the last life is saved and it all collapses.
I have no real answer to this question—which, in case you missed it, is “what is the meaning of life?” But I do really enjoy playing Civilization IV. And the basic structure of Civilization IV is “you mine resources, so you can build units, so you can conquer territory, so you can mine more resources, so you can build more units, so you can conquer more territory”. There are sidequests that make it less obvious. And you can eventually win by completing the tech tree (he who has ears to hear, let him listen). But the basic structure is A → B → C → A → B → C. And it’s really fun! If there’s enough bright colors, shiny toys, razor-edge battles, and risk of failure, then the kind of ratchet-y-ness of it all, the spiral where you’re doing the same things but in a bigger way each time, turns into a virtuous repetition, repetitive only in the same sense as a poem, or a melody, or the cycle of generations.
The closest I can get to the meaning of life is one of these repetitive melodies. I want to be happy so I can be strong. I want to be strong so I can be helpful. I want to be helpful because it makes me happy.
I want to help other people in order to exalt and glorify civilization. I want to exalt and glorify civilization so it can make people happy. I want them to be happy so they can be strong. I want them to be strong so they can exalt and glorify civilization. I want to exalt and glorify civilization in order to help other people.
I want to create great art to make other people happy. I want them to be happy so they can be strong. I want them to be strong so they can exalt and glorify civilization. I want to exalt and glorify civilization so it can create more great art.
I want to have children so they can be happy. I want them to be happy so they can be strong. I want them to be strong so they can raise more children. I want them to raise more children so they can exalt and glorify civilization. I want to exalt and glorify civilization so it can help more people. I want to help people so they can have more children. I want them to have children so they can be happy.
Maybe at some point there’s a hidden offramp marked “TERMINAL VALUE”. But it will be many more cycles around the spiral before I find it, and the trip itself is pleasant enough.
One way to think about this might be to cast it in the language of conditional probability. Perhaps we are modeling our agent as it makes choices between two world states, A and B, based on their predicted levels of X and Y. If P(A) is the probability that the agent chooses state A, and P(A|X) and P(A|Y) are the probabilities of choosing A given knowledge of predictions about the levels of X and Y respectively in state A vs. state B, then it seems obvious to me that “cares about X only because it leads to Y” can be expressed as P(A|XY) = P(A|Y): once we know its predictions about Y, X tells us nothing more about its likelihood of choosing state A. Likewise, “cares about Y only because it leads to X” could be expressed as P(A|XY) = P(A|X). The statement “the agent cares about X only because it leads to Y, and it cares about Y only because it leads to X” then seems to say that P(A|XY) = P(A|Y) ∧ P(A|XY) = P(A|X), which implies that P(A|Y) = P(A|X): X and Y carry exactly the same information about the agent’s choice.
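To make these identities concrete, here is a toy numeric check (just a sketch; the joint distribution is entirely made up). X and Y are binary predictions that are perfectly correlated, and the agent’s choice of state A depends only on their shared value, so conditioning on either variable alone gives the same answer as conditioning on both:

```python
# Made-up toy model: X and Y always agree (perfect correlation), and the
# agent's probability of choosing state A depends only on that shared value.
p_signal = {1: 0.6, 0: 0.4}    # P(X = Y = v)
p_choose_a = {1: 0.9, 0: 0.2}  # P(agent chooses A | X = Y = v)

# Joint distribution over (x, y, chose_a); states with x != y have zero mass.
joint = {}
for v in (0, 1):
    joint[(v, v, True)] = p_signal[v] * p_choose_a[v]
    joint[(v, v, False)] = p_signal[v] * (1 - p_choose_a[v])

def cond_p_a(**given):
    """P(chose A | given), computed by summing over the joint table."""
    match = lambda x, y: all({"x": x, "y": y}[k] == v for k, v in given.items())
    num = sum(p for (x, y, a), p in joint.items() if a and match(x, y))
    den = sum(p for (x, y, _), p in joint.items() if match(x, y))
    return num / den

# P(A|XY) = P(A|Y) = P(A|X): given either variable, the other adds nothing.
for v in (0, 1):
    assert abs(cond_p_a(x=v, y=v) - cond_p_a(y=v)) < 1e-12
    assert abs(cond_p_a(x=v, y=v) - cond_p_a(x=v)) < 1e-12
```

Of course, this only shows the conditional-probability conditions are satisfiable; whether being predictive of the agent’s choice is the same thing as being cared about is a separate question.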
However, I don’t think that this quite captures the spirit of the question, since the idea that the agent “cares about X and Y” isn’t the same thing as X and Y being predictive of which state the agent will choose. It seems like what’s wanted is a formal way to say “the only things that ‘matter’ in this world are X and Y,” which is not the same thing as saying “X and Y are the only dimensions on which world states are mapped.” We could imagine a function that takes the level of X and Y in two world states, A and B, and returns a preference order {A > B, B > A, A = B, incomparable}. But who’s to say this function isn’t just capturing an empirical regularity, rather than expressing some fundamental truth about why X and Y control the agent’s preference for A or B? However, I think that’s an issue even in the absence of any sort of circular reasoning.
A machine learning model’s training process is effectively just a way to generate a function that consistently maps an input vector to an output that minimizes the loss function. The model doesn’t “really” value reward or avoidance of loss any more than our brains “really” value dopamine, and as far as I know, nobody has a mathematical definition of what it means to “really” value something, as opposed to behaving in a way that consistently tends to optimize for a target. From that point of view, maybe saying that P(A|XY) = P(A|Y) really is the best we can do to mathematically express “he only cares about Y,” and P(A|X) = P(A|Y) is the best we can do to express “he only cares about Y to get X and only cares about X to get Y.”
If you can tell me in math what that means, then you can probably make a system that does it. No guarantees on it being distinct from a more “boring” specification though.
Here’s my shot: You’re searching a game tree, and come to a state that has some X and some Y. You compute a “value of X” that’s the total discounted future “value of Y” you’ll get, conditional on your actual policy, relative to a counterfactual where you have some baseline level of X. And also you compute the “value of Y,” which is the same except it’s the (discounted, conditional, relative) expected total “value of X” you’ll get. You pick actions to steer towards a high sum of these values.
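A depth-limited sketch of that mutual recursion (all the numbers here are made-up assumptions, and a heuristic leaf value stands in for whatever happens at the search horizon):

```python
GAMMA = 0.9      # discount factor (assumed)
LEAF_VIBE = 1.0  # assumed heuristic value at the search horizon

def value_of_x(depth):
    """Value of a unit of X: discounted total "value of Y" it leads to."""
    if depth == 0:
        return LEAF_VIBE  # heuristic cutoff, not a terminal value
    y_gained = 1.2        # assumed: a unit of X yields 1.2 units of Y
    return GAMMA * y_gained * value_of_y(depth - 1)

def value_of_y(depth):
    """Value of a unit of Y: discounted total "value of X" it leads to."""
    if depth == 0:
        return LEAF_VIBE
    x_gained = 1.2        # assumed: a unit of Y yields 1.2 units of X
    return GAMMA * x_gained * value_of_x(depth - 1)

# With GAMMA * yield > 1, the valuation grows with search depth: each extra
# trip around the X -> Y -> X loop looks better than the last.
print(value_of_x(20))  # (0.9 * 1.2) ** 20, roughly 4.66
```

Note that everything hangs on the leaf heuristic: set LEAF_VIBE to 0 and every value in the tower collapses to 0, since nothing inside the loop is valued for its own sake.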
I think formally, such a circular consequentialist agent should not exist, since running a calculation of X’s utility either:
- returns 0 utility, by avoiding self-reference, or
- runs in an endless loop and overflows the stack, never returning any utility.
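The second horn is easy to exhibit directly. A minimal sketch (the 0.9 discount factor is an arbitrary assumption) in which each utility is defined only in terms of the other, with no base case:

```python
def utility_x():
    # "X has utility only because it leads to Y"...
    return 0.9 * utility_y()

def utility_y():
    # ...and "Y has utility only because it leads to X".
    return 0.9 * utility_x()

def evaluate():
    """Try to compute X's utility; report whether the calculation returns."""
    try:
        return utility_x()
    except RecursionError:
        return None  # the valuation never bottoms out

assert evaluate() is None  # blows the stack instead of returning a utility
```

(With a discount factor below 1, the unique fixed point of these two equations is 0, matching the first horn; the naive recursive calculation just never gets there.)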
However, my guess is that in practice such an agent could exist, if we don’t insist on it being perfectly rational.
Instead of running a calculation of X’s utility, it has an intuitive guesstimate for X’s utility: a “vibe” for how much utility X has.
Over time, it adjusts its guesstimate of X’s utility based on whether X helps it acquire other things which have utility. If it discovers that X doesn’t achieve anything, it might reduce its guesstimate of X’s utility. However, if it discovers that X helps it acquire Y, which helps it acquire Z, and its guesstimate of Z’s utility is high, then it might increase its guesstimate of X’s utility.
And it may stay in an equilibrium where it guesses that all of these things have utility, because all of these things help it acquire one another.
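A sketch of such an equilibrium (the learning rate, discount, and initial “vibes” are all made-up assumptions): each good’s guessed utility is repeatedly nudged toward the guessed utility of the good it helps acquire:

```python
ALPHA = 0.5  # learning rate (assumed)
GAMMA = 1.0  # assumed: no discounting; a unit of each good buys one of the other

# Initial "vibes": rough, unjustified guesses at each good's utility.
vibes = {"X": 2.0, "Y": 0.5}

def update(v):
    """Bootstrap each guess toward the (discounted) guess for the other good."""
    return {
        "X": (1 - ALPHA) * v["X"] + ALPHA * GAMMA * v["Y"],
        "Y": (1 - ALPHA) * v["Y"] + ALPHA * GAMMA * v["X"],
    }

for _ in range(100):
    vibes = update(vibes)

# Both guesses settle at 1.25 (the average of the initial vibes): a stable,
# self-supporting loop where each good is valued only via the other.
```

Set GAMMA below 1 and the same loop drains to zero, so this equilibrium seems to depend on the circular vibes not being discounted, or on fresh evidence of usefulness continually topping them up.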
I think the reason you value the items in the video game is that humans have the mesa-optimizer goal of “success”: having something under your control grow, improve, and be preserved.
Maybe one hope is that the artificial superintelligence will also have a bit of this goal, and place a bit of this value on humanity and what we wish for, though obviously this can go wrong.
That does not look like state-valued consequentialism as we typically see it, but like act-valued consequentialism (in a Markov model, this is the intrinsic value of the act plus the expected value of the sum of future actions): an agent that places value on the acts “use existing X to get more Y” and “use existing Y to get more X”. I mean, how is this different from placing value on actions that produce Y from X and actions that produce X from Y, where the value scales with the amounts of X and Y in a particular action?
It looks money-pump resistant because it wants to take those actions as many times as possible, as well as possible, and a money pump generally requires that the scale of the transactions drops over time (as the pumper extracts resources). But then the trade is inefficient. There are probably benefits to being an efficient counterparty, but money-pumpers are inefficient counterparties.
I’m confused about how to think about this idea, but I really appreciate having it in my collection of ideas.
There’s a version of this that might make sense to you, at least if the Scott Alexander passage quoted above resonates.