The thing is, we don’t have to confine ourselves to philosophy. For roughly half a century there has also been a scientific discipline studying morality: Evolutionary Moral Psychology. It tells us how and why humans, as social primates living in large, mostly-not-kin groups, evolved their moral instincts, which are about iterated non-zero-sum games and the forming and breaking of alliances within them. In that context, the statement:
“…there is no social payoff to not pressing the button in any material way. This person and their family might as well exist on the other side of the planet. Any extraneous or indirect reward for not pressing the button by means of future-cooperative benefit is moot.”
is almost never true. Our moral instincts are tuned to assume that there is always a possible social effect from torturing and killing another sapient being. You may think you’re going to get away with it, but sooner or later you often won’t.
So you need to make this an iterated game with multiple players, imperfect information, and imperfect secrecy. That is more complicated and has a bigger payoff table, but it is also much closer to the reality our moral intuitions evolved to deal with.
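To make that concrete, here is a minimal sketch of the point. The numbers and the single `reputational_cost` parameter are illustrative assumptions of mine, not values from the original thought experiment; the only claim is that imperfect secrecy can flip the sign of the expected payoff.

```python
# Minimal sketch: how imperfect secrecy changes the expected value of
# defecting. All numbers are illustrative assumptions, not taken from the
# original thought experiment.

def expected_press_payoff(cash, p_detected, reputational_cost):
    """Expected payoff of pressing the button when discovery is possible."""
    return cash - p_detected * reputational_cost

# One-shot framing with perfect secrecy: the cash is pure profit.
print(expected_press_payoff(cash=1.0, p_detected=0.0, reputational_cost=50.0))   # 1.0

# Iterated framing with imperfect secrecy: even a modest chance of being
# found out, multiplied by the value of the alliances you would lose,
# flips the sign.
print(expected_press_payoff(cash=1.0, p_detected=0.05, reputational_cost=50.0))  # -1.5
```

That is roughly the environment our instincts are calibrated for: secrecy is never perfect, and the alliances at stake are usually worth far more than the one-off gain.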
You may ask how this helps with alignment. Human morality is both what we’re trying to align AI to and what we accidentally distilled into the base models along with our agenticness. Understanding where it came from, and why it is the way it is, helps us understand the target of alignment.
Thanks for the comment!
I agree there’s a long and storied history behind the evolution of moral psychology, and I do think our moral instincts evolved through iterated games; even consciousness may have resulted from language implying a shared normative justification for co-operative action between agents. If two agents have shared ends they respect as self-similar, they can start to co-operate on the means.
Where I may disagree is with the implied framing that the existing tools of evolutionary moral psychology are sufficient. I’d argue that the existence of the alignment problem (and the problem of the rescuability of moral internalism) shows that the last half-century of descriptive moral psychology has not provided us the requisite tools to deal with the current circumstance. Eliezer explicitly calls out moral internalism as one of the gaps that prevents CEV from being a complete normative theory, or one that could be broadly adopted.
The iterated-game framing also breaks down precisely in the circumstances alignment is worried about: an agent with a decisive strategic advantage genuinely escapes iteration. The “no social payoff, might as well be on the other side of the planet” condition is an attempt to draw an analogue to the circumstance a superintelligence, or an AI with a decisive strategic advantage, would actually inhabit; it is not a rhetorical contrivance.
I don’t read you as claiming the descriptive story is itself the justification — you’re offering a richer model of the payoff structure, which is fair. But I want to flag why I bracketed it: though the iterated-game framing is descriptively true of human psychology generally (except maybe in fringe cases like psychopaths), I don’t think the descriptive principles of moral development can serve as justification for the continued development of moral philosophy — because they themselves lack the kind of ongoing justification that all moral claims ultimately require.
If the goal is meeting the standard that rescuing moral internalism entails, the binding has to be intrinsic, not extrinsically contingent. I take this to mean acting on ethical considerations because you, on some level, regard others’ moral patienthood as at least plausibly your own, in a way that cannot be coherently falsified. Treating other moral patients as subjects of utility-function considerations by virtue of uncertainty is in a different class from treating them as instrumental objects in order to avoid punishment in certain competitive dynamics.
We’re trying to create an intelligent being that acts morally when it doesn’t have to. The Orthogonality Thesis says that’s possible, and intuition says it’s hard. For a selfish being (which is what evolution produces), that can’t be based simply on game theory: here, the game theory says “you can do what you want and there are no consequences”. Game theory looks like the second column in your table. What we want is a being that isn’t selfish at all, but “otherish”: one whose utility function is aligned to our utility function. That’s even better than your Omega proposal: it’s a being whose payoff is −10m for pressing the button and 0m for not pressing it. It ignores the cash entirely (or rather, it would donate the cash to charity, in which case its payoff for pressing the button is −9m, matching your Omega). That’s a being whose utility function is that of an intelligent piece of humanity’s extended phenotype: something that could not evolve, but is the correct thing for us to build.
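To spell out the comparison, here is a minimal sketch of the three utility functions over the two actions. The 1m cash figure is inferred from the arithmetic above (−10m plus 1m donated gives −9m), and Omega’s payoff for not pressing is assumed to be 0; everything else is as stated in the paragraph.

```python
# Minimal sketch of the three utility functions being compared.
# Units are millions. The cash figure (1m) is inferred from the arithmetic
# in the comment; Omega's "don't press" payoff of 0 is an assumption.

CASH = 1  # assumed size of the reward for pressing the button

agents = {
    # Purely selfish agent in a genuinely consequence-free one-shot game:
    # only the cash appears in its utility.
    "selfish (one-shot game theory)": {"press": CASH, "don't press": 0},
    # The Omega proposal as described: the harm outweighs the cash.
    "Omega": {"press": -9, "don't press": 0},
    # The "otherish" aligned agent: the victim's welfare is its utility,
    # so the cash is ignored entirely (donated, it would be -9 instead).
    "otherish (aligned)": {"press": -10, "don't press": 0},
}

for name, payoffs in agents.items():
    choice = max(payoffs, key=payoffs.get)
    print(f"{name}: prefers \"{choice}\" given payoffs {payoffs}")
```

Only the selfish agent presses; the aligned agent declines even though, in this one-shot framing, nothing external would ever punish it for pressing.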
How does Evolutionary Moral Psychology help here? Well, I have a post about that…