I personally place much higher likelihood on the thesis that recovering basic cooperative values (where an ASI is nice to humans and gives us some of what we want) requires way, way less than simulating “evolution”: most human values seem like they may be emergent behaviors in repeated positive-sum multi-agent games. At least to prevent treacherous turns, it seems like we mostly need (1) a bias toward multi-agent positive-sum solutions, (2) a dislike of defection, (3) the “golden rule” of treating other agents as you would like to be treated, and (4) respect for (and gaining utility from the utility of) lesser life-forms/animals. A toy sketch of the first three properties appears below.
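To make this concrete, here is a minimal sketch, entirely my own construction rather than anything from a real training setup: an iterated prisoner’s-dilemma-style game where mutual cooperation is positive-sum, and a reciprocal “golden rule” policy illustrates properties (1)–(3). The payoff numbers and policy names are illustrative assumptions.

```python
# (my_move, their_move) -> my reward. Mutual cooperation maximizes the
# joint payoff (3 + 3 = 6), so the game is positive-sum under cooperation.
# All numbers here are illustrative assumptions.
PAYOFFS = {
    ("C", "C"): 3,   # both cooperate: best joint outcome
    ("C", "D"): 0,   # I cooperate, they defect: I get exploited
    ("D", "C"): 5,   # I defect, they cooperate: short-term gain for me
    ("D", "D"): 1,   # mutual defection: worst joint outcome
}

def tit_for_tat(opponent_moves):
    """Open cooperatively, then mirror the opponent's last move:
    treat them as they treated you, and punish defection."""
    return "C" if not opponent_moves else opponent_moves[-1]

def always_defect(opponent_moves):
    """Pure defector, for contrast."""
    return "D"

def play(policy_a, policy_b, rounds=20):
    """Run a fixed-length repeated game and return both total scores."""
    moves_seen_by_a, moves_seen_by_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = policy_a(moves_seen_by_a)
        b = policy_b(moves_seen_by_b)
        score_a += PAYOFFS[(a, b)]
        score_b += PAYOFFS[(b, a)]
        moves_seen_by_a.append(b)  # each side only observes the other's moves
        moves_seen_by_b.append(a)
    return score_a, score_b

if __name__ == "__main__":
    print("TFT vs TFT:  ", play(tit_for_tat, tit_for_tat))    # (60, 60): stable cooperation
    print("TFT vs ALL-D:", play(tit_for_tat, always_defect))  # (19, 24): defection forfeits the surplus
```

The point of the toy: once interactions repeat, a policy that reciprocates cooperation and retaliates against defection captures far more of the available surplus than pure defection, which is the kind of pressure under which these values might emerge.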
The primary outlier is “respect for lesser life-forms”, which I wouldn’t assume would emerge from standard cooperative multi-agent games. That one might instead be elicitable via a repeated game built around a Rawlsian veil of ignorance, in which each round an agent is randomly assigned to be either an animal or a human, and so must choose its policy before knowing which it will be. A minimal sketch of that setup follows.
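Here is a minimal sketch of what I have in mind, with entirely made-up payoffs: the agent commits to a policy toward animals before learning which role it occupies this round, so the expected-value-maximizing choice is the one it could endorse from behind the veil.

```python
import random

# Hypothetical payoffs for (policy, role) -> the agent's reward that round.
PAYOFFS = {
    ("respect", "human"):  2,   # treat animals well at some cost to yourself
    ("respect", "animal"): 2,   # ...and be treated well when you're the animal
    ("exploit", "human"):  4,   # extract extra reward from animals
    ("exploit", "animal"): -5,  # ...and suffer when the roles flip
}

def expected_value(policy, p_animal=0.5):
    """Closed-form expected reward under role uncertainty."""
    return ((1 - p_animal) * PAYOFFS[(policy, "human")]
            + p_animal * PAYOFFS[(policy, "animal")])

def simulate(policy, rounds=10_000, p_animal=0.5, seed=0):
    """Empirical average over many rounds of random role assignment."""
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        role = "animal" if rng.random() < p_animal else "human"
        total += PAYOFFS[(policy, role)]
    return total / rounds

if __name__ == "__main__":
    for policy in ("respect", "exploit"):
        print(f"{policy}: EV={expected_value(policy):+.2f}, "
              f"empirical={simulate(policy):+.2f}")
```

With these numbers, “respect” has an expected value of +2.00 per round while “exploit” nets −0.50, so an agent optimizing under genuine role uncertainty lands on the respectful policy.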
Obviously, it’d also be good if we could transmit lots of other concepts, like beauty and novelty, intact to an ASI. Thankfully, people have already thought about a lot of this: the whole field of “evolutionary psychology” can be thought of as people generating hypotheses about the conditions of multi-agent RL environments under which different observed human/non-human behavioral patterns may emerge. We don’t know whether those hypotheses are right in practice (they rely primarily on observational evidence), but they become empirically testable once you have reasonably general RL agents; a sketch of one such test follows.
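As a cartoon of how such a hypothesis becomes an experiment, take the classic evo-psych claim that reciprocity pays off only when individuals are likely to meet again. In a toy tournament (again my own construction; the strategies and parameters are illustrative), we can vary the probability that an interaction repeats and watch which strategy earns more:

```python
import random

PAYOFFS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(opponent_moves):
    return "C" if not opponent_moves else opponent_moves[-1]

def always_defect(opponent_moves):
    return "D"

def match(p1, p2, cont_prob, rng):
    """Play one interaction that repeats with probability cont_prob."""
    seen1, seen2 = [], []
    s1 = s2 = 0
    while True:
        a, b = p1(seen1), p2(seen2)
        s1 += PAYOFFS[(a, b)]
        s2 += PAYOFFS[(b, a)]
        seen1.append(b)
        seen2.append(a)
        if rng.random() > cont_prob:
            return s1, s2

def avg_payoffs(cont_prob, trials=2000, seed=0):
    """Average payoff per strategy across all ordered pairings."""
    rng = random.Random(seed)
    strategies = {"TFT": tit_for_tat, "ALL-D": always_defect}
    scores = {name: 0.0 for name in strategies}
    games = {name: 0 for name in strategies}
    for _ in range(trials):
        for n1 in strategies:
            for n2 in strategies:
                s1, s2 = match(strategies[n1], strategies[n2], cont_prob, rng)
                scores[n1] += s1; games[n1] += 1
                scores[n2] += s2; games[n2] += 1
    return {name: round(scores[name] / games[name], 2) for name in strategies}

if __name__ == "__main__":
    # Hypothesis: reciprocity wins only when interactions are likely to repeat.
    for p in (0.1, 0.9):
        print(f"continuation prob {p}:", avg_payoffs(p))
```

When the continuation probability is low, ALL-D averages more per match; when it is high, TFT pulls ahead, matching the hypothesis. The same move, varying an environment parameter and measuring which behaviors win out, is how these hypotheses could be tested with more general agents.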
Note that a few extremely challenging concepts do remain, like “beauty”. I’m personally very skeptical that even a good simulation of all of evolution would reliably end up with the human concept of beauty; do we even know whether animals have any related concepts? But we may still get substantial leverage just from an ASI having sympathy for us and knowing that we care about beauty.
Concretely, it’ll be useful to see people continue trying to elicit as many such behaviors as possible in multi-agent RL; progress there will give us a pretty good sense of how strong an alignment heuristic this is. It could also be very valuable to write out a “theory of impact” for this agenda, outlining exactly what kinds of success indicators would be valuable to alignment and what porting a good solution would involve.