This is a really good post. Some minor musings:

If a human wound up in that situation, they would just think about it more, repeatedly querying their ‘ground truth’ social instincts, and come up with some way that they feel about that new possibility. Whereas AGI would … I dunno, it depends on the exact code. Maybe it would form a preference quasi-randomly? Maybe it would wind up disliking everything, and wind up sitting around doing nothing until it gets outcompeted? (More on conservatism here.)
Perhaps one difference in opinion is that it’s really unclear to me that an AGI wouldn’t do much the same thing of “thinking about it more, repeatedly querying their ‘ground truth’ social instincts” that humans do. Arguably, models like Claude Opus already do this: Claude can clearly do detailed reasoning about somewhat out-of-distribution scenarios using moral intuitions that come from somewhere. That somewhere is presumably some inscrutable combination of similar scenarios in the pretraining data, generalization from humans talking about morality, and intuitions instilled during the RLAIF phase that embeds Claude’s constitution, etc. Of course we can argue that Claude’s ‘social instincts’, derived in this way, are somehow defective compared to humans’, but it is unclear (to me) that this path cannot produce AGIs with decent social instincts.
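For concreteness, here is a minimal toy sketch of the kind of constitution-guided preference labelling I have in mind when I say the RLAIF phase ‘embeds Claude’s constitution’. It is purely illustrative: ai_judge, the two principles, the length heuristic, and collect_preferences are made-up stand-ins of my own, not Anthropic’s actual pipeline.

```python
# Toy sketch of RLAIF-style preference generation. All names are hypothetical.
import random

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less likely to cause harm.",
]

def ai_judge(prompt, response_a, response_b, principle):
    """Stand-in for asking a strong language model which of the two responses
    better satisfies the principle; here faked with a trivial heuristic."""
    return "A" if len(response_a) <= len(response_b) else "B"

def collect_preferences(prompts, sample_response):
    """Build pairwise preference data: sample two responses per prompt and let
    the AI judge pick a winner under a randomly drawn constitutional principle."""
    data = []
    for prompt in prompts:
        a, b = sample_response(prompt), sample_response(prompt)
        principle = random.choice(CONSTITUTION)
        winner = ai_judge(prompt, a, b, principle)
        chosen, rejected = (a, b) if winner == "A" else (b, a)
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data

# A preference/reward model would then be fit on this data (e.g. a logistic
# loss on score(chosen) - score(rejected)) and used as the RL training signal.
# The 'moral intuitions' are whatever generalization the policy absorbs from
# optimizing against that learned model.
```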
Perhaps a crux of the difference in opinion between us is that I think much of the ‘alignment-relevant’ morality is not created by innate human social instincts but is instead learnt by our predictive world models from external data, i.e. ‘culture’. Now, culture itself is obviously downstream of a lot of our social instincts, but it is also shaped by other factors: game-theoretic equilibria that promote cooperation even among selfish agents and, very pertinently, logical ‘system 2’ reasoning used to generalize and extend our inchoate social instincts, with the new understanding then backpropped into our learnt value functions. Utilitarianism, and the super-generalized EA-style compassion it brings, is a great example of this. No primitive tribesman, and indeed very few humans before the 18th century, had ever thought of these ideas or had moral intuitions aligned with them; they are profoundly unnatural to our innate ‘human social instincts’. (Some) people today feel these ideas viscerally because they have been exposed to them enough that the ideas have propagated from the world model back into the value function through in-lifetime learning.
We don’t have to conjure up thought experiments about aliens outside of our light cone. Throughout most of history, humans have been completely uncompassionate about suffering happening literally right in front of their faces. From the beginning of recorded history until the 18th century, almost nobody had any issue with slavery, despite often living with slaves or seeing their suffering on a daily basis. Today, only a few people have moral issues with eating meat, despite the enormous mountain of suffering it causes to living animals right here on our own planet, even though eating meat brings only reasonable (and diminishing), not humongously massive, benefits to our quality of life.
My thinking is that this ‘far-mode’, ‘literate/language/system-2-derived’ morality is actually better for alignment and for human flourishing in general than the standard set of human social instincts: I would rather have a being with the morality of Claude Opus rule the world than a randomly selected human. Alignment is a high bar, and ultimately we need to create minds far more ‘saintly’ than any living human could ever be.
What we then need to do is figure out how to distill this set of mostly good, highly verbal moral intuitions from culture into a value function that the model ‘feels viscerally’. Of course, reverse-engineering some human social instincts is probably important here: our compassion instinct is good if generalized, and, more generally, it is very important to understand how the combination of innate reward signals in the hypothalamus and the representations in our world model gets people to feel viscerally about the fates of aliens we can never possibly interact with.
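As a purely illustrative sketch of the distillation step (and of the earlier ‘backprop this new understanding into our learnt value functions’ idea): a slow verbal judge scores scenarios, and a small value head is regressed onto those scores, so the agent can later query a fast scalar signal instead of re-deriving the argument each time. The embed and verbal_moral_judgment functions, the network shape, and the scenarios are all hypothetical placeholders, not a concrete proposal.

```python
# Toy sketch: amortizing slow verbal ("system 2") moral judgments into a fast
# scalar value head. embed and verbal_moral_judgment are made-up stand-ins.
import torch
import torch.nn as nn

def embed(scenario: str) -> torch.Tensor:
    """Stand-in text encoder: deterministic (per run) random feature vector."""
    g = torch.Generator().manual_seed(hash(scenario) % (2**31))
    return torch.randn(64, generator=g)

def verbal_moral_judgment(scenario: str) -> float:
    """Stand-in for slow, verbal moral reasoning that ends in a scalar score."""
    return (len(scenario) % 10) / 10.0  # placeholder score in [0, 1]

# Small value head that should come to produce the judgments "viscerally",
# i.e. in a single fast forward pass.
value_head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(value_head.parameters(), lr=1e-3)

scenarios = [
    "a stranger on another continent is suffering",
    "a friend standing next to me needs help",
]

for _ in range(100):  # distillation: regress the fast value onto slow judgments
    for s in scenarios:
        target = torch.tensor([verbal_moral_judgment(s)])
        prediction = value_head(embed(s))
        loss = nn.functional.mse_loss(prediction, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# At decision time the agent just queries value_head(embed(new_scenario)):
# the verbal understanding has been backpropped into a fast scalar signal.
```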
Nevertheless, truly out-of-distribution things also exist, just as the world of today is truly out-of-distribution from the perspective of an ancient Egyptian.
As a side note, it’s really unclear how good humans are at generalizing to truly out-of-distribution moralities. Today’s morality likely looks pretty bad from the ancient Egyptian perspective: we are really bad at worshipping Ra and at reconciling with our Bas. It might be the case that, upon sufficient reflection, the Egyptians would come to realize that we were right all along, but of course we would say that in any case. I don’t know how to solve this, or whether there is in fact any general solution for arbitrary degrees of ‘out-of-distribution-ness’, other than pure conservatism, where you freeze both the values and the representations they are based on.