geoffreymiller (Geoffrey Miller)
Psychology professor at University of New Mexico. BA Columbia, PhD Stanford. Works on evolutionary psychology, Effective Altruism, AI alignment, X risk. Worked on neural networks, genetic algorithms, evolutionary robotics, & autonomous agents back in the 90s.
AI alignment with humans… but with which humans?
gears of ascension—thanks for this comment, and for the IPAM video and Simons Institute suggestion.
You noted ‘fully solving AI safety cannot reduce to anything less than fully and completely solving conflict between all beings’. That’s exactly my worry.
As long as living beings are free to reproduce and compete for finite resources, evolution will churn along, in such a way that beings maintain various kinds of self-interest that inevitably lead to some degree of conflict. It seems impossible for ongoing evolution to result in a world where all beings have interests that are perfectly aligned with each other. You can’t get from natural selection to a single happy collective global super-organism (‘Gaia’, or whatever). And you can’t have full AI alignment with ‘humanity’ unless humanity becomes such a global super-organism with no internal conflicts.
Shiroe—my worry is that if we focus only on the ‘low-hanging fruit’ (e.g. AI aligned with individuals, or with all of humanity), we’ll overlook the really dangerous misalignments among human individuals, families, groups, companies, nation-states, religions, etc. that could be exacerbated by access to powerful AI systems.
Also, while it’s true that very few individuals or groups want to torture everyone to death, there are plenty of human groups (e.g. anti-natalists, eco-extremists, etc.) that advocate for human extinction, and that would consider ‘aligned AI’ to be any AI aligned with their pro-extinction mission.
Charlie—thanks for your comment.
I agree that, in principle, ‘The world could be better than it is today, in ways that would please almost everyone.’
However, in practice, it is proving ever more difficult to find any significant points of agreement (value alignment between people and groups) on any issue that becomes politically polarized. If we can’t even agree to allocate any significant gov’t research effort to promoting longevity and regenerative medicine, for example, why would everyone be happy about an AI that invents regenerative medicine? The billions of people caught up in the ‘pro-death trance’ (who believe that mortality is natural, good, and necessary) might consider that AI to be evil, dystopian, and ‘misaligned’ with their deepest values.
Increasingly, every human value is turning political, and every political value is turning partisan—often extremely so (especially in the US). I think that once we step outside our cultural bubbles, whatever form they take, we may be surprised and appalled at how little consensus there actually is among current humans about what a ‘good AI’ would value, what it would do, and whose interests it would serve.
Slider—if we’re inventing rocketships, we should very much be arguing about where they should go—especially if the majority of humanity would delight in seeing the rocketships rain down fire upon their enemies, rather than colonizing the galaxy.
Viliam—this failure mode for AI is horrifyingly plausible, and all too likely.
We already see a strong increase in wokeness among AI researchers, e.g. the panic about ‘algorithmic bias’. If that trend continues, then any AI that looks aligned with some group’s ‘politically incorrect values’ might be considered entirely ‘unaligned’, taboo, and dangerous.
Then the fight over what counts as ‘aligned with humanity’ will boil down to a political fight over what counts as ‘aligned with elite/dominant/prestigious group X’s preferred political philosophy’.
Netcentrica—thanks for this thoughtful comment.
I agree that the behavioral sciences, social sciences, and humanities need more serious (quantitative) research on values; there is some in fields such as political psychology, social psychology, cultural anthropology, comparative religion, etc.—but such research is often a bit pseudo-scientific and judgmental, biased by the personal/political views of the researchers.
However, all these fields seem to agree that there are often much deeper and more pervasive differences in values across people and groups than we typically realize, given our cultural bubbles, assortative socializing, and tendency to stick within our own tribe.
On the other hand, empirical research (e.g. in the evolutionary psychology of crime) suggests that in some domains, humans have a fairly strong consensus about certain values: most people in most cultures agree that murder is worse than assault, assault is worse than theft, and theft is worse than voluntary trade.
It’s an intriguing possibility that AIs might be able to ‘read off’ some general consensus values from the kinds of constitutions, laws, policies, and regulations that have been developed in complex societies over centuries of political debate and discussion. As a traditionalist who tends to respect most things that are ‘Lindy’ (things that have proven their value across many generations), I find this idea personally appealing. However, many AI researchers are under 40, rather anti-traditionalist, and unlikely to see historical traditions as good guides to current consensus values among humans. So I don’t know how much buy-in such a proposal would get—although I think it’s worth pursuing!
Put another way, any attempt to find consensus human values that have not already been explicitly incorporated into human political, cultural, economic, and family traditions should probably be treated with great suspicion—and may reflect some deep misalignment with most of humanity’s values.
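To make the ‘reading off’ idea a bit more concrete, here is a minimal sketch of the kind of aggregation I have in mind, using a tiny invented corpus of prohibition keywords rather than real statutes (all jurisdiction names and lists below are hypothetical placeholders, not a proposal for an actual pipeline):

```python
# Toy sketch: treating overlap across legal codes as a signal of consensus values.
from collections import Counter

# Hypothetical mini-corpora: one set of prohibited acts per jurisdiction.
statutes = {
    "jurisdiction_A": ["murder", "assault", "theft", "fraud", "blasphemy"],
    "jurisdiction_B": ["murder", "assault", "theft", "fraud"],
    "jurisdiction_C": ["murder", "assault", "theft", "public nudity"],
    "jurisdiction_D": ["murder", "assault", "theft", "usury"],
}

def consensus_prohibitions(statutes, threshold=0.75):
    """Return acts prohibited in at least `threshold` of all jurisdictions."""
    counts = Counter(act for acts in statutes.values() for act in set(acts))
    n = len(statutes)
    return {act: c / n for act, c in counts.items() if c / n >= threshold}

print(consensus_prohibitions(statutes))
# {'murder': 1.0, 'assault': 1.0, 'theft': 1.0}
# Near-universal prohibitions look like candidate consensus values; the
# jurisdiction-specific ones (blasphemy, usury) look like local or contested values.
```

A real system would obviously have to work from actual legal text rather than keyword lists, but the logic would be the same: look for norms that independently recur across many traditions.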
Hi Mitchell, what would be the best thing to read about MIRI’s latest thinking on this issue (what you call Plan B)?
Thanks Mitchell, that’s helpful.
I think we need a lot more serious thinking about Plan B strategies.
Koen—thanks for your comment. I agree that too many AI safety researchers seem to be ignoring all these socio-political issues relevant to alignment. My worry is that, given that many human values are tightly bound to political, religious, tribal, and cultural beliefs (or at least people think they are), ignoring those values means we won’t actually achieve ‘alignment’ even when we think we have. That could be much more disastrous than knowing we haven’t achieved alignment.
Koen—thanks for the link to ACM FAccT; looks interesting. I’ll see what their people have to say about the ‘aligned with whom’ question.
I agree that AI X-risk folks should probably pay more attention to the algorithmic fairness folks and self-driving car folks, in terms of seeing what general lessons can be learned about alignment from these specific domains.
The heterogeneity of human value types: Implications for AI alignment
Hi Charlie, thanks for your comment.
Just to clarify: I agree that there would be no point in an AI flagging different value types with a little metadata flag saying ‘religious taboo’ vs ‘food preference’ unless that metadata was computationally relevant to the kinds of learning, inference, generalization, and decision-making that the AI did. But my larger point was that humans treat these value types very differently in terms of decision-making (especially in social contexts), so true AI alignment would require that AI systems do too.
I wasn’t picturing human programmers hand-designing value representations for each value type, and I don’t claim to know how to take the heterogeneity of value types seriously when developing AI systems. I was just arguing that we need to solve that problem somehow, if we actually want AIs to act in accordance with the way that humans treat different types of values differently.
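To illustrate what I mean by ‘humans treat these value types very differently in terms of decision-making’, here is a minimal hypothetical sketch (not a proposal for an actual architecture) in which a value’s type, not just its weight, changes its computational role: a sacred taboo acts as a veto on options, while ordinary preferences trade off against each other. All the categories, weights, and menu items below are invented for illustration.

```python
# Toy sketch: a value's "type" changes how it enters decision-making.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Value:
    name: str
    kind: str      # e.g. "sacred_taboo" vs "preference" (hypothetical categories)
    weight: float  # only meaningful for trade-off-able preferences

def score_option(features: dict[str, float], values: list[Value]) -> Optional[float]:
    """Return a utility score, or None if the option violates any taboo value."""
    score = 0.0
    for v in values:
        x = features.get(v.name, 0.0)
        if v.kind == "sacred_taboo" and x > 0:
            return None            # taboo violations act as vetoes, not costs
        if v.kind == "preference":
            score += v.weight * x  # ordinary preferences trade off linearly
    return score

values = [
    Value("serves_pork", "sacred_taboo", 0.0),  # religious taboo: hard constraint
    Value("tastiness", "preference", 1.0),      # food preference: trade-off-able
    Value("cost", "preference", -0.5),
]

menu = {
    "pork_dumplings": {"serves_pork": 1.0, "tastiness": 0.9, "cost": 0.4},
    "veggie_noodles": {"serves_pork": 0.0, "tastiness": 0.7, "cost": 0.3},
}

for dish, feats in menu.items():
    print(dish, score_option(feats, values))
# pork_dumplings -> None (vetoed); veggie_noodles -> 0.55
```

The point isn’t this particular scheme; it’s that a single scalar-reward representation erases exactly the distinction (taboo vs preference) that humans rely on when they decide.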
Brain-over-body biases, and the embodied value problem in AI alignment
I know that AI alignment researchers don’t aim to hand-code human values into AI systems, and most aim to ‘implicitly describe human values’. Agreed.
The issue is, which human values are you trying to implicitly incorporate into the AI system?
I guess if you think that all human values are generic, computationally interchangeable, extractible (from humans) by the same methods, and can be incorporated into AIs using the same methods, then that could work, in principle. But if we don’t explicitly consider the whole range of human value types, how would we even test whether our generic methods could work for all relevant value types?
There’s a big difference between teleology (humans projecting purposiveness onto inanimate matter) and teleonomy (humans recognizing evolutionary adaptations that emerged to embody convergent instrumental goals that promote the final goals of survival and reproduction). The latter is what I’m talking about with this essay. The biological purposes are not just in the mind of the beholder.
The heritability of human values: A behavior genetic critique of Shard Theory
tailcalled—thanks for your comments.
As a preliminary reply: here are links to a few genome-wide association studies concerning human values and value-like traits of various sorts:
These are just a few illustrative examples. The rate of publication for GWAS research is very high, and is accelerated by the existence of large, fully genotyped samples such as UK BioBank; to run genome-wide association studies on particular human values, it’s often sufficient just to add a few new questions to the surveys that are regularly sent out to genotyped research participants.
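For readers unfamiliar with how GWAS output gets used downstream, here is a minimal sketch of how per-SNP effect sizes are typically aggregated into a polygenic score for an individual, as an effect-weighted sum of allele counts. All the SNP IDs, effect sizes, and genotypes below are made up purely for illustration.

```python
# Toy sketch: turning GWAS effect sizes into a polygenic score.
# A GWAS yields a per-SNP effect size (beta); a person's polygenic score is the
# sum of (effect-allele count x beta) across SNPs. All values below are invented.

effect_sizes = {"rs0000001": 0.02, "rs0000002": -0.01, "rs0000003": 0.005}  # hypothetical betas
person = {"rs0000001": 2, "rs0000002": 1, "rs0000003": 0}  # effect-allele counts (0, 1, or 2)

def polygenic_score(genotype: dict[str, int], betas: dict[str, float]) -> float:
    """Effect-weighted sum of allele counts over SNPs present in both inputs."""
    return sum(betas[snp] * count for snp, count in genotype.items() if snp in betas)

print(polygenic_score(person, effect_sizes))  # 2*0.02 + 1*(-0.01) + 0*0.005 = 0.03
```

Real polygenic scores use hundreds of thousands of SNPs and various statistical corrections (e.g. for linkage disequilibrium and population stratification), but the core arithmetic is this simple.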
tailcalled—I agree that we don’t yet have very good GWAS studies of political, religious, and moral ideology values; I was just illustrating that we already have ways of studying those (in principle), that large genotyped samples already exist in several countries, and that it’s just a matter of time before researchers start asking people in those samples about their more abstract kinds of values, and then publishing GWAS studies on those values.
So, I think we’re probably in agreement about that issue.
Regarding #23, I’m working on a friendly critique of shard theory, but it won’t be ready to share for a few weeks.
Preview: as currently framed, shard theory seems to involve a fairly fundamental misconception about the nature of genotype-phenotype mappings and the way that brain systems evolve, with the result that it radically under-estimates the diversity, complexity, and adaptiveness of our evolved motivations, preferences, and values.
In other words, it prematurely rejects the ‘massive modularity’ thesis of evolutionary psychology, and it largely ignores the last three decades of research on the adaptive design details of human emotions and motivations.
I think it’ll be important for AI alignment researchers (and AI systems themselves) to take evolutionary biology and evolutionary psychology more seriously in trying to understand and model human nature and human preferences. (But then, I’m possibly biased, since I’ve been doing machine learning research since the late 1980s, and evolutionary psychology research since the early 1990s.)