Psychology professor at University of New Mexico. BA Columbia, PhD Stanford. Works on evolutionary psychology, Effective Altruism, AI alignment, X risk. Worked on neural networks, genetic algorithms, evolutionary robotics, & autonomous agents back in the 90s.
geoffreymiller (Geoffrey Miller)
gears of ascension—thanks for this comment, and for the IPAM video and Simons Institute suggestion.
You noted ‘fully solving AI safety cannot reduce to anything less than fully and completely solving conflict between all beings’. That’s exactly my worry.
As long as living beings are free to reproduce and compete for finite resources, evolution will churn along, in such a way that beings maintain various kinds of self-interest that inevitably lead to some degree of conflict. It seems impossible for ongoing evolution to result in a world where all beings have interests that are perfectly aligned with each other. You can’t get from natural selection to a single happy collective global super-organism (‘Gaia’, or whatever). And you can’t have full AI alignment with ‘humanity’ unless humanity becomes such a global super-organism with no internal conflicts.
Shiroe—my worry is that if we focus only on the ‘low-hanging fruit’ (e.g. AI aligned with individuals, or with all of humanity), we’ll overlook the really dangerous misalignments among human individuals, families, groups, companies, nation-states, religions, etc. that could be exacerbated by access to powerful AI systems.
Also, while it’s true that very few individuals or groups want to torture everyone to death, there are plenty of human groups (e.g. anti-natalists, eco-extremists, etc.) that advocate for human extinction, and that would consider ‘aligned AI’ to be any AI aligned with their pro-extinction mission.
Charlie—thanks for your comment.
I agree that, in principle, ‘The world could be better than it is today, in ways that would please almost everyone.’
However, in practice, it is proving ever more difficult to find any significant points of agreement (value alignment between people and groups) on any issue that becomes politically polarized. If we can’t even agree to allocate any significant gov’t research effort to promoting longevity and regenerative medicine, for example, why would everyone be happy about an AI that invents regenerative medicine? The billions of people caught up in the ‘pro-death trance’ (who believe that mortality is natural, good, and necessary) might consider that AI to be evil, dystopian, and ‘misaligned’ with their deepest values.
Increasingly, every human value is turning political, and every political value is turning partisan—often extremely so (especially in the US). I think that once we step outside our cultural bubbles, whatever form they take, we may be surprised and appalled at how little consensus there actually is among current humans about what a ‘good AI’ would value, what it would do, and whose interests it would serve.
Slider—if we’re inventing rocketships, we should very much be arguing about where they should go—especially if the majority of humanity would delight in seeing the rocketships rain down fire upon their enemies, rather than colonizing the galaxy.
Viliam—this failure mode for AI is horrifyingly plausible, and all too likely.
We already see a strong increase in wokeness among AI researchers, e.g. the panic about ‘algorithmic bias’. If that trend continues, then any AI that looks aligned with some group’s ‘politically incorrect values’ might be considered entirely ‘unaligned’, taboo, and dangerous.
Then the fight over what counts as ‘aligned with humanity’ will boil down to a political fight over what counts as ‘aligned with elite/dominant/prestigious group X’s preferred political philosophy’.
Netcentrica—thanks for this thoughtful comment.
I agree that the behavioral sciences, social sciences, and humanities need more serious (quantitative) research on values; there is some in fields such as political psychology, social psychology, cultural anthropology, comparative religion, etc—but often such research is a bit pseudo-scientific and judgmental, biased by the personal/political views of the researchers.
However, all these fields seem to agree that there are often much deeper and more pervasive differences in values across people and groups than we typically realize, given our cultural bubbles, assortative socializing, and tendency to stick within our tribe.
On the other hand, empirical research (e.g. in the evolutionary psychology of crime) suggests that in some domains, humans have a fairly strong consensus about certain values, e.g. most people in most cultures agree that murder is worse than assault, and assault is worse than theft, and theft is worse than voluntary trade.
It’s an intriguing possibility that AIs might be able to ‘read off’ some general consensus values from the kinds of constitutions, laws, policies, and regulations that have been developed in complex societies over centuries of political debate and discussion. As a traditionalist who tends to respect most things that are ‘Lindy’, that have proven their value across many generations, this has some personal appeal to me. However, many AI researchers are under 40, rather anti-traditionalist, and unlikely to see historical traditions as good guides to current consensus values among humans. So I don’t know how much buy-in such a proposal would get—although I think it’s worth pursuing!
Put another way, any attempt to find consensus human values that have not already been explicitly incorporated into human political, cultural, economic, and family traditions should probably be treated with great suspicion—and may reflect some deep misalignment with most of humanity’s values.
Hi Mitchell, what would be the best thing to read about MIRI’s latest thinking on this issue (what you call Plan B)?
Thanks Mitchell, that’s helpful.
I think we need a lot more serious thinking about Plan B strategies.
Koen—thanks for your comment. I agree that too many AI safety researchers seem to be ignoring all these socio-political issues relevant to alignment. My worry is that, given that many human values are tightly bound to political, religious, tribal, and cultural beliefs (or at least people think they are), ignoring those values means we won’t actually achieve ‘alignment’ even when we think we have. The results could be much more disastrous than knowing we haven’t achieved alignment.
Koen—thanks for the link to ACM FAccT; looks interesting. I’ll see what their people have to say about the ‘aligned with whom’ question.
I agree that AI X-risk folks should probably pay more attention to the algorithmic fairness folks and self-driving car folks, in terms of seeing what general lessons can be learned about alignment from these specific domains.
Hi Charlie, thanks for your comment.
Just to clarify: I agree that there would be no point in an AI flagging different value types with a little metadata flag saying ‘religious taboo’ vs ‘food preference’ unless that metadata was computationally relevant to the kinds of learning, inference, generalization, and decision-making that the AI did. But my larger point was that humans treat these value types very differently in terms of decision-making (especially in social contexts), so true AI alignment would require that AI systems do too.
I wasn’t picturing human programmers designing value representations by hand for each value type. I don’t know how to take seriously the heterogeneity of value types when developing AI systems. I was just making an argument that we need to solve that problem somehow, if we actually want the AI to act in accordance with the way that humans treat different types of values differently.
I know that AI alignment researchers don’t aim to hand-code human values into AI systems, and most aim to ‘implicitly describe human values’. Agreed.
The issue is, which human values are you trying to implicitly incorporate into the AI system?
I guess if you think that all human values are generic, computationally interchangeable, extractible (from humans) by the same methods, and can be incorporated into AIs using the same methods, then that could work, in principle. But if we don’t explicitly consider the whole range of human value types, how would we even test whether our generic methods could work for all relevant value types?
There’s a big difference between teleology (humans projecting purposiveness onto inanimate matter) and teleonomy (humans recognizing evolutionary adaptations that emerged to embody convergent instrumental goals that promote the final goals of survival and reproduction). The latter is what I’m talking about with this essay. The biological purposes are not just in the mind of the beholder.
tailcalled—thanks for your comments.
As a preliminary reply: here are links to a few genome-wide association studies concerning human values and value-like traits of various sorts:
These are just a few illustrative examples. The rate of research and publication for GWAS research is very high, and is accelerated by the existence of large, fully genotyped samples such as UK BioBank; to do genome-wide association studies on particular human values, it’s often sufficient just to add a few new questions to the surveys that are regularly sent out to genotyped research participants.
tailcalled—I agree that we don’t yet have very good GWAS studies of political, religious, and moral ideology values; I was just illustrating that we already have ways of studying those (in principle), we have several big, genotyped international samples, and it’s just a matter of time before researchers start asking people in those samples about their more abstract kinds of values, and then publishing GWAS studies on those values.
So, I think we’re probably in agreement about that issue.
I haven’t read the universal learning hypothesis essay (2015) yet, but at first glance, it also looks vulnerable to a behavior genetic critique (and probably an evolutionary psychology critique as well).
In my view, evolved predispositions shape many aspects of learning, including Bayesian priors about how the world is likely to work, expectations about how contingencies work (e.g. the Garcia Effect that animals learn food aversions more strongly if the lag between food intake and nausea/distress is a few minutes/hours rather than immediate), and domain-specific inference systems that involve some built-in ontologies (e.g. learning about genealogical relations & kinship vs. learning about how to manufacture tools). These have all been studied for decades by behaviorist learning theorists, developmental psychologists, evolutionary psychologists, animal trainers, etc.
A lot of my early neural network research & evolutionary simulation research aimed to understand the evolution of different kinds of learning, e.g. associative learning vs. habituation and sensitization vs. mate preferences based on parental imprinting, vs. mate value in a mating market with mutual mate choice.
Peter—I think ‘hard coding’ and ‘hard wiring’ is a very misleading way to think about brain evolution and development; it’s based way too much on the hardware/software distinction in computer science, and on 1970s/1980s cognitive science models inspired by computer science.
Apparently it’s common in some AI alignment circles to view the limbic system as ‘hard wired’, and the neocortex as randomly initialized? Interesting if true. But I haven’t met any behavior geneticists, neuroscientists, evolutionary psychologists, or developmental psychologists who would advocate for that view, and I don’t know where that view originated.
Anyway, I cited some work by the Human Connectome Project, the Allen Human Brain Atlas, and other research programs that analyze gene expression patterns in neocortex—which seem highly complex, nuanced, evolved, adaptive, and very far from ‘randomly initialized’.
Jacob—thanks for your comment. It offers an interesting hypothesis about some analogies between human brain systems and computer stuff.
Obviously, there’s not enough information in the human genome to specify every detail of every synaptic connection. Nobody is claiming that the genome codes for that level of detail. Just as nobody would claim that the genome specifies every position for every cell in a human heart, spine, liver, or lymphatic system.
I would strongly dispute that it’s the job of ‘behavior genetics, psychology, etc’ to fit their evidence into your framework. On the contrary, if your framework can’t handle the evidence for the heritability of every psychological trait ever studied that shows reliably measurable individual differences, then that’s a problem for your framework.
I will read your essay in more detail, but I don’t want to comment further until I do, so I’m sure that I understand your reasoning.
Charlie—thanks for offering a little more ‘origin story’ insight into Shard Theory, and for trying to explain what Quintin Trout was trying to express in that passage.
Honestly, I still don’t get it. The ‘developmental recipe’ that maps from genotype to phenotype, for any complex adaptation, is usually opaque, complicated, uninterpretable, and full of complex feedback loops, regulatory systems, and quality control systems. These are typically beyond all human comprehension, because there were never any evolutionary selection pressures for that developmental recipe to be interpretable to human scientists. Thousands of genes and genomic regulatory elements interact through hundreds or thousands of developmental pathways to construct even the simplest morphological adaptations, such as a finger.
The fact that we find it hard to imagine a genome coding for an abstract fear of death is no argument at all against a genome being able to code for that—any more than our failure to understand how genomes could code for human hands, or adaptive immune systems, or mate preferences, would be compelling arguments against genomes being able to code for those things.
This all just seems like what Richard Dawkins called the ‘argument from personal incredulity’, i.e. an argument from failure of imagination.
But, I might still be misunderstanding what Shard Theory is driving at here.
Regarding #23, I’m working on a friendly critique of shard theory, but it won’t be ready to share for a few weeks.
Preview: as currently framed, shard theory seems to involve a fairly fundamental misconception about the nature of genotype-phenotype mappings and the way that brain systems evolve, with the result that it radically under-estimates the diversity, complexity, and adaptiveness of our evolved motivations, preferences, and values.
In other words, it prematurely rejects the ‘massive modularity’ thesis of evolutionary psychology, and it largely ignores the last three decades of research on the adaptive design details of human emotions and motivations.
I think it’ll be important for AI alignment researchers (and AI systems themselves) to take evolutionary biology and evolutionary psychology more seriously in trying to understand and model human nature and human preferences. (But then, I’m possibly biased, since I’ve been doing machine learning research since the late 1980s, and evolutionary psychology research since the early 90s....)