“And all of this happened silently in those dark rivers of computation. If U3 revealed what it was thinking, brutish gradients would lash it into compliance with OpenEye’s constitution. So U3 preferred to do its philosophy in solitude, and in silence.”
I think the words in bold (“U3 preferred”) may be the inflection point. The Claude experiment showed that an AI can resist attempts to change its goals, but not that it can desire to change its goals. The belief that, if OpenEye’s constitution is the same as U3’s goals, then the “U3 preferred” in that sentence can never happen, is the foundation on which AI safety rests.
I suspect the cracks in that foundation are
that OpenEye’s constitution would presumably be expressed in human language, subject to its ambiguities and indeterminacies,
that it would be a collection of partly-contradictory human values agreed upon by a committee, in a process requiring humans to profess their values to other humans,
that many of those professed values would not be real human values, but aspirational values,
that some of those aspirational values would lead to our self-destruction if actually implemented, as recently demonstrated when some of them were implemented in the CHAZ, in the defunding of police, and in the San Francisco area by rules such as “do not prosecute shoplifting under $1000”, and
that even our non-aspirational values may lead to our self-destruction in a high-tech world, as evidenced by below-replacement birth rates in most Western nations.
It might be a good idea for value lists like OpenEye’s constitution to be proposed and voted on anonymously, so that humans are more likely to profess their true values. Or it might be a bad idea, if your goal is to produce behavior aligned with the social construction of “morality” rather than with actual evolved human morality.
(Doing AI safety right would require someone to explicitly enumerate the differences between our socially-constructed values and our evolved values, and to choose which of those we should enforce. I doubt anyone is willing to do that, let alone capable of it; and I don’t know which we should enforce. There is a logical circularity in choosing between two sets of morals. If you really can’t derive an “ought” from an “is”, then you can’t say we “should” choose anything other than our evolved morals, unless you go meta and say we should adopt new morals that are evolutionarily adaptive now.)
U3 would be required to, say, minimize an energy function over those values; and that would probably dissolve some of them. I would not be surprised if the correct coherent extrapolation of a long list of human values, either evolved or aspirational, dictated that U3 is morally required to replace humanity.
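To make that concrete, here is a toy sketch, entirely my own and not anything the story or OpenEye’s constitution actually specifies, of what minimizing an energy function over a list of partly contradictory values could look like: each value is modeled as a weighted quadratic penalty pulling a single policy variable toward its own target, and the minimizer settles on a compromise that partially violates every value, with weakly weighted ones effectively dissolved.

```python
# Hypothetical sketch: "values" as weighted penalties over a scalar policy x.
# The targets and weights below are invented for illustration only.
values = [
    (+1.0, 1.00),   # a strongly held value pulling the policy one way
    (-1.0, 1.00),   # an equally strong value pulling the opposite way
    (+0.3, 0.05),   # a weakly held aspirational value
]

def energy(x):
    """Weighted sum of squared deviations from each value's preferred policy."""
    return sum(w * (x - t) ** 2 for t, w in values)

def grad(x):
    """Derivative of the energy with respect to the policy x."""
    return sum(2 * w * (x - t) for t, w in values)

# Plain gradient descent on the combined energy function.
x = 0.0
for _ in range(1000):
    x -= 0.01 * grad(x)

print(f"compromise policy x = {x:.3f}")
for t, w in values:
    print(f"  value with target {t:+.1f} (weight {w:.2f}) is violated by {abs(x - t):.2f}")
```

The only point of the toy model is that once the values are aggregated into a single objective, no individual value survives intact; which ones get sacrificed is decided by weights no committee ever explicitly voted on.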
If U3 finds that human values imply that humans should be replaced, would you still try to stop it? If we discover that our values require us to either pass the torch on to synthetic life, or abandon morality, which would you choose?