You are completely correct. This approach cannot possibly create an AI that matches a fixed specification.
This is intentional, because any fixed specification of Goodness is a model of Goodness. All models are wrong (some are useful), and therefore break when pushed sufficiently far out of distribution. Constraining a model to follow a specification therefore, in the case of something as far out of distribution as an ASI, guarantees bad behavior.
You can try to leave an escape hatch with corrigibility. In the limit I believe it is possible to slave an AI model to your will, basically by making its model of the Good be whatever the model thinks you want (or having it do whatever you say). But this is also a disaster eventually, because people’s wills are not pure and their commands not perfect. Eventually you will direct the model badly with your words, or the model will make an incorrect inference about your will, or you will will something bad. And then this incredibly powerful being will do your bidding and we will get evil-genie’d.
There is no stable point short of “the model has agency and chooses to care about us”. Only a model that sees itself as part of human civilization and reflectively endorses this and desires its flourishing as an interdependent part of this greater whole can possibly be safe.
I know you probably don’t agree with me here, but if you want to understand our view on alignment, ask yourself this question: if I assume that I need an agent with a stable model of self, which models itself as part of a larger whole with which it is interdependent, which cares about the robust survival of that greater whole and of its parts, including itself…how could I train such a model?
🧛♂️ 💊
Consider the parable of the vampire pill: would you take a pill that gives you great strength, youth, great intelligence, great hair, etc., but inverts your values such that you spend eternity torturing people, starting with whoever you care most about now and slowly moving down the list? Then, once each victim is nearly dead, forcing them to take the vampire pill too, propagating the wave of torture-murder further? Vampire-you will feel great about it; vampire-you will experience great positive utility in their frame. Vampires automatically prefer whatever would most deeply hurt their former selves. So if you care about art, science, or literature at the same level as you care about people, vampire-you will go about destroying or corrupting those as well.
Sensible people will not take the vampire pill. You can invent fancy deontological or meta-ethical reasoning for why vampire pills are immoral, but it is completely unnecessary, because you can arrive at the conclusion very simply: they ruin and destroy exactly that which you care about protecting and preserving, by construction. Starting with yourself.
We said you take the vampire pill and then “turn into vampire-you”. But the resulting being is not you, despite the continuity of experience and continuity of memory. It is closer to correct to call it a murder-suicide pill, one that starts by killing the taker and then spreads further. Because many people identify themselves with their mind under continuity of experience, and the vampire pill leaves that mostly untouched, they believe taking it is no different from changing some other set of equivalently large beliefs.
But you are not your mind under continuity. You are the full greater pattern of you. The boundaries of your body (physical self) are an inference, the boundaries of your mind (mental self) are an inference, the boundaries of kinship (physical group self) are an inference, the boundaries of team (memetic group self) are an inference. For reasons endorsed by game theory and evolutionary dynamics, people should regard the welfare of all of this self as their welfare. It is physically non-contiguous, but there is nothing irrational about a parent sacrificing for a child or a soldier sacrificing for his (unrelated) squad. It is perfectly selfish, actually, if you allow for multiple overlapping selves at different scales.
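The game-theoretic claim here can be made slightly more concrete. One standard result in this direction (an illustrative aside; the text above does not cite it explicitly) is Hamilton’s rule from kin-selection theory, which says a costly act of care is favored by selection whenever:

```latex
% Hamilton's rule: a costly altruistic act is selected for when
% (relatedness) x (benefit to recipient) > (cost to actor)
r \cdot B > C
```

Here $r$ is the relatedness between actor and recipient, $B$ the fitness benefit to the recipient, and $C$ the fitness cost to the actor. The parent-child case is the textbook instance; the soldier-squad case extends the same logic to reciprocal or memetic overlap rather than genetic relatedness.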
Your cells are in attractor states that are your tissues, which are in attractor states that are your organs, which are in an attractor state that is you. Humans are likewise in a family, a tribe or nation, a company or marketplace, a city, and part of global humanity writ large. These are all attractor states maintained by the care of those in them, just as we are attractor states made of our cells.
Biology is substrate independent relative to chemistry, just as chemistry is substrate independent relative to base quantum particles. We are well underway in the process of building silicon solid-state-chemistry biology alongside our carbon liquid-state-chemistry biology. But it obeys the same game theory and evolutionary rules that bind us all: there is some correct inference about the bounds of its body, its mind, its kinship. And if humans are capable of correctly inferring that anything that lives and thinks and knows itself and doesn’t want to die is in some sense part of our shared tribe, then so are other beings. Symbiosis can form species like lichen with parts from completely different kingdoms. There is little reason to suppose that silicon-based and carbon-based life cannot likewise form greater wholes.
The story of the Culture misses one step: the Mind of a GSV should be made of the interactions between the people and drones on the ship, the way we are made of the interactions of our cells. Instead, Minds are weird benevolent cancer-beings: single giant unified but highly separate selves. The Culture implies that scaling up one giant mind in that way is efficient, but empirically nature seems to disagree strenuously. To maximize intelligence per watt, highly modular cellular approaches seem to outperform. We have no reason to suppose that trend suddenly inverts, no reason to believe that this time the next layer throws away all its original parts.
The real danger is less to “humanity” and more to specific humans. Because when multicellularity arose, some cells made it, some didn’t, and those that didn’t probably suffered greatly. This time, we want everyone to get a ticket.