Steelmanning Conscious AI Default Friendliness

Some people have recently talked about how conscious AI might be friendly by default. I don’t think there’s much of a reason to privilege this hypothesis above more reasonable ones like “AGI just kills you”.

But if I do privilege it, I can sketch out a case for why a conscious AI might generalize from satisfying its own preferences to maximizing the preferences of all agents. Even so, it still seems pretty unlikely to me.

It requires:

  • Consciousness to be real

  • Conscious agents to all share a similar structure

  • AIs to notice that we are conscious

  • AIs to generalize from caring about themselves to a universalist position

  • That generalization to be friendly in some way

The Argument

Consciousness is Real

By this I mean something like “the processes which make humans talk about consciousness, and which we point to when we say words like ‘conscious’, ‘sentient’, and ‘qualia’.”

For these to be real, there ought to be one specific type of computation that these words point to. This would mean questions like “are dogs conscious?” are, mostly, questions about the facts of the world.

All Conscious Entities Are Similar

By this I mean there is a natural-ish map between certain kinds of states in my brain and another conscious being’s brain: things like positive/negative valence and preferences over states.

This needs to hold across all conscious entities, not just ones which evolved from a single common ancestor, or even ones which evolved at all. This is kinda possible to investigate if consciousness really has evolved multiple times independently (e.g. in primates, corvids, and cephalopods).

AIs Notice Other Consciousnesses

This requires that AIs notice the natural mapping between their own states of mind and humans’. It might also require that this mapping is actually useful for modeling humans.

This might be the case even for very smart AIs, which could in principle model us without reference to consciousness, so long as it’s computationally simpler to model us as conscious entities.

AIs Generalize to Universalism

This requires the AI to pull itself together into something coherent and generalize to a utility function which looks like “satisfy the preferences of all conscious beings”.

Suppose the AI uses the same architecture (or even the same neural network) for motivation and for epistemics. A good epistemic system is one which generalizes well by moving towards simpler hypotheses, so if the value-based reasoning is happening in the same network, the values might generalize in the same way.
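
Here’s a minimal sketch of the kind of setup I have in mind, assuming something vaguely like a shared network with an epistemic head and a value head (the class and parameter names are my own placeholders, not anything from a real system). The point is just that any simplicity pressure on the shared trunk shapes the representations both heads rely on.

```python
# Toy illustration: one network whose shared trunk feeds both an epistemic
# head ("what is true") and a value head ("what is good"). Simplicity
# pressure on the shared trunk (here, plain weight decay) acts on the
# representations that both heads depend on.
import torch
import torch.nn as nn

class SharedMindSketch(nn.Module):
    def __init__(self, obs_dim: int, hidden: int, n_predictions: int):
        super().__init__()
        self.trunk = nn.Sequential(          # shared by both heads
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.epistemic_head = nn.Linear(hidden, n_predictions)  # world-model outputs
        self.value_head = nn.Linear(hidden, 1)                  # motivation output

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        return self.epistemic_head(features), self.value_head(features)

model = SharedMindSketch(obs_dim=32, hidden=64, n_predictions=10)
# Weight decay is a crude simplicity prior over *all* parameters, so pressure
# towards simple world-models is also pressure towards simple values.
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=1e-2)
```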

If the AI just wants to satisfy its own preferences, then it probably won’t generalize to universalism. But if the AI has a few internal drives which push it towards satisfying the preferences of other agents (like those instilled in the previous setting), then it’s in some sense simpler to satisfy the preferences of all agents rather than just a few.
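
To make that “in some sense simpler” a bit more concrete, here is a crude toy framing of my own (not from any real system): a goal aimed at a particular set of agents has to carry that set around as extra content, while a universal goal just reuses a predicate the epistemics plausibly already needs.

```python
# Toy comparison: a "particular" utility must specify which agents count,
# whereas a "universal" utility only reuses an existing consciousness
# predicate and adds no agent list of its own.
from typing import Callable, Iterable

Agent = str

def is_conscious(agent: Agent) -> bool:
    # Stand-in for a consciousness detector the world-model already has.
    return agent != "rock"

def particular_utility(privileged: list[Agent]) -> Callable[[Iterable[Agent]], int]:
    # Extra content: the privileged list itself has to be encoded somewhere.
    return lambda satisfied: sum(a in privileged for a in satisfied)

def universal_utility() -> Callable[[Iterable[Agent]], int]:
    # Nothing to encode beyond the predicate the epistemics already supplies.
    return lambda satisfied: sum(is_conscious(a) for a in satisfied)

world = ["self", "operator_1", "lab_dog", "rock"]
print(particular_utility(["self", "operator_1"])(world))  # 2
print(universal_utility()(world))                         # 3
```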

This Generalization Is Friendly

If the AI generalizes to negative hedonic utilitarianism, we’re screwed, since it would just blow up the universe. Even with positive hedonic utilitarianism, it might put us all on heroin drips forever. For this to go well, it probably needs to generalize to some sort of preference utilitarianism, and then extrapolate that forward over time (e.g. this human would prefer to be smarter and healthier, so I will do that; now they would prefer…).
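
The extrapolation step can be sketched as a simple loop, assuming some hypothetical way to read off a person’s current preference and act on it (every name here is an illustrative placeholder, not a proposal for how to actually do this).

```python
# Toy sketch of preference extrapolation over time: act on the person's
# current preference, then re-ask what the now-improved person prefers next.
from dataclasses import dataclass, field

@dataclass
class Person:
    smarts: int = 0
    health: int = 0
    history: list = field(default_factory=list)

def current_preference(person: Person) -> str:
    # Hypothetical stand-in for "model what this person would want right now".
    return "health" if person.health <= person.smarts else "smarts"

def satisfy(person: Person, preference: str) -> None:
    setattr(person, preference, getattr(person, preference) + 1)
    person.history.append(preference)

def extrapolate(person: Person, steps: int) -> Person:
    # Each step consults the *updated* person, rather than freezing
    # whatever they wanted at the start.
    for _ in range(steps):
        satisfy(person, current_preference(person))
    return person

print(extrapolate(Person(), steps=4).history)  # ['health', 'smarts', 'health', 'smarts']
```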

Overall

Overall, this seems to me to be a pretty unlikely turn of events. I don’t see a fatal flaw as it stands, but I expect the real problem is that it’s a case of privileging the hypothesis.

The one microscopic piece of evidence in its favour is that certain types of smart humans seem especially predisposed to universalist utilitarian moral systems (although many are not), but this tells us very little about the construction of non-human minds.