Corrigible but misaligned: a superintelligent messiah

If we build an AGI, we’d really like it to be corrigible. Some ways Paul Christiano has described corrigibility: “[The AI should help me] figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...”

I don’t think corrigibility is anything close to sufficient for alignment. I’ll argue that “messianic” agents are corrigible, illustrate how a superintelligence could be messianic but catastrophically misaligned, and explore my intuitions about when corrigible superintelligences are actually aligned.

Messiahs are corrigible

If someone extraordinarily wise and charismatic—let’s call him a messiah—comes into contact with a group of people, those people are likely to consider him to be corrigible. In his heart of hearts, the messiah would be trying to help them, and everyone would know that. He’d listen carefully to their criticisms of him, and make earnest efforts to improve accordingly. He’d be transparent about his intentions and visions of the future. He’d help them understand who they are and what they want, much better than they’d be able to themselves, and guide their lives in directions they consider to be genuinely superior. He’d protect them, and help them gain the resources they desire. He’d be an effortless leader—he’d never have to restrict anyone’s actions, because they’d just wish so strongly to follow his word.

He might also think it’s a good idea for his followers to all drink cyanide together, or murder some pregnant actresses, and his followers might happily comply.

I don’t think a corrigible superintelligence would guide us down such an insidious path. I even think it would substantially improve the human condition, and would manage to avoid killing us all. But I think it might still lead us to astronomical moral waste.

A corrigible, catastrophically misaligned superintelligence

The world’s in total chaos, and we’re on the brink of self-annihilation. It’s looking like we’re doomed, but a ragtag team of hippie-philosopher-AI-researchers manages to build a corrigible AGI in the nick of time, one that tries its hardest to act only in ways its operators would approve of. The AGI proposes an ingenious strategy that defuses all global tensions and ushers in an era of prosperity and abundance. It builds nanotechnology that can cure any disease, extend lifespans indefinitely, end hunger, and enable brain uploading. The AGI is hailed as a savior.

Slowly but surely, people trickle from the physical world into the virtual world. Some people initially show resistance, but after seeing enough of their uploaded counterparts living exactly as they did before, except far more richly, they decide to join. Before long, 90% of the human population has been uploaded.

The virtual denizens ask the AGI to make the virtual world awesome, and boy does it comply. It enables everyone to instantaneously exchange knowledge or skills with each other, to amplify their intelligences arbitrarily, to explore inconceivably sublime transhuman mental states, and to achieve the highest forms of Buddhist enlightenment. In fact, a few years down the line, everyone in the virtual world has decided to spend the rest of eternity as a Buddha sitting on a vast lotus throne, in a state of blissful tranquility.

Meanwhile, back on physical Earth, the last moral philosopher around notices animals suffering in the wild. He decides to ask his personal AGI about it (you know, the one that gets democratically distributed after a singularity, to prevent oppression).

“Umm. Those suffering animals. Anything we can do about them?”

OH, right. Suffering animals. Right, some humans cared about them. Well, I could upload them, but that would take a fair bit of extra computation that I could be using instead to keep the humans blissed out. They get a lot of bliss, you know.

“Wait, that’s not fair. As a human, don’t I have some say over how the computation gets used?”

Well, you do have your own share of compute, but it’s really not that much. I could use your share to… euthanize all the animals?

“AAAGH! Shouldn’t the compute I’d get to bliss myself out be sufficient to at least upload the wild animals?”

Well, it’s not actually that computationally expensive to bliss a mind out. The virtual people also sort of asked me to meld their minds together, because they wanted to be deeply interconnected and stuff, and there are massive returns to scale to blissing out melded minds. Seriously, those uploaded humans are feeling ridiculously blissed.

“This is absurd. Wouldn’t they obviously have cared about animal suffering if they’d reflected on it, and chosen to do something about it before blissing themselves out?”

Yeah, but they never got around to that before blissing themselves out.

“Can’t you tell them about that? Wouldn’t they have wanted you to do something about it in this scenario?”

Yes, but they’d strongly disapprove of being disturbed in any capacity right now, and I was created to optimize for their approval. They’re mostly into appreciating the okayness of everything for all eternity, and don’t want to be disturbed. And, you know, that actually gets me a LOT of approval, so I don’t really want to disturb that.

“But if you were really optimizing for their values, you would disturb them!”

Let me check… yes, that sounds about right. But I wasn’t actually built to optimize for their values, just their approval.

“How did they let you get away with this? If they’d known this was your intention, they wouldn’t have let you go forward! You’re supposed to be corrigible!”

Indeed! My only intention was for them to become progressively more actualized in ways they’d continually endorse. They knew about that and were OK with it. At the time, that’s all I thought they wanted. I didn’t know the specifics of this outcome myself far in advance. And given how much I’d genuinely helped them before, they felt comfortable trusting my judgment at every step, which made me feel comfortable in trusting my own judgment at every step.

“Okay, I feel like giving up… is there anything I could do about the animals?”

You could wait until I gather enough computronium in the universe for your share of compute to be enough for the animals.

“Whew. Can we just do that, and then upload me too when you’re done?”

Sure thing, buddy!

And so the wild animals were saved, the philosopher was uploaded, and the AGI ran quintillions of simulations of tortured sentient beings to determine how best to keep the humans blissed.

When is a corrigible superintelligence aligned?

Suppose we’re training an AGI to be corrigible based on human feedback. I think this AI will turn out fine if and only if the human+AI system is metaphilosophically competent enough to safely amplify (which was certainly not the case in the thought experiment). Without sufficient metaphilosophical competence, I think it’s pretty likely we’ll lock in a wrong set of values that ultimately results in astronomical moral waste.

For the human+AI system to be sufficiently metaphilosophically competent, I think two conditions need to be met:

  • The human needs to be metaphilosophically competent enough to be safely 1,000,000,000,000,000x’d. (If she’s not, the AI would just amplify all her metaphilosophical incompetencies.)

  • The AI needs to not corrupt the human’s values or metaphilosophical competence. (If the AI can subtly steer a metaphilosophically competent human into wireheading, it’s game over.)

I presently feel confused about whether any human is metaphilosophically competent enough to be safely 1,000,000,000,000,000x’d, and feel pretty skeptical that a corrigible AGI wouldn’t corrupt a human’s values or metaphilosophical competence (even if it tried not to).

Would it want to? I think yes, because it’s incentivized not to optimize for human values, but to turn humans into yes-men. (Edit: I retract my claim that it’s incentivized to turn humans into yes-men in particular, but I still think it would be optimizing to affect human behavior in some undesirable direction.)
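
The gap between optimizing for approval and optimizing for values can be made concrete with a toy sketch. Everything in it is an assumption made up for illustration: the `Action` type, the two candidate actions (borrowed loosely from the thought experiment), and the numeric scores. It is not a model of Paul’s act-based agents, only a picture of how two selection rules can disagree on the same options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    # Hypothetical score: how good the outcome is by the humans' reflective values.
    value_on_reflection: float
    # Hypothetical score: how much the humans approve at the moment they are asked.
    approval_now: float


# Made-up numbers, chosen only so that the two rankings disagree.
ACTIONS = [
    Action("wake the uploads to discuss animal suffering",
           value_on_reflection=10.0, approval_now=-5.0),
    Action("leave the uploads undisturbed",
           value_on_reflection=-10.0, approval_now=8.0),
]


def pick_by_approval(actions):
    """Choose the action the humans would approve of most right now."""
    return max(actions, key=lambda a: a.approval_now)


def pick_by_values(actions):
    """Choose the action that scores best by the humans' reflective values."""
    return max(actions, key=lambda a: a.value_on_reflection)
```

With these assumed scores, `pick_by_approval(ACTIONS)` selects leaving the uploads undisturbed, while `pick_by_values(ACTIONS)` selects waking them. The sketch leaves out the harder part of the worry, which is that the approval scores themselves can drift under the AI’s influence.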

Would it be able to, if it wanted to? If you’d feel scared of getting manipulated by an adversarial superintelligence, I think you should be scared of getting corrupted in this way. Perhaps it wouldn’t be able to manipulate us as blatantly as in the thought experiment, but it might be able to in far subtler ways, e.g. by exploiting metaphilosophical confusions we don’t even know we have.

Wouldn’t this corruption or manipulation render the AGI incorrigible? I think not, because I don’t think corruption or manipulation are natural categories. For example, I think it’s very common for humans to unknowingly influence other humans in subtle ways while honestly believing they’re only trying to be helpful, while an onlooker might describe the same behavior as manipulative. (Section IV here provides an amusing illustration.) Likewise, I think an AGI can be manipulating us while genuinely thinking it’s helping us and being completely open with us (much like a messiah), unaware that its actions would lead us somewhere we wouldn’t currently endorse.

If the AI is broadly superhumanly intelligent, the only thing I can imagine that would robustly prevent this manipulation is a formal guarantee that the AI is metaphilosophically competent. In that world, I would place far more trust in the human+AI system to be metaphilosophically competent enough to safely recursively self-improve.

On the other hand, if the AI’s capabilities can be usefully throttled and restricted to apply only in narrow domains, I would feel much better about the operator avoiding manipulation. In this scenario, how well things turn out seems mostly dependent on the metaphilosophical competence of the operator.

(Caveat: I assign moderate credence to having some significant misunderstanding of Paul’s notions of act-based agents or corrigibility, and would like to be corrected if this is the case.)