Why the AI Alignment Problem Might Be Unsolvable

Author’s note 1:

The fol­low­ing is a chap­ter from the story I’ve been writ­ing which con­tains, well, it con­tains what I think is prob­a­bly a proof that the value al­ign­ment prob­lem is un­solv­able. I know it sounds crazy, but as far as I can tell the proof seems to be cor­rect. There are fur­ther sup­port­ing de­tails which I can ex­plain if any­one asks, but I didn’t want to over­load you guys with too much in­for­ma­tion at once, since a lot of those ad­di­tional sup­port­ing de­tails would re­quire ar­ti­cles of their own to ex­plain.

One of my friends, who I shall not name, came up with what we think is also a proof, but it’s longer and more de­tailed and he hasn’t de­cided whether to post it.

I haven’t had time yet to ex­tract my own less de­tailed ver­sion from the nar­ra­tive di­alogue of my story, but I thought it was re­ally im­por­tant that I share it here as soon as pos­si­ble, since if I’m right, the more time wasted on AI re­search, the less time we have to come up with strate­gies and solu­tions that could more effec­tively pre­vent x-risk long term.

Author’s note 2:

This post was origi­nally more strongly worded, but I ed­ited it to tone it down a lit­tle. While those who have read Inad­e­quate Equil­ibria might con­sider that to be “epistemic hu­mil­ity” and there­fore dark­side episte­mol­ogy, I’m wor­ried not enough peo­ple on here will have read that book. Fur­ther­more, the hu­man brain, par­tic­u­larly sys­tem 1, evolved to win poli­ti­cal ar­gu­ments in the an­ces­tral en­vi­ron­ment. I’m not sure sys­tem 1 is biolog­i­cally ca­pa­ble of un­der­stand­ing the fact that epistemic hu­mil­ity is bad episte­mol­ogy. And the con­tents of this post are likely to pro­voke strong emo­tional re­ac­tions, as it pos­tu­lates that a par­tic­u­lar be­lief is false, a be­lief which ra­tio­nal­ists at large have in­vested a LOT of en­ergy, re­sources and rep­u­ta­tion into. I feel more cer­tain that the con­tents of this post are cor­rect than is wise to ex­press in a con­text likely to trig­ger strong emo­tions. Please keep this in mind. I’m be­ing up­front with you about ex­actly what I’m do­ing and why.

Author’s note 3:

Also, HEAVY SPOILERS for the story I’ve been writ­ing, Earth­lings: Peo­ple of the Dawn. This chap­ter is liter­ally the last chap­ter of part 5, af­ter which the re­main­ing parts are ba­si­cally ex­tended epi­logues. You have been warned. Also, I ed­ited the chap­ter in re­sponse to the com­ments to make things more clear.



There were guards stand­ing out­side the en­trance to the Ra­tion­al­ity In­sti­tute. They saluted Ber­tie as he ap­proached. Ber­tie nod­ded to them as he walked past. He reached the front doors and turned the han­dle, then pul­led the door open.

He stepped in­side. There was no one at the front desk. All the lights were on, but he didn’t hear any­one in the rooms he passed as he walked down the hal­lway, ap­proach­ing the door at the end.

He fi­nally stood be­fore it. It was the door to Thato’s office.

Ber­tie knocked.

“Come in,” he heard Thato say from the other side.

Bertie turned the knob with a sweaty hand and pushed inwards. He stepped inside, hoping that whatever Thato wanted to talk to him about, it wasn’t an imminent existential threat.

“Hello Ber­tie,” said Thato, somberly. He looked sweaty and tired, with bags un­der his puffy red eyes. Had he been cry­ing?

“Hi Thato,” said Ber­tie, gen­tly shut­ting the door be­hind him. He pul­led up a chair across from Thato’s desk. “What did you want to talk to me about?”

“We finished an­a­lyz­ing the re­search notes on the chip you gave us two years ago,” said Thato, dully.

“And?” asked Ber­tie. “What did you find?”

“It was com­pli­cated, it took us a long time to un­der­stand it,” said Thato. “But there was a proof in there that the value al­ign­ment prob­lem is un­solv­able.”

There was a pause, as Ber­tie’s brain tried not to pro­cess what it had just heard. Then…

“WHAT!?” Bertie shouted.

“We should have re­al­ized it ear­lier,” said Thato. Then in an ac­cusatory tone, “In fact, I think you should have re­al­ized it ear­lier.”

“What!?” de­manded Ber­tie. “How? Ex­plain!”

“The research notes contained a reference to a children’s story you wrote: A Tale of Four Moralities,” Thato continued, his voice rising. “It explained what you clearly already knew when you wrote it, that there are actually FOUR types of morality, each of which has a different game-theoretic function in human society: Eye for an Eye, the Golden Rule, Maximize Flourishing and Minimize Suffering.”

“Yes,” said Ber­tie. “And how does one go from that to ‘the Value Align­ment prob­lem is un­solv­able’?”

“Do you not see it!?” Thato de­manded.

Ber­tie shook his head.

Thato stared at Ber­tie, dumb­founded. Then he spoke slowly, as if to an idiot.

“Game the­ory de­scribes how agents with com­pet­ing goals or val­ues in­ter­act with each other. If moral­ity is game-the­o­retic by na­ture, that means it is in­her­ently de­signed for con­flict re­s­olu­tion and ei­ther main­tain­ing or achiev­ing the uni­ver­sal con­di­tions which help fa­cil­i­tate con­flict re­s­olu­tion for all agents. In other words, the whole pur­pose of moral­ity is to make it so that agents with com­pet­ing goals or val­ues can co­ex­ist peace­fully! It is some­what more com­pli­cated than that, but that is the gist.”

“I see,” said Bertie, his brows furrowed in thought. “Which means that human values, or at least the individual non-morality-based values, don’t converge, which means that you can’t design an artificial superintelligence that contains a term for all human values, just the moral values.”

Then Ber­tie had a sink­ing, hor­rified feel­ing ac­com­panied by a fright­en­ing in­tu­ition. He didn’t want to be­lieve it.

“Not quite,” said Thato cut­tingly. “Have you still not re­al­ized? Do you need me to spell it out?”

“Hold on a mo­ment,” said Ber­tie, try­ing to calm his rac­ing anx­iety.

What is true is already so, Ber­tie thought.

Own­ing up to it doesn’t make it worse.

Not be­ing open about it doesn’t make it go away.

And be­cause it’s true, it is what is there to be in­ter­acted with.

Peo­ple can stand what is true, for they are already en­dur­ing it.

Ber­tie took a deep breath as he con­tinued to re­cite in his mind…

If some­thing is true, then I want to be­lieve it is true.

If some­thing is not true, then I want not to be­lieve it is true.

Let me not be­come at­tached to be­liefs I may not want.

Ber­tie ex­haled, still over­whelm­ingly anx­ious. But he knew that putting off the rev­e­la­tions any longer would make it even harder to have them. He knew the thought he could not think would con­trol him more than the thought he could. And so he turned his mind in the di­rec­tion it was afraid to look.

And the epipha­nies came pour­ing out. It was a stream of con­scious­ness, no—a wa­ter­fall of con­scious­ness that wouldn’t stop. Ber­tie went from one log­i­cal step to the next, a nearly perfect dance of rigor­ously trained self-hon­esty and com­mon sense—im­perfect only in that he had waited so long to start it, to no­tice.

“So you can’t pro­gram an in­tel­li­gence to be com­pat­i­ble with all hu­man val­ues, only hu­man moral val­ues,” Ber­tie said in a rush. “Ex­cept even if you pro­grammed it to only be com­pat­i­ble with hu­man moral val­ues, there are four types of moral­ity, so you’d have four sep­a­rate and com­pet­ing util­ity func­tions to pro­gram into it. And if you did that, the in­tel­li­gence would self-edit to re­solve the in­con­sis­ten­cies be­tween its goals and that would just cause it to op­ti­mize for con­flict re­s­olu­tion, and then it would just tile the uni­verse with tiny ar­tifi­cial con­flicts be­tween ar­tifi­cial agents for it to re­solve as quickly and effi­ciently as pos­si­ble with­out let­ting those agents do any­thing them­selves.”

“Right in one,” said Thato with a grimace. “And as I am sure you already know, turning a human into a superintelligence would not work either. Human values are not sufficiently stable. Yuuto deduced in his research that human values are instrumental all the way down, never terminal. Some values are merely more or less instrumental than others. That is why human values are over patterns of experiences, which are four-dimensional processes, rather than over individual destinations, which are three-dimensional end states. This is a natural implication of the fact that humans are adaptation executors rather than fitness maximizers. If you program a superintelligence to protect humans from death, grievous injury or other forms of extreme suffering without infringing on their self-determination, that superintelligence would by definition have to stay out of human affairs under most circumstances, only intervening to prevent atrocities like murder, torture or rape, or to deal with the occasional existential threat and so on. If the superintelligence was a modified human it would eventually go mad with boredom and loneliness, and it would snap.”

Thato con­tinued. “On the other hand, if a su­per­in­tel­li­gence was ar­tifi­cially de­signed it could not be pro­grammed to do that ei­ther. In­tel­li­gences are by their very na­ture op­ti­miza­tion pro­cesses. Hu­mans typ­i­cally do not re­al­ize that be­cause we each have many op­ti­miza­tion crite­ria which of­ten con­flict with each other. You can­not pro­gram a gen­eral in­tel­li­gence with a fun­da­men­tal drive to ‘not in­ter­vene in hu­man af­fairs ex­cept when things are about to go dras­ti­cally wrong oth­er­wise, where dras­ti­cally wrong is defined as ei­ther rape, tor­ture, in­vol­un­tary death, ex­treme de­bil­ity, poverty or ex­is­ten­tial threats’ be­cause that is not an op­ti­miza­tion func­tion.”

“So, to sum­ma­rize,” Ber­tie be­gan, slowly. “The very con­cept of an om­nibenev­olent god is a con­tra­dic­tion in terms. It doesn’t cor­re­spond to any­thing that could ex­ist in any self-con­sis­tent uni­verse. It is log­i­cally im­pos­si­ble.”

“Hind­sight is twenty-twenty, is it not?” asked Thato rhetor­i­cally.


“So what now?” asked Ber­tie.

“What now?” re­peated Thato. “Why, now I am go­ing to spend all of my money on frivolous things, con­sume co­pi­ous amounts of al­co­hol, say any­thing I like to any­one with­out re­gard for their feel­ings or even safety or com­mon sense, and wait for the end. Even­tu­ally, likely soon, some twit is go­ing to build a God, or blow up the world in any num­ber of other ways. That is all. It is over. We lost.”

Ber­tie stared at Thato. Then in a quiet, dan­ger­ous voice he asked, “Is that all? Is that why you sent me a mes­sage say­ing that you ur­gently wanted to meet with me in pri­vate?”

“Surely you see the benefit of do­ing so?” asked Thato. “Now you no longer will waste any more time on this fruitless en­deavor. You too may re­lax, drink, be merry and wait for the end.”

At this point Ber­tie was seething. In a de­cep­tively mild tone he asked, “Thato?”

“Yes?” asked Thato.

“May I have per­mis­sion to slap you?”

“Go ahead,” said Thato. “It does not mat­ter any­more. Noth­ing does.”

Ber­tie leaned over the desk and slapped Thato across the face, hard.

Thato seized Ber­tie’s wrist and twisted it painfully.

“That bloody hurt, you git!”

“I thought you said noth­ing mat­ters!?” Ber­tie de­manded. “Yet it clearly mat­ters to you whether you’re slapped.”

Thato re­leased Ber­tie’s wrist and looked away. Ber­tie mas­saged his wrist, try­ing to make the lin­ger­ing sting go away.

“Are you done be­ing an idiot?” he asked.

“Define ‘idiot’,” said Thato scathingly, still not look­ing at him.

“You know perfectly well what I mean,” said Ber­tie.

Thato ig­nored him.


Ber­tie clenched his fists.

“In the let­ter Yu­uto gave me be­fore he died, he told me that the knowl­edge con­tained in that chip could spell Hu­man­ity’s vic­tory or its defeat,” he said an­grily, eyes blaz­ing with de­ter­mi­na­tion. “Do you get it? Yu­uto thought his re­search could ei­ther de­stroy or save hu­mankind. He wouldn’t have given it to me if he didn’t think it could help. So I sug­gest you and your staff get back to an­a­lyz­ing it. We can figure this out, and we will.”

Ber­tie turned around and stormed out of the office.

He did not look back.