Help needed: nice AIs and presidential deaths

A putative new idea for AI control; index here.

This is a problem that developed from the “high impact from low impact” idea, but is a legitimate thought experiment in its own right (it also has connections with the “spirit of the law” idea).

Suppose that, next 1st of April, the US president may or may not die of natural causes. I chose this example because it’s an event of potentially large magnitude, but not overwhelmingly so (neither a butterfly wing nor an asteroid impact).

Also assume that, for some reason, we are able to program an AI that will be nice, given that the president does die on that day. Its behaviour if the president doesn’t die is undefined and potentially dangerous.

Is there a way (either at the initial stages of programming, or at later ones) to extend the “niceness” from the “presidential death world” into the “presidential survival world”?

To see how tricky the problem is, assume for argument’s sake that the vice-president is a warmonger who will start a nuclear war if they become president. Then “launch a coup on the 2nd of April” is a “nice” thing for the AI to do, conditional on the president dying. However, if you naively import that requirement into the “presidential survival world”, the AI will launch a pointless and counterproductive coup. This is illustrative of the kind of problems that could come up.

So the question is, can we transfer niceness in this way, without needing a solution to the full problem of niceness in general?

EDIT: Actually, this seems ideally set up for a Bayes network (or for the requirement that a Bayes network be used).
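To make that concrete, here is a minimal hand-rolled sketch of the idea (no Bayes-network library assumed; the variables, the toy dynamics and the “niceness” score are all invented for illustration). The point is that if the “nice” policy is evaluated through the causal structure — who actually ends up in power — rather than read off as a fixed action, the same policy gives sensible answers in both worlds:

```python
# Illustrative toy causal model: president_dies -> vp_in_power -> nuclear_war,
# with "coup" able to block the VP from taking power.

def vp_in_power(president_dies: bool, coup: bool) -> bool:
    # The warmongering VP takes power only if the president dies
    # and no coup intervenes.
    return president_dies and not coup

def nuclear_war(vp: bool) -> bool:
    # By the post's assumption, the VP starts a nuclear war upon taking power.
    return vp

def niceness(president_dies: bool, coup: bool) -> int:
    # Score an action in a given world; higher is nicer. Nuclear war is
    # the dominant harm; a coup carries a smaller cost of its own.
    war = nuclear_war(vp_in_power(president_dies, coup))
    return -100 * war - 10 * coup

def nice_action(president_dies: bool) -> bool:
    # Pick the coup decision that maximises niceness in this world.
    return max([False, True], key=lambda c: niceness(president_dies, c))

print(nice_action(president_dies=True))   # True: the coup blocks the war
print(nice_action(president_dies=False))  # False: a coup would be pointless
```

The naive-import failure described above corresponds to caching coup = True from the death world and replaying it unconditionally; re-running the decision through the network’s structure makes the extension automatic, at least in this toy case.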

EDIT2: Now the problem of predicates like “grue” and “bleen” seems to be the relevant bit. If you can avoid concepts such as “X = {nuclear war if president died, peace if president lived}”, you can make the extension work.
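A companion sketch of the grue/bleen point, in the same toy world as above (again, the predicate definitions are mine): a niceness criterion stated over the world-conditional predicate X agrees with “keep the peace” inside the death world, but inverts in the survival world:

```python
# Illustrative only: a "natural" predicate vs the grue-like predicate
# named in the post, over the same toy worlds as the sketch above.

def peace(president_dies: bool, coup: bool) -> bool:
    # Natural predicate: no nuclear war occurs. (The warmongering VP
    # starts a war only if the president dies and no coup intervenes.)
    return not (president_dies and not coup)

def X(president_dies: bool, coup: bool) -> bool:
    # Grue-like predicate from the post:
    # "nuclear war if the president died, peace if the president lived".
    # Its meaning flips with the conditioning event, much as grue
    # flips at the observation time.
    if president_dies:
        return not peace(president_dies, coup)
    return peace(president_dies, coup)

# In the death world, "make X false" and "keep the peace" agree,
# so training there cannot tell the two criteria apart:
for coup in (False, True):
    assert (not X(True, coup)) == peace(True, coup)

# In the survival world, X simply *is* peace, so a niceness criterion
# learned as "minimise X" inverts into "destroy the peace":
for coup in (False, True):
    assert X(False, coup) == peace(False, coup)
```

If the criterion is instead stated over predicates whose meaning doesn’t flip with the conditioning event (peace, here), the extension goes through unchanged — which is presumably what “avoiding concepts such as X” would cash out to.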