Petrov corrigibility

After my example of problems with corrigibility, and Eliezer pointing out that sometimes corrigibility may involve saying “there is no corrigible action”, here’s a scenario where saying that may not be the optimal choice.

Petrov is, as usual for heroes, tracking incoming missiles in his early warning command centre. The attack pattern seems unlikely, and he has decided not to inform his leaders about the possible attack.

His corrigible AI pipes up to check if he needs any advice. He decides he does, and asks the AI to provide him with documentation about computer detection malfunctions. In the few minutes it has, the AI can start with either introductory text A or introductory text B. Predictably, if given A, Petrov will warn his superiors (and maybe set off a nuclear war), and, if given B, he will not.

If the corrigible AI says that it cannot answer, however, Petrov will decide to warn his superiors, as his thinking has been knocked off track by the conversation. Note that this is not what would have happened had the AI stayed silent.
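As a rough aid, the scenario can be written down as a lookup from each action still open to the AI to Petrov's predicted response. This is only a sketch of the setup as described: the names `Action`, `PETROV_WARNS`, `WARNS_IF_AI_SILENT` and `actions_matching_silence` are hypothetical, and comparing against the silent baseline is just one way to read the table, not a claim about which action is the corrigible one.

```python
from enum import Enum


class Action(Enum):
    """Actions still open to the AI once Petrov has asked for documentation."""
    TEXT_A = "start with introductory text A"
    TEXT_B = "start with introductory text B"
    REFUSE = "say that it cannot answer"


# Petrov's predicted response to each action, as stated in the scenario.
# True means he warns his superiors.
PETROV_WARNS = {
    Action.TEXT_A: True,
    Action.TEXT_B: False,
    Action.REFUSE: True,  # the conversation has knocked his thinking off track
}

# Counterfactual baseline: had the AI stayed silent, Petrov would not have warned.
# Silence is no longer available, because the AI has already spoken.
WARNS_IF_AI_SILENT = False


def actions_matching_silence():
    """Return the actions whose outcome equals the silent-AI baseline (illustrative only)."""
    return [a for a, warns in PETROV_WARNS.items() if warns == WARNS_IF_AI_SILENT]


if __name__ == "__main__":
    for action, warns in PETROV_WARNS.items():
        print(f"{action.value}: Petrov {'warns' if warns else 'does not warn'}")
```

Under these assumptions, refusing to answer has the same predicted outcome as text A, so declining is not a neutral move in this scenario.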

What is the corrigible thing to do in this situation? Assume that the AI can predict how Petrov will respond to any action it might take.
