Petrov corrigibility

After my example of problems with corrigibility, and Eliezer pointing out that sometimes corrigibility may involve saying "there is no corrigible action", here's a scenario where saying that may not be the optimal choice.

Petrov is, as usual for heroes, tracking incoming missiles in his early warning command centre. The attack pattern seems unlikely, and he has decided not to inform his leaders about the possible attack.

His corrigible AI pipes up to check if he needs any advice. He decides he does, and asks the AI to provide him with documentation about computer detection malfunctions. In the few minutes it has, the AI can start with either introductory text A or introductory text B. Predictably, if given A, Petrov will warn his superiors (and maybe set off a nuclear war), and, if given B, he will not.

If the corrigible AI says that it cannot answer, however, Petrov will decide to warn his superiors, as his thinking has been knocked off track by the conversation. Note that this is not what would have happened had the AI stayed silent.

What is the corrigible thing to do in this situation? Assume that the AI can predict Petrov's choice for every action it could itself take.
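The structure of the dilemma can be made explicit as a small decision table. This is a minimal sketch, not anything from the original post: the action names and the outcome mapping are hypothetical labels for the predictions stipulated in the scenario above, and the point is simply that every available action, including refusing to answer, steers Petrov toward one outcome or the other.

```python
# Hypothetical decision table for the Petrov scenario. Each action the AI
# could take maps to the scenario's stipulated prediction of what Petrov
# will then do. There is no "prediction-free" option for the AI.
PREDICTIONS = {
    "give_text_A": "warns superiors",        # may set off nuclear war
    "give_text_B": "does not warn",
    "say_no_corrigible_action": "warns superiors",  # thinking knocked off track
    "stay_silent": "does not warn",          # the counterfactual baseline
}

def predicted_outcome(action: str) -> str:
    """Return the scenario's stipulated prediction for a given AI action."""
    return PREDICTIONS[action]

# Every choice, including silence or refusal, determines an outcome:
for action, outcome in PREDICTIONS.items():
    print(f"{action}: Petrov {outcome}")
```

The table makes the difficulty visible: "there is no corrigible action" is itself an action with a predictable effect, and it differs from the effect of staying silent.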