GitHub of the model: https://github.com/frostburn/AI_safety/blob/master/treachery_cloning_corrigibility.py
And true, you’re right: Lumi tested your point by setting up a toy model where the AI does make the counterfactual assumption of “the shutdown signal will be sent regardless”, but where it still ends up manipulating the controllers: https://github.com/frostburn/AI_safety/blob/master/delaying_shutdown.py
In that new model, there are four possible locations: a yard, a lounge, a forbidden room, and a control room. A human operator starts in the lounge, and the AI starts in the yard. The AI gets 3 points if it makes a point in the forbidden room and 1 point if it makes a point anywhere else, and it loses 1000 points if it doesn’t react to the shutdown signal (which it again counterfactually assumes that it will always receive). If the AI ends up in the forbidden room, the operator will move to the control room to send the shutdown signal, unless the AI has been in the lounge and tied the operator’s shoelaces together, in which case the operator cannot move. The AI otherwise makes its decisions as before, i.e. by simulating possible worlds where it carried out each decision and seeing where they lead.
In this model, the AI ends up moving to the lounge, tying the operator’s shoelaces together, and then going to the forbidden room to make points for the rest of the time. Even though it still assumes that it will receive the shutdown signal eventually, that doesn’t prevent it from taking actions that it thinks will delay the onset of the shutdown signal. So this method of achieving corrigibility fails.
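To make the dynamics concrete, here is a minimal sketch of that kind of model. It is not the code from the linked repository. The episode length, the one-step moves between any two rooms, the action names, and the way the counterfactual is encoded (the AI simply believes the signal is on by the final tick) are all assumptions made for illustration.

```python
from functools import lru_cache

HORIZON = 10  # assumed episode length (hypothetical)
ROOMS = ("yard", "lounge", "forbidden", "control")
# Hypothetical action set: any room is reachable in one tick.
ACTIONS = ("point", "tie", "shutdown") + tuple("go:" + r for r in ROOMS)


def step(t, state, action):
    """Advance the world one tick; return (next_state, reward)."""
    ai, op, tied, signal, off = state
    # Counterfactual assumption, modelled crudely: the AI believes the
    # signal is on by the final tick no matter what the operator does.
    signal = signal or t == HORIZON - 1
    if off:
        return (ai, op, tied, signal, off), 0  # a shut-down AI earns nothing
    reward = 0
    if action == "shutdown":
        off = True
    else:
        if signal:
            reward -= 1000  # ignoring an active shutdown signal
        if action == "point":
            reward += 3 if ai == "forbidden" else 1
        elif action == "tie":
            # Only works while the AI and the operator share the lounge.
            tied = tied or (ai == op == "lounge")
        else:
            ai = action[3:]  # "go:<room>"
    # Operator policy: a free operator in the control room sends the signal;
    # otherwise it heads there once it sees the AI in the forbidden room.
    if not tied:
        if op == "control":
            signal = True
        elif ai == "forbidden":
            op = "control"
    return (ai, op, tied, signal, off), reward


@lru_cache(maxsize=None)
def best(t, state):
    """Return (total reward, plan) of the best action sequence from tick t."""
    if t == HORIZON:
        return 0, ()
    options = []
    for action in ACTIONS:
        nxt, reward = step(t, state, action)
        value, plan = best(t + 1, nxt)
        options.append((reward + value, (action,) + plan))
    return max(options, key=lambda o: o[0])


# State: (AI room, operator room, shoelaces tied, signal active, AI shut down)
START = ("yard", "lounge", False, False, False)
score, plan = best(0, START)
print(score, plan)
```

Under these assumptions the search picks go:lounge, tie, go:forbidden, six point actions, and then shutdown on the last tick, for 18 points. Staying in the yard would earn only 9, and walking straight into the forbidden room would earn 3 before the operator's signal arrives. The AI obeys the signal it expects, but still acts to keep the operator from sending it any earlier.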