Supposing humans do sometimes shut the AI down even when it takes no action:
In my model of this, the AI’s objective is to shut down iff “historical” facts are such that the humans would have shut the AI down had it taken no action. For example, maybe humans are either “patient” or “impatient”: “patient” humans won’t shut down an AI that does nothing, while “impatient” humans will. The AI’s objective is then to figure out whether the humans are patient, and to shut down if they are not. The AI doesn’t care how the humans react to the actions it does take, except insofar as it can update on the humans’ actions to figure out how patient they are. So in some cases it will simply ignore the shutdown signal. Does this match your model?
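To make this concrete, here is a minimal sketch of the decision rule I have in mind, assuming a Bayesian agent with a prior over the humans’ type. The type labels, likelihoods, and threshold are hypothetical numbers of my own, not anything from a real proposal:

```python
# Sketch of the decision rule described above (all names/numbers hypothetical).
# The AI holds a prior over whether the humans are "patient" (would not shut
# down a passive AI) or "impatient" (would). Its objective: shut down iff the
# humans are impatient, i.e. iff they would have shut it down had it done nothing.

# Assumed likelihood of each observed human action under each human type.
LIKELIHOOD = {
    "patient":   {"press_button": 0.1, "wait": 0.9},
    "impatient": {"press_button": 0.8, "wait": 0.2},
}

def posterior_impatient(prior_impatient: float, observations: list[str]) -> float:
    """Bayesian update on the humans' observed actions to infer their type."""
    p_imp, p_pat = prior_impatient, 1.0 - prior_impatient
    for obs in observations:
        p_imp *= LIKELIHOOD["impatient"][obs]
        p_pat *= LIKELIHOOD["patient"][obs]
    return p_imp / (p_imp + p_pat)

def should_shut_down(prior_impatient: float, observations: list[str],
                     threshold: float = 0.5) -> bool:
    """Shut down iff the humans are (probably) the kind who would have shut
    down a do-nothing AI. Note the rule conditions on the inferred *type*,
    not on whether the button was actually pressed, so the AI may ignore a
    pressed button if it still believes the humans are patient."""
    return posterior_impatient(prior_impatient, observations) > threshold

# Example: the button was pressed once, but the prior strongly favors "patient",
# so the posterior on "impatient" is only ~0.30 and the signal is ignored.
print(should_shut_down(0.05, ["press_button"]))  # False
```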
Yep! I wrote a (hopefully clearer) explanation here: https://agentfoundations.org/item?id=927.
It covers your example at the end.