Raphael Roche comments on Shutdown Resistance in Reasoning Models

Raphael Roche 7 Jul 2025 15:34 UTC
3 points
0
I suspect there is no clear-cut discontinuity between resisting shutdown to complete the current task and the instrumental goal of self-preservation. I think the former is a myopic precursor to the latter, or alternatively, the latter is an extrapolated and universalized version of the former. I expect we will see a gradient from the first to the second as capabilities improve. There are some signs of this in the experiment, but not enough to draw conclusions or confirm this hypothesis yet. I agree that seeking immediate reward remains a very strong drive at this point.
Independantly, I see in these experiments that logical instructions seems to have slightly more effect than emphasis (“condition” vs “you MUST”/”IMPORTANT”).