I have read the paper; it models AI on the basis of rational humans and then draws the comparison. That's worth doing if you start with the assumption that AI systems are modeled on the rational behavior of humans. I think the comparison should only come up once a model can directly learn from experience, generalize across domains, and take rational / computed decisions based on what it has learned from the environment and its interactions, the way humans do. (This is not a strongly held view, and I don't have a generic definition of AGI yet.)
Rewriting the shutdown problem gives a much more interesting failure mode. The reason "sabotage" is over the top is that it implies deliberate and malicious intent. You can justify both in this case, but I still feel it proves too much (need to find Scott's post to link here, sorry). This is not just semantics: sabotage is a willful act, often premeditated. For humans, if you misunderstand an instruction and shut down the power supply to a city, it's an accident. If you do it as instructed, it's instruction following. If you do it with the intent to damage something, to knowingly hurt something or someone, it's called sabotage. Applying the word to a machine attributes human traits to it, and it only works if we assume there is a desire behind the behavior. It makes the failure more relatable, but it carries an equal chance of misunderstanding. Alternative words could be anomaly (too vague), cascading process failure, configuration drift, bug (shutdown-prevention bug), error, etc., which communicate the same thing without the relatability aspect.
A counter-case I can think of where this does not apply: a model somehow learns to actively "hide" a behavior during testing but exhibits it when deployed in a live environment. Even here, the intent would originate from the human trainer building the model.