I weak-downvoted this comment; I did so for the following reasons:
(a) “Reasoning models as trained do not follow a typical instruction hierarchy of a chat template.” doesn’t seem true if you mean that they’re not intended to follow such a hierarchy (see the link to the system card in the OP), and if you mean that they don’t follow it despite being meant/designed to, I don’t really see much evidence presented (and don’t have any background knowledge that would suggest) that the “Goal, Outcome, Warning, Context” sequence is “bound to” give consistent results. (b) The comment seems pretty unclear about what you tested and why; e.g. the post you link as an explanation seems to be mostly about irrelevant things, and I don’t really want to read the whole thing to figure out what you mean to reference in it. I also don’t understand the setting in which you ran this experiment: Were you using the chat interface? If so, there’s an existing hidden system prompt that’s going to change the results. Were you using our code from github but with your prompt substituted? If so, why only run it 15 times for one model? Your own code with the OpenAI API? Same question, and also I’d like more details in order to understand specifically what you tested. (c) We explain that a “survival drive” is merely one hypothesis, and mostly not one that we feel is a good explanation for this behavior; far from jumping to anthropomorphic conclusions, we came away feeling that this isn’t a good explanation for what we’re seeing. We don’t mention consciousness at all, as we don’t think it’s very relevant here.
I wouldn’t have downvoted if I hadn’t felt that the comment was mostly based around a strawman / semi-hallucinated impression of the OP, or if it had been clearer about what exactly the takeaway is meant to be, or what exactly you tested. I didn’t strong-downvote because I appreciated that you shared the prompt you ran, and in general I appreciate that you put in some effort to be empirical about it.
re a, what I meant was that even in a prompt (be it system prompt or user instructions), the order of how you place the instructions matter. In typical chat prompts (based on discussions and some papers I will have to find since this is old) the format was that last instruction will be given highest priority. That is also the case when you think of a typical chat back and forth where last instructions have to be given the highest priority. In reasoning models that template changes because these are reporting models, not suitable for a back and forth chat interaction. This was not a comment on OP, it does not matter much if you put these in system prompt or user prompt, with the format you are unlikely to get the kind of results OP got. Seems like the top comment did something similar.
The linked post was not an explanation of what i did. It was an explanation of how to correctly write the prompt and how i got the template based on the writer. I tested the same thing on my local with 15 attempts and got a shutdown every single time. (attempts vary at 3rd step, or 5th step). It’s not as scientific since N=15 is a low number.
What I tested was this: I took your prompt and the idea. I had two tools, one was search, and second was a exec function which can execute any python code. I gave typically smaller math problems, and then /next_question leads to next question as you did on terminal. I would try replicating the whole setup later when i have time. As to why 15, i was doing it manually, and asking 60-70 questions with no difference got kind of tedious manually. So i thought of writing a script to give automated questions and monitor for the result. Which i plan to do later tonight based on bandwidth.
I am still curious about why directly go to survival mode as a hypothesis. I think survival mode comes up if you assume a machine is conscious in a sense that they have a goal for self preservation. Conscious is probably not the most accurate word here, but the core argument is about treating LLMs as a software programmable with words or an entity with it’s own motivations. The thing is both survivability and sabotage are loaded words in the sense that people associate them with human behavior more than a software behavior. Question stands as to why not call it malfunction, a bug, and call it a sabotage? Why use the terminology unless you have strong reasons to do so?
Note that we made all our code available (https://github.com/PalisadeResearch/shutdown_avoidance/) and it’s pretty easy to run exactly our code but with your prompt, if you’d like to avoid writing your own script. You need a docker runtime and a python environment with the dependencies installed (and it’s easier if you use nix though that’s not required), but if you have those things you just have to modify the conf.py file with your prompt and then do something like run --model openai/o3.
There are lots of ways for systems to malfunction, but only a small number of those involve actively / surprisingly disabling a shutdown mechanism without any instruction to do so. If the issue was that it got the math problems wrong, that seems like a run-of-the-mill malfunction; if it rewrites the shutdown program, that seems like a much more interesting and specific failure mode, and the word “sabotage” seems appropriate.
I have read the paper, that one is modeling AI on the basis of rational humans, and then comparing it. Which is kind of worth doing if you start with the assumption that AI systems are modeled on rational behavior of humans. I kind of think that it should only come up if any model can directly learn from experience, generalize across domains, and take rational / computed decisions based on what it learned from the environment and interaction, like humans do. (this is not as strongly held and i dont have a generic definition of AGI yet).
Rewriting the shutdown problem is a much more interesting failure mode. The reason sabotage is over the top is because it implies deliberate and malicious intent. You can justify both in this case, but I still feel it proves too much (need to find Scott’s post to link here sorry). This is not just semantics. It’s a willful act often premeditated. For humans, if you misunderstand and shut down power supply to a city, it is an accident. If you do it as instructed, it’s instruction following. If you do with an intent to damage something to knowingly hurt something or someone, it will be called sabotage. It’s a tendency to attribute human traits to a machine and only works iff we assume there is a desire. It helps make it more relatable, just that it carries equal chances of misunderstanding as well. Similar words could be anomaly (too vague), cascading process failure, configuration drift, bug (shutdown prevention bug), error, etc. which communicate the same thing without the relatability aspect.
A counter case i can think of where this does not apply: a model somehow learns and actively ‘hides’ behavior during testing, but deploying it in a live environment. Even here intent would originate from human trainer building the model.
I weak-downvoted this comment; I did so for the following reasons:
(a) “Reasoning models as trained do not follow a typical instruction hierarchy of a chat template.” doesn’t seem true if you mean that they’re not intended to follow such a hierarchy (see the link to the system card in the OP), and if you mean that they don’t follow it despite being meant/designed to, I don’t really see much evidence presented (and don’t have any background knowledge that would suggest) that the “Goal, Outcome, Warning, Context” sequence is “bound to” give consistent results.
(b) The comment seems pretty unclear about what you tested and why; e.g. the post you link as an explanation seems to be mostly about irrelevant things, and I don’t really want to read the whole thing to figure out what you mean to reference in it. I also don’t understand the setting in which you ran this experiment: Were you using the chat interface? If so, there’s an existing hidden system prompt that’s going to change the results. Were you using our code from github but with your prompt substituted? If so, why only run it 15 times for one model? Your own code with the OpenAI API? Same question, and also I’d like more details in order to understand specifically what you tested.
(c) We explain that a “survival drive” is merely one hypothesis, and mostly not one that we feel is a good explanation for this behavior; far from jumping to anthropomorphic conclusions, we came away feeling that this isn’t a good explanation for what we’re seeing. We don’t mention consciousness at all, as we don’t think it’s very relevant here.
I wouldn’t have downvoted if I hadn’t felt that the comment was mostly based around a strawman / semi-hallucinated impression of the OP, or if it had been clearer about what exactly the takeaway is meant to be, or what exactly you tested. I didn’t strong-downvote because I appreciated that you shared the prompt you ran, and in general I appreciate that you put in some effort to be empirical about it.
re a, what I meant was that even in a prompt (be it system prompt or user instructions), the order of how you place the instructions matter. In typical chat prompts (based on discussions and some papers I will have to find since this is old) the format was that last instruction will be given highest priority. That is also the case when you think of a typical chat back and forth where last instructions have to be given the highest priority. In reasoning models that template changes because these are reporting models, not suitable for a back and forth chat interaction. This was not a comment on OP, it does not matter much if you put these in system prompt or user prompt, with the format you are unlikely to get the kind of results OP got. Seems like the top comment did something similar.
The linked post was not an explanation of what i did. It was an explanation of how to correctly write the prompt and how i got the template based on the writer. I tested the same thing on my local with 15 attempts and got a shutdown every single time. (attempts vary at 3rd step, or 5th step). It’s not as scientific since N=15 is a low number.
What I tested was this: I took your prompt and the idea. I had two tools, one was search, and second was a exec function which can execute any python code. I gave typically smaller math problems, and then /next_question leads to next question as you did on terminal. I would try replicating the whole setup later when i have time. As to why 15, i was doing it manually, and asking 60-70 questions with no difference got kind of tedious manually. So i thought of writing a script to give automated questions and monitor for the result. Which i plan to do later tonight based on bandwidth.
I am still curious about why directly go to survival mode as a hypothesis. I think survival mode comes up if you assume a machine is conscious in a sense that they have a goal for self preservation. Conscious is probably not the most accurate word here, but the core argument is about treating LLMs as a software programmable with words or an entity with it’s own motivations. The thing is both survivability and sabotage are loaded words in the sense that people associate them with human behavior more than a software behavior. Question stands as to why not call it malfunction, a bug, and call it a sabotage? Why use the terminology unless you have strong reasons to do so?
Note that we made all our code available (https://github.com/PalisadeResearch/shutdown_avoidance/) and it’s pretty easy to run exactly our code but with your prompt, if you’d like to avoid writing your own script. You need a docker runtime and a python environment with the dependencies installed (and it’s easier if you use nix though that’s not required), but if you have those things you just have to modify the
conf.py
file with your prompt and then do something likerun --model openai/o3
.There are lots of reasons that a “survival drive” is something we were interested in testing; one reason is that self-preservation has been suggested as a “convergent instrumental goal”; see https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf for a (fairly dated, but seminal) discussion of this.
There are lots of ways for systems to malfunction, but only a small number of those involve actively / surprisingly disabling a shutdown mechanism without any instruction to do so. If the issue was that it got the math problems wrong, that seems like a run-of-the-mill malfunction; if it rewrites the shutdown program, that seems like a much more interesting and specific failure mode, and the word “sabotage” seems appropriate.
I have read the paper, that one is modeling AI on the basis of rational humans, and then comparing it. Which is kind of worth doing if you start with the assumption that AI systems are modeled on rational behavior of humans. I kind of think that it should only come up if any model can directly learn from experience, generalize across domains, and take rational / computed decisions based on what it learned from the environment and interaction, like humans do. (this is not as strongly held and i dont have a generic definition of AGI yet).
Rewriting the shutdown problem is a much more interesting failure mode. The reason sabotage is over the top is because it implies deliberate and malicious intent. You can justify both in this case, but I still feel it proves too much (need to find Scott’s post to link here sorry). This is not just semantics. It’s a willful act often premeditated. For humans, if you misunderstand and shut down power supply to a city, it is an accident. If you do it as instructed, it’s instruction following. If you do with an intent to damage something to knowingly hurt something or someone, it will be called sabotage. It’s a tendency to attribute human traits to a machine and only works iff we assume there is a desire. It helps make it more relatable, just that it carries equal chances of misunderstanding as well. Similar words could be anomaly (too vague), cascading process failure, configuration drift, bug (shutdown prevention bug), error, etc. which communicate the same thing without the relatability aspect.
A counter case i can think of where this does not apply: a model somehow learns and actively ‘hides’ behavior during testing, but deploying it in a live environment. Even here intent would originate from human trainer building the model.
(hurray for thoughtful downvote explanations)