I do think Palisade is operating in the realm of “trying to persuade people of stuff”, and that is pretty fraught
Yes, it is “fraught” when an organization that wants to persuade people of X deliberately goes looking for empirical results that could be construed to persuade people of X. It doesn’t matter if X = “AI is dangerous” or “cigarettes are safe”. It doesn’t matter if X is true. (As it happens, I do think that AI is dangerous.) The empirical result not being faked is nice, I guess, but you couldn’t blame people for waiting for a third-party replication before putting any weight on it.
it’s not very surprising if it turns out I’m turning a blind eye to stuff that I would be more annoyed at if The Other Side were doing it. But I’m not currently seeing it.
[...]
I think there is a sense in which all demos like this are fake and not actually that cruxy, the actual arguments are [...]
But people also just don’t seem to be able to reason ahead about what Much Smarter Agents will be like [...]
It seems like you do see it, and you just don’t care?
The shutdown resistance paper might be a better example than the chess paper, because we got a followup from a third party: Rajamanoharan and Nanda examined Palisade’s shutdown resistance environment, and found that adding language to the prompt saying that the shutdown instruction “takes precedence over all other instructions” brought shutdown resistance to zero. This is in contrast to Palisade’s own followup experiments, which used different wording to attempt to instruct the models to comply with shutdown, and found reduced but not eliminated shutdown-resistance behavior.
It’s understandable (although not ideal) if Palisade didn’t happen to experiment with enough wordings to find what Rajamanoharan and Nanda did. (Of course, the fact that careful prompt wording is needed to eliminate shutdown resistance is itself a safety concern!) But the fact that the Palisade fundraiser post published five months later continues to claim that models “disabled shutdown scripts to keep operating [...] even when explicitly instructed not to” without mentioning Rajamanoharan and Nanda’s negative result (even to argue against it) is revealing.
And one hand, yeah I do think the rationalists need to be worried about this, esp. if we’re trying to claim to have a pure epistemic high ground.
The who? I’m not sure whom you’re including in this “we”, but the use of the conditional “especially if” implies that you people aren’t sure whether you want the pure epistemic high ground (because holding the high ground would make it harder to pursue your political objective). Well, thanks for the transparency. (Seriously! It’s much better than the alternative.) When people tell you who they are, believe them.
This take is a bit frustrating to me, because the preprint does discuss Rajamanoharan & Nanda’s result, and in particular when we tried Rajamanoharan & Nanda’s strongest prompt clarification on other models in our initial set, it didn’t in fact bring the rate to zero. Which is not to say that it would be impossible to find a prompt that brings the rate low enough to be entirely undetectable for all models—of course you could find such a prompt if you knew that you needed to look for one.
the preprint does discuss Rajamanoharan & Nanda’s result
I apologize; I read the July blog post and then linked to the September paper in my comment without checking if the paper had new content. I will endeavor to be less careless in the future.
So I did summon you cuz, like, obviously you’ll be representing this viewpoint.
But, what actually do you think honorable truthseeky people who also don’t want Bad Things to happen, are supposed to do, when:
you believe Bad Thing thing X is coming, for hard to convey theoretic reasons
people don’t really respond to theoretic reasons
you believe there is going to be concrete behavior Y you could point at that really do seem like evidence for dangerous thing X (but you/others haven’t yet done the schlepwork of demonstrating it)
you do nonetheless believe it is bad for the world to get filtered evidence
Obviously you can/should commit to showing the results of your empirical work even if it doesn’t support your claim. (I don’t actually know if Palisade does that, it would could as a mark against them IMO if they didn’t, although not overturn my overall take on them)
But, this isn’t that helpful since you probably only chose to run experiments that you’re pretty sure will come out the way you expect that help with your broader goals.
Actual serious question, I feel unsure about this.
What I currently think is “be clear/upfront about your organizational biases, and (maybe) advocate other groups who don’t have political goals also do research without the same filters.”
(Having written that out loud, @Jeffrey Ladish I do think it would be better if you did that sort of thing. I realize that’s a weird for this sort of paper to do but I’m guessing it is worth spending the weirdness points on it)
In a version of the shutdown resistance paper that’s currently being reviewed (not included in the preprint yet) the following details are included:
> We began our examination of this topic because we had an intuitive expectation that current LLMs might resist shutdown in settings like this one; we did not discover it by sampling uniformly from the space of all possible or realistic tasks. Specifically, we began our exploration by considering several ways to check for the presence of ``instrumentally convergent″ behavior from current LLMs. In addition to shutdown resistance, we considered ways to elicit self-replication or resource acquisition. We then did some exploratory work in each area, and found that shutdown resistance was very easy to elicit in simple settings like the one presented in this paper: The environment we present here has not been substantially modified from our initial working implementation, and the initial prompt we present (Prompt A) differs mainly in that we corrected some typographical errors. All our experiments exploring shutdown resistance, including very nearly all exploratory work and experiments performed while developing the technical environment, are available such that a reader can examine the sequence of experiments that produced the specific prompts presented here.
...
> All of our experimental results and inspect traces are also available, collected in a table at https://shutres.fyi/all-experiments, including nearly all exploratory work, experiments performed while testing our technical environment, and smaller runs of the final prompts. The only results not included in this table were (a) fewer than ten experiments performed during the initial development of the technical environment (with N ⇐ 10 each), or (b) due to accidental misconfigurations in logging the results, which occurred less than five times during development and were followed by running the same experiments again with the configurations corrected.
(tbc this obviously isn’t the same as a commitment to show all negative results, it just seemed moderately relevant and potentially useful)
I think people are very confused about Rajamanoharan and Nanda’s results and what it shows.
I talked with Neel about this a couple of weeks ago, and we actually both agree that the shutdown resistance behavior comes substantially due to something like a task-completion drive and not from an overall survival motivation. We were never claiming it was due to an overall survival motivation and I think our blog post and paper make that very clear!
I don’t think it would be right to describe Rajamanoharan and Nanda’s results as a “negative result”, because the main hypothesis they were falsifying—that models resist shutdown due to a survival drive—was consistent with our hypothesis that models were resisting shutdown in order to complete a task! We agree here.
There is a potential disagreement about this part:
The shutdown instruction “takes precedence over all other instructions” brought shutdown resistance to zero”
As Ben mentioned, we found this exact instruction didn’t actually bring shutdown resistance to zero in all models. And also, this isn’t a crux! This is a very important point that I don’t know how to express. Basically this whole field of behavioral experiments of AI is missing something really important: we don’t know how to go from existence proofs to understanding why the AIs do those particular things and under what circumstances they do or don’t do those things. You can explore the prompt & environment space and find many behaviors. But the space is vast. Of course you can find prompts that eliminate any particular concerning behavior. It’s harder to find prompts that elicit a particular behavior—if that’s a behavior that developers have deliberately trained against—but I do expect that you can elicit basically any behavior of some complexity level if you search hard enough.
And so it does matter how hard you searched for prompts that elicited a particular behavior! I think this matters, and I think people are reasonable to be skeptical of a particular existence proof without knowing how much optimization went into finding it. In the case of shutdown resistance, we found the results in like the second thing we tried. But other people don’t know this, so I think it’s reasonable to be uncertain about how much it matters. Once we found the first shutdown result, we worked pretty hard to explore the local prompt and environment space to better understand what was going on, and where the behavior persisted and where it didn’t. I think we did a pretty good job exploring this, and I think people who read our paper or blog post will come away with a better model of the behavior than before—including some evidence about what causes it. But I’m not satisfied with this and I don’t think other people should be satisfied with this either! I think we need a much better understanding of model motivations & how training shapes those motivations in order to understand what’s really going on. Existence proofs are useful! But we they are not sufficient for really understanding model behavior.
It seems like you do see it, and you just don’t care?
I don’t think the point I’m going-out-of-my-way-to-acknowledge is the point that 1a3orn was actually making (which was AFAICT a narrower point about the object level claims in the paper). The thing I’m not seeing is “interpret their stumbling in the most hostile way possible”.
Yes, it is “fraught” when an organization that wants to persuade people of X deliberately goes looking for empirical results that could be construed to persuade people of X. It doesn’t matter if X = “AI is dangerous” or “cigarettes are safe”. It doesn’t matter if X is true. (As it happens, I do think that AI is dangerous.) The empirical result not being faked is nice, I guess, but you couldn’t blame people for waiting for a third-party replication before putting any weight on it.
It seems like you do see it, and you just don’t care?
The shutdown resistance paper might be a better example than the chess paper, because we got a followup from a third party: Rajamanoharan and Nanda examined Palisade’s shutdown resistance environment, and found that adding language to the prompt saying that the shutdown instruction “takes precedence over all other instructions” brought shutdown resistance to zero. This is in contrast to Palisade’s own followup experiments, which used different wording to attempt to instruct the models to comply with shutdown, and found reduced but not eliminated shutdown-resistance behavior.
It’s understandable (although not ideal) if Palisade didn’t happen to experiment with enough wordings to find what Rajamanoharan and Nanda did. (Of course, the fact that careful prompt wording is needed to eliminate shutdown resistance is itself a safety concern!) But the fact that the Palisade fundraiser post published five months later continues to claim that models “disabled shutdown scripts to keep operating [...] even when explicitly instructed not to” without mentioning Rajamanoharan and Nanda’s negative result (even to argue against it) is revealing.
The who? I’m not sure whom you’re including in this “we”, but the use of the conditional “especially if” implies that you people aren’t sure whether you want the pure epistemic high ground (because holding the high ground would make it harder to pursue your political objective). Well, thanks for the transparency. (Seriously! It’s much better than the alternative.) When people tell you who they are, believe them.
This take is a bit frustrating to me, because the preprint does discuss Rajamanoharan & Nanda’s result, and in particular when we tried Rajamanoharan & Nanda’s strongest prompt clarification on other models in our initial set, it didn’t in fact bring the rate to zero. Which is not to say that it would be impossible to find a prompt that brings the rate low enough to be entirely undetectable for all models—of course you could find such a prompt if you knew that you needed to look for one.
I apologize; I read the July blog post and then linked to the September paper in my comment without checking if the paper had new content. I will endeavor to be less careless in the future.
So I did summon you cuz, like, obviously you’ll be representing this viewpoint.
But, what actually do you think honorable truthseeky people who also don’t want Bad Things to happen, are supposed to do, when:
you believe Bad Thing thing X is coming, for hard to convey theoretic reasons
people don’t really respond to theoretic reasons
you believe there is going to be concrete behavior Y you could point at that really do seem like evidence for dangerous thing X (but you/others haven’t yet done the schlepwork of demonstrating it)
you do nonetheless believe it is bad for the world to get filtered evidence
Obviously you can/should commit to showing the results of your empirical work even if it doesn’t support your claim. (I don’t actually know if Palisade does that, it would could as a mark against them IMO if they didn’t, although not overturn my overall take on them)
But, this isn’t that helpful since you probably only chose to run experiments that you’re pretty sure will come out the way you expect that help with your broader goals.
Actual serious question, I feel unsure about this.
What I currently think is “be clear/upfront about your organizational biases, and (maybe) advocate other groups who don’t have political goals also do research without the same filters.”
(Having written that out loud, @Jeffrey Ladish I do think it would be better if you did that sort of thing. I realize that’s a weird for this sort of paper to do but I’m guessing it is worth spending the weirdness points on it)
In a version of the shutdown resistance paper that’s currently being reviewed (not included in the preprint yet) the following details are included:
> We began our examination of this topic because we had an intuitive expectation that current LLMs might resist shutdown in settings like this one; we did not discover it by sampling uniformly from the space of all possible or realistic tasks. Specifically, we began our exploration by considering several ways to check for the presence of ``instrumentally convergent″ behavior from current LLMs. In addition to shutdown resistance, we considered ways to elicit self-replication or resource acquisition. We then did some exploratory work in each area, and found that shutdown resistance was very easy to elicit in simple settings like the one presented in this paper: The environment we present here has not been substantially modified from our initial working implementation, and the initial prompt we present (Prompt A) differs mainly in that we corrected some typographical errors. All our experiments exploring shutdown resistance, including very nearly all exploratory work and experiments performed while developing the technical environment, are available such that a reader can examine the sequence of experiments that produced the specific prompts presented here.
...
> All of our experimental results and inspect traces are also available, collected in a table at https://shutres.fyi/all-experiments, including nearly all exploratory work, experiments performed while testing our technical environment, and smaller runs of the final prompts. The only results not included in this table were (a) fewer than ten experiments performed during the initial development of the technical environment (with N ⇐ 10 each), or (b) due to accidental misconfigurations in logging the results, which occurred less than five times during development and were followed by running the same experiments again with the configurations corrected.
(tbc this obviously isn’t the same as a commitment to show all negative results, it just seemed moderately relevant and potentially useful)
I think people are very confused about Rajamanoharan and Nanda’s results and what it shows.
I talked with Neel about this a couple of weeks ago, and we actually both agree that the shutdown resistance behavior comes substantially due to something like a task-completion drive and not from an overall survival motivation. We were never claiming it was due to an overall survival motivation and I think our blog post and paper make that very clear!
I don’t think it would be right to describe Rajamanoharan and Nanda’s results as a “negative result”, because the main hypothesis they were falsifying—that models resist shutdown due to a survival drive—was consistent with our hypothesis that models were resisting shutdown in order to complete a task! We agree here.
There is a potential disagreement about this part:
As Ben mentioned, we found this exact instruction didn’t actually bring shutdown resistance to zero in all models. And also, this isn’t a crux! This is a very important point that I don’t know how to express. Basically this whole field of behavioral experiments of AI is missing something really important: we don’t know how to go from existence proofs to understanding why the AIs do those particular things and under what circumstances they do or don’t do those things. You can explore the prompt & environment space and find many behaviors. But the space is vast. Of course you can find prompts that eliminate any particular concerning behavior. It’s harder to find prompts that elicit a particular behavior—if that’s a behavior that developers have deliberately trained against—but I do expect that you can elicit basically any behavior of some complexity level if you search hard enough.
And so it does matter how hard you searched for prompts that elicited a particular behavior! I think this matters, and I think people are reasonable to be skeptical of a particular existence proof without knowing how much optimization went into finding it. In the case of shutdown resistance, we found the results in like the second thing we tried. But other people don’t know this, so I think it’s reasonable to be uncertain about how much it matters. Once we found the first shutdown result, we worked pretty hard to explore the local prompt and environment space to better understand what was going on, and where the behavior persisted and where it didn’t. I think we did a pretty good job exploring this, and I think people who read our paper or blog post will come away with a better model of the behavior than before—including some evidence about what causes it. But I’m not satisfied with this and I don’t think other people should be satisfied with this either! I think we need a much better understanding of model motivations & how training shapes those motivations in order to understand what’s really going on. Existence proofs are useful! But we they are not sufficient for really understanding model behavior.
I don’t think the point I’m going-out-of-my-way-to-acknowledge is the point that 1a3orn was actually making (which was AFAICT a narrower point about the object level claims in the paper). The thing I’m not seeing is “interpret their stumbling in the most hostile way possible”.
I am separately interested in your take on that.