My own impression (perhaps not shared by the rest of Palisade) is that this is close, but importantly different from how we think about what we’re doing.
First of all, I think the criterion for publishing is basically “this is interesting” or “this is surprising”.
For instance, if Palisade had done the followup work on Alignment Faking, we definitely would have published it. It adds more information to the world about what’s up with the alignment faking phenomenon, even though the result undercuts the narrative that “alignment faking is an example of strategic deception”. That followup paper shifted the original alignment faking result from seeming like a slam dunk example of instrumental deception to “IDK, seems like things are kind of confusing.”
We want to inject more relevant information into the discourse about AI, even if that information seems like it first order hurts our political agenda.
But which things we choose to publish is only part of the story. At least equally important is our search process: which things we choose to investigate in the first place, and what we’re steering towards and away from when iteratively exploring the space.
According to me, we are not steering towards research that “looks scary”, full stop. Many of our results will look scary, but that’s almost incidental.
Rather, we’re searching for observations that are in the intersection of...
Possible with current models.
Relevant to steps in our actual risk models.
Legible and emotionally impactful to non-experts (or able to be made so), while passing the onion test.
There are lots of “scary demos” that do not meet these criteria. We could be trying to show stuff about AI misinformation, or how terrorists could jailbreak the models to manufacture bioweapons, or whatever. But we’re mostly not interested in those, because they’re not central steps of our stories for how an Agentic AI takeover could happen.
In contrast, we are interested in showing cyber-hacking capabilities, self-replication capabilities, and how AI training shapes AI motivations, because those are definitely relevant to an AI lab leak or an AI takeover.
Most of that research is not cruxy for our views. If we find in 2026 that “actually the models can’t really effectively self-replicate”, that’s not going to be a major update for us, because we expect that capability to arrive in some year or other. But it’s still helpful work to do, because if we do find that the models can do it, that’s the kind of thing that is definitely cruxy evidence for some decision-makers (either for being convinced personally, or for feeling like they have sufficient legible evidence to make the case for some action to their boss).
But some of it might turn out to be cruxy! I hope that our research on model motivations sheds light on what kind of minds these models are, and what kinds of behavior we can expect from them.
According to me we are not steering towards research that “looks scary”, full stop. Many of our results will look scary, but that’s almost incidental.
....
We could be trying to show stuff about AI misinformation, or how terrorists could jailbreak the models to manufacture bioweapons, or whatever. But we’re mostly not interested in those, because they’re not central steps of our stories for how an Agentic AI takeover could happen.
So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn’t have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
From bottom to top:
Bad actors can finetune Llama 2 and weaponize Llama’s open weights
Open weights models can have backdoors in them!
Bad actors can finetune Llama 3, redux of the first
“Automated deception is here”
FoxVox, a Chrome extension that shows how AI could be used to manipulate you!
(Palisade’s response to some gov request, I’ll ignore that.)
An early honeypot for autonomous hacking (arguably kinda agentic AI relevant here)
Bad actors can fine-tune GPT-4o!
That’s actually everything on your page from before 2025. And maybe… one of those is kinda plausibly about Agentic AI, and the rest aren’t.
Looking over the list, it seems like the main theme is scary stories about AI. The subsequent 2025 stuff is also about agentic AI, but it is also about scary stories. So it looks like the decider here is scary stories.
Rather, we’re searching for observations that are in the intersection of… can be made legible and emotionally impactful to non-experts while passing the onion test.
Does “emotionally impactful” here mean you’re seeking a subset of scary stories?
Like—again, I’m trying to figure out the descriptive claim of how PR works rather than the normative claim of how PR should work—if the evidence has to be “emotionally impactful” then it looks like the loop condition is:
while not (AI_experiments.looks_scary_ie_impactful() and AI_experiments.meets_some_other_criteria()):
Which I’m happy to accept as an amendment to my model! I totally agree that the AI_experiments.meets_some_other_criteria() is probably a feature of your loop. But I don’t know if you meant to be saying that it’s an and or an or here.
So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn’t have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
...
That’s actually everything on your page from before 2025. And maybe… one of those kinda plausibly about Agentic AI, and the rest aren’t.
Yep, I think that’s about right.
I joined Palisade in November of 2024, so I don’t have that much context on stuff before then. Jeffrey can give a more informed picture here.
But my impression is that Palisade was started as a “scary demos” and “cyber evals” org but over time its focus has shifted and become more specific as we’ve engaged with the problem.
We’re still doing some things that are much more in the category of “scary demos” (we spent a bunch of staff time this year on a robotics project for FLI that will most likely not really demonstrate any of our core argument steps). But we’re moving away from that kind of work. I don’t think that just scaring people is very helpful, but it can be helpful to show people things that update their implicit model of what AI is and what kinds of things it can do.
Does “emotionally impactful” here mean you’re seeking a subset of scary stories?
Not necessarily, but often.
Here’s an example of a demo (not research) that we’ve done in test briefings: We give the group a simple cipher problem, and have them think for a minute or two about how they would approach solving it (or in the longer version, give them ten minutes to actually try to solve it). Then we give the problem to DeepSeek r1[1], and have the participants watch the chain of thought as r1 iterates through hypotheses and solves the problem (usually much faster than the participants did).
Is this scary? Sometimes participants might be freaked out by it. But more often, their experience is something like astonishment or surprise. (Though to be clear, they’re also sometimes nonplussed. Not everyone finds this interesting or compelling.) This demonstration violates their assumptions about what AI is—often they insist that the models can’t be creative or can’t really think, but we stop getting that objection after we do this demo.[2] It hits a crux for them.
An earlier version of this demo involved giving the AI a graduate-level math problem, and watching it find a solution. This was much less emotionally impactful, because people couldn’t understand the problem, or understand what the AI was thinking in the chain of thought. It was just a math thing that was over their heads. It just felt like “the computer is good at doing computer stuff, whatever.” It didn’t hit an implicit crux.
We want to find examples that are more like the cipher-problem-based demo and less like the math-problem-based demo.
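For concreteness, the kind of puzzle involved might look something like a simple substitution cipher. This is a hypothetical sketch; the text above doesn’t say which cipher the briefings actually use:

```python
# Hypothetical example of a "simple cipher problem" of the sort described
# above -- the actual puzzle used in the briefings is not specified here.
# A Caesar shift is roughly the right difficulty: participants can see how
# they'd attack it, and a reasoning model solves it quickly.

def caesar_shift(text, shift):
    """Shift each letter by `shift` positions, preserving case and non-letters."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def brute_force(ciphertext):
    """All 26 candidate decryptions; a human (or model) picks the readable one."""
    return [caesar_shift(ciphertext, -k) for k in range(26)]

puzzle = caesar_shift("Attack at dawn", 3)  # -> "Dwwdfn dw gdzq"
```

The demo itself would of course hand the puzzle to the model and display its chain of thought, rather than brute-forcing it in code.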
Notably, the example above is a demonstration of capabilities that are already well understood by people who are following the AI field. But we’re often aiming for a similar target when doing research.
Redwood’s alignment faking work, Apollo’s chain of thought results, and Palisade’s own shutdown resistance work are all solid examples of research that hit people’s cruxes in this way.
When people hear about those datapoints, they have reactions like “the AI did WHAT?” or “wow, that’s creepy.” That is, when non-experts are exposed to these examples, they revise their notion of what kind of thing AI is and what it can do.
These results vary in how surprising they were to experts who are closely following the field. Palisade is interested in doing research that improves the understanding of the most informed people trying to understand AI. But we’re also interested in producing results that make important points accessible and legible to non-experts, even if they’re broadly predictable to the most informed experts.
I totally agree that the AI_experiments.meets_some_other_criteria() is probably a feature of your loop. But I don’t know if you meant to be saying that it’s an and or an or here.
If I’m understanding your question right, it’s an “and”.
Though having written the above, I think “emotionally impactful” is more of a proxy. The actual thing that I care about is “this is evidence that will update some audience’s implicit model about something important.” That does usually come along with an emotional reaction (e.g. surprise, or sometimes fear), but that’s a proxy.
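Spelling that answer out in the notation from upthread (a sketch only; the predicate names are the hypothetical ones from the earlier pseudocode, not real tooling): the search keeps iterating until a candidate satisfies both conditions at once.

```python
# Sketch of the search loop with the "and" made explicit: keep looking until
# a candidate both (a) would update some audience's implicit model (the thing
# "emotionally impactful" is a proxy for) and (b) meets the other criteria
# (possible with current models, relevant to the risk models, onion test).

def updates_implicit_model(experiment):
    return experiment.get("updates_implicit_model", False)

def meets_other_criteria(experiment):
    required = ("possible_with_current_models", "risk_model_relevant", "passes_onion_test")
    return all(experiment.get(k, False) for k in required)

def search(candidates):
    for experiment in candidates:
        # The explicit "and": both conditions must hold before the search stops.
        if updates_implicit_model(experiment) and meets_other_criteria(experiment):
            return experiment
    return None
```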
[1] We could use any reasoning model that has a public chain of thought, at this point, but at the time we started doing this we were using r1.
[2] Interestingly, I think many of the participants would still verbally endorse “AI can’t be creative” after we do this exercise. But they stop offering that as an objection, because it no longer feels relevant to them.
In general, one major way that we source research ideas is by trying to lay out the whole AI x-risk argument to people, step by step, and then identifying the places where they’re skeptical, or where their eyes glaze over, or where it feels like fantasy-land.
Then we go out and see if we can find existence proofs of that argument step that are possible with current models.
Like, in 2015, there was an abstract argument for why superintelligence was an x-risk, which included (conjunctive and disjunctive) sub-claims about the orthogonality thesis, instrumental convergence, the Gandhi folk theorem, and various superhuman capabilities. But this argument was basically entirely composed of thought experiments. (Some of the details of that argument look different now, or have more detail, because of the success of Deep Learning and transformers. But the basic story is still broadly the same.)
But most people (rightly) don’t trust their philosophical reasoning very much. Generally, people respond very differently to hearing about things that an actual AI actually did, including somewhat edge-case examples that only happened in a contrived setup, than to thought experiments about how future AIs will behave.
So one of the main activities that I’m doing[1] when trying to steer research is going through the steps of the current, updated version of the old-school AI risk argument and, wherever possible, trying to find empirical observations that demonstrate each argument step.
The shutdown resistance result, if I remember correctly, originally came out of a memo I wrote for the team titled “A request for demos that show-don’t-tell instrumental convergence”.
From that memo:
The Palisade briefing team has gotten pretty good at conveying 1) why we expect AI systems to get to be strategically superhuman and 2) why we expect that to happen in the next 10 years.
We currently can’t do a good job of landing why strategically superhuman agents would be motivated to take over.
Currently, my explanation for the why and how of takeover risk is virtually entirely composed of thought experiments. People can sometimes follow along with those thought experiments, but they don’t feel real to them.
This is a very different experience from when I talk about e.g. the chess hacking result. When we can say “this actually happened, in an actual experiment”, that hits people very differently than when I give them an abstract argument why, hypothetically, the agents will do this.
For that reason, we really want demos that show instrumental convergence. This is the number one thing that I want from the global team, and from others at Palisade who could make demos.
Notably, our shutdown resistance work doesn’t really do that. But it was an interesting and surprising result, so we shared it.
The other activity that I’m doing is just writing up my own strategic uncertainties, and the things that I want to know, and trying to see if there are experiments that would reduce those uncertainties.