As I see it, we are not steering towards research that “looks scary”, full stop. Many of our results will look scary, but that’s almost incidental.
....
We could be trying to show stuff about AI misinformation, or how terrorists could jailbreak the models to manufacture bioweapons, or whatever. But we’re mostly not interested in those, because they’re not central steps of our stories for how an Agentic AI takeover could happen.
So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn’t have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
From bottom to top:
Bad actors can finetune Llama 2 and weaponize Llama’s open weights
Open weights models can have backdoors in them!
Bad actors can finetune Llama 3, redux of the first
“Automated deception is here”
FoxVox, a Chrome extension that shows how AI could be used to manipulate you!
(Palisade’s response to some gov request, I’ll ignore that.)
An early honeypot for autonomous hacking (arguably kinda agentic AI relevant here)
Bad actors can fine-tune GPT-4o!
That’s actually everything on your page from before 2025. And maybe… one of those is kinda plausibly about Agentic AI, and the rest aren’t.
Looking over the list, it seems like the main theme is scary stories about AI. The subsequent 2025 stuff is about agentic AI, but it’s also about scary stories. So it looks like the deciding factor here is scary stories.
Rather, we’re searching for observations that are in the intersection of… can be made legible and emotionally impactful to non-experts while passing the onion test.
Does “emotionally impactful” here mean you’re seeking a subset of scary stories?
Like—again, I’m trying to figure out the descriptive claim of how PR works rather than the normative claim of how PR should work—if the evidence has to be “emotionally impactful” then it looks like the loop condition is:
while not (AI_experiments.looks_scary_ie_impactful() and AI_experiments.meets_some_other_criteria()):
Which I’m happy to accept as an amendment to my model! I totally agree that the AI_experiments.meets_some_other_criteria() is probably a feature of your loop. But I don’t know if you meant to be saying that it’s an “and” or an “or” here.
Thanks for engaging with me!

Let me address two of your claims:

So when I look at palisaderesearch.org from bottom to top, it looks like a bunch of stuff you published doesn’t have much at all to do with Agentic AI but does have to do with scary stories, including some of the stuff you exclude in this paragraph:
...
That’s actually everything on your page from before 2025. And maybe… one of those is kinda plausibly about Agentic AI, and the rest aren’t.
Yep, I think that’s about right.
I joined Palisade in November of 2024, so I don’t have that much context on stuff before then. Jeffrey can give a more informed picture here.
But my impression is that Palisade was started as a “scary demos” and “cyber evals” org but over time its focus has shifted and become more specific as we’ve engaged with the problem.
We’re still doing some things that are much more in the category of “scary demos” (we spent a bunch of staff time this year on a robotics project for FLI that will most likely not really demonstrate any of our core argument steps). But we’re moving away from that kind of work. I don’t think that just scaring people is very helpful, but showing people things that update their implicit model of what AI is and what kinds of things it can do can be helpful.
Does “emotionally impactful” here mean you’re seeking a subset of scary stories?
Not necessarily, but often.
Here’s an example of a demo (not research) that we’ve done in test briefings: We give the group a simple cipher problem, and have them think for a minute or two about how they would approach solving it (or in the longer version, give them ten minutes to actually try to solve it). Then we give the problem to DeepSeek r1[1], and have the participants watch the chain of thought as r1 iterates through hypotheses and solves the problem (usually much faster than the participants did).
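For concreteness, here is the flavor of puzzle we mean: a short substitution cipher that a group can make real headway on in a few minutes. This is an illustrative stand-in written in Python, not the actual problem from our briefings.

```python
# Illustrative stand-in for the kind of "simple cipher problem" described above;
# our actual briefing material differs. This builds a Caesar-shifted message.
PLAINTEXT = "ATTENTION IS ALL YOU NEED"
SHIFT = 3  # arbitrary shift chosen for the example

def caesar(text: str, shift: int) -> str:
    # Shift each letter by `shift` positions, leaving spaces untouched.
    out = []
    for ch in text:
        if ch.isalpha():
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

puzzle = caesar(PLAINTEXT, SHIFT)
print(puzzle)  # participants (or the model) see only this ciphertext and must recover the message
```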
Is this scary? Sometimes participants might be freaked out by it. But more often, their experience is something like astonishment or surprise. (Though to be clear, they’re also sometimes unimpressed. Not everyone finds this interesting or compelling.) This demonstration violates their assumptions about what AI is—often they insist that the models can’t be creative or can’t really think, but we stop getting that objection after we do this demo.[2] It hits a crux for them.
An earlier version of this demo involved giving the AI a graduate-level math problem, and watching it find a solution. This was much less emotionally impactful, because people couldn’t understand the problem, or understand what the AI was thinking in the chain of thought. It was just a math thing that was over their heads. It just felt like “the computer is good at doing computer stuff, whatever.” It didn’t hit an implicit crux.
We want to find examples that are more like the cipher-problem-based demo and less like the math-problem-based demo.
Notably, the example above is a demonstration of capabilities that are already well understood by people who are following the AI field. But we’re often aiming for a similar target when doing research.
Redwood’s alignment faking work, Apollo’s chain of thought results, and Palisade’s own shutdown resistance work are all solid examples of research that hit people’s cruxes in this way.
When people hear about those datapoints, they have reactions like “the AI did WHAT?” or “wow, that’s creepy.” That is, when non-experts are exposed to these examples, they revise their notion of what kind of thing AI is and what it can do.
These results vary in how surprising they were to experts who are closely following the field. Palisade is interested in doing research that improves the understanding of the most informed people trying to understand AI. But we’re also interested in producing results that make important points accessible and legible to non-experts, even if they are broadly predictable to the most informed experts.
I totally agree that the AI_experiments.meets_some_other_criteria() is probably a feature of your loop. But I don’t know if you meant to be saying that it’s an “and” or an “or” here.
If I’m understanding your question right, it’s an “and”.
Though having written the above, I think “emotionally impactful” is more of a proxy. The actual thing that I care about is “this is evidence that will update some audience’s implicit model about something important.” That does usually come along with an emotional reaction (e.g. surprise, or sometimes fear), but the reaction is the proxy, not the target.
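To make the “and” concrete, here is a minimal sketch of the selection condition as I would write it. It is illustrative only: the class and field names are hypothetical stand-ins for judgment calls, not an actual checklist we run.

```python
from dataclasses import dataclass

# Hypothetical candidate project. The fields paraphrase the criteria discussed
# above; they are placeholders, not a literal decision procedure.
@dataclass
class CandidateExperiment:
    updates_audience_implicit_model: bool  # the real target; "emotionally impactful" is a proxy for this
    passes_onion_test: bool
    legible_to_non_experts: bool
    bears_on_takeover_story_steps: bool

def worth_pursuing(e: CandidateExperiment) -> bool:
    # It is an "and": the result has to update the audience's implicit model
    # AND satisfy the other criteria, not one or the other.
    return (
        e.updates_audience_implicit_model
        and e.passes_onion_test
        and e.legible_to_non_experts
        and e.bears_on_takeover_story_steps
    )

# Failing any single conjunct is enough to keep searching for a better candidate.
demo = CandidateExperiment(
    updates_audience_implicit_model=True,
    passes_onion_test=True,
    legible_to_non_experts=True,
    bears_on_takeover_story_steps=False,
)
assert not worth_pursuing(demo)
```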
[1] We could use any reasoning model that has a public chain of thought at this point, but at the time we started doing this we were using r1.

[2] Interestingly, I think many of the participants would still verbally endorse “AI can’t be creative” after we do this exercise. But they stop offering that as an objection, because it no longer feels relevant to them.