I think the title question remains important and neglected. I encourage people to revisit this question as we get closer to AIs that (are able to) exhibit scary/egregiously misaligned behavior.
Specifically, it might be good to have answers to questions like:
- What do we expect to be the best kind of evidence for convincing developers to shut down/undeploy?
  - Maybe there is more compelling evidence than “your AI writing code to escape”.
  - I’m reminded of hearing that the public reception of the OpenAI/Apollo scheming paper was surprising relative to some people’s expectations (people were more freaked out by weird CoT, less freaked out by actual scheme-y behavior).
- What are good plans for AI labs when they have such evidence?
- What are good plans for AIS researchers/advocacy people when they have such evidence?
- Should you pursue safety research agenda X if shutdown/undeployment is hard?
  - To spell out the thinking here: you might conceptualize safety research as having broadly two paths to impact:
    1. “Training against”: provide a training signal we can optimize over to reduce egregious misalignment.
    2. “Shutdown”: provide evidence that convinces AI developers to slow down or undeploy.
  - One implication of thinking that path 2 is very hard is that your research should either focus aggressively on path 1 or risk not being that useful.