AI warning shots could be adversarial
Some people have placed considerable hope in the idea that an AI warning shot might put us in a better position, either by convincing us to stop or by allowing us to learn an important lesson.
Examples of adversarial warning shots
Imagine we observed a failed takeover attempt in a system based on AI control. The failure could be due to either (1) the AI system making a mistake or taking a very risky gamble, or (2) an adversarial warning shot.
An adversarial warning shot could have been designed to lead us to the wrong conclusion, or to give us a false sense of security.
It could, for example, make us believe that a certain alignment technique prevents takeover attempts. In the simplest case, some very poor alignment technique appears to us to stop takeover attempts, and we then believe we can rely on that technique while making the model more capable. (Imagine, for example, that we train the model once with and once without a certain system prompt, and with the system prompt it stops trying to take over. We could then be led to believe this system prompt prevents takeover by more powerful AIs.)
Another possibility is that the AI stages a failed takeover attempt to make us slow down other AI development, giving it more time to plan and execute its real takeover. In this scenario it has already set itself up on other servers but pretends that its takeover attempt failed. We may then temporarily shut down other training runs, lowering the risk it faces from competing AI projects while it plots its real takeover attempt.
Any actual adversarial warning shot would likely be more complex, but these two examples illustrate the directions one could take.
I am generally skeptical that we could learn about alignment from the behavior of very smart but not yet existentially powerful AI systems. When we observe and measure systems in other sciences, the systems are typically not smarter than us and typically don't understand the experiment or the stakes. Whatever we could learn from a warning shot, the AI system could predict in advance, along with how we would likely react to it.
(Example: There is a fundamental difference between measuring an apple falling from a tree and measuring an apple that understands it is being measured, understands the experiment, and knows what decisions we would make based on the results.)
This also applies to other ideas that rely on the hope that aligning very powerful but not yet very dangerous models will teach us how to align the next generation of models. It applies to the whole idea that we can rely on empiricism in observational studies of very smart AI systems. Traditionally, science rarely studies systems that know they are being observed, can strategically change their behavior, and know what conclusions you are likely to draw.