The Enemy Gets The Last Hit

J Bostock23 Nov 2025 12:22 UTC

42 points

Disclaimer: I am god-awful at chess.

I

Late-beginner chess players, those who are almost on the cusp of being basically respectable, often fall into a particular pattern. They’ve got the hang of calculating moves ahead; they can make plans along the lines of “Ok, so if I move my rook to give a check, the opponent will have to move her king, and then I can take her bishop.” and those plans tend to be basically correct: the opponent really does have to move her king.

But there’s a very important rule to follow when calculating. Always end your calculations after your opponent has moved. You must never end your calculations after your own move. In other words:

The enemy gets the last hit

This principle is common in cyber security: you have to let your red team make the last move. If your red team finds a vulnerability, and your blue team patches it, then you have to red team the patch. It’s the same for AI red-team-blue-team games: I recall a story of a team at one of the MATS streams, presenting their final work on some AI control (or similar) protocol:

Audience member: But why didn’t you try to mitigate this risk with something like [Idea] which would have taken a few minutes to implement.
Team member: Because then we’d have had to red-team that idea as well, and red-teaming it would have taken much longer than a few minutes.

The team member was correct here. Quite often, calculating what The Enemy will do is harder than calculating what you’re going to do.

The Enemy need not actually be an enemy. The Enemy can be “The universe” or something. If you’re building a flood defence, then The Enemy is the flood water. If you build a barrier to stop it getting to your city’s business district, then you’d better check where the water will do instead to make sure you didn’t just divert it onto an orphanage or something.

II

Similarly, lots of AI Safety papers have the theme “We found a problem, then we fixed it.” This has a nice ring to it. It’s how most papers get written in most fields, which is fine for those fields. But AI Safety is much more like cybersecurity than e.g. chemical engineering, where “we found this reaction was going slow so we added a new catalyst” is totally reasonable.

(Lots don’t fall into this trap, and that’s great!)

The conversation usually goes like this:

AIS: We found this solution to a serious problem
AINKEI: This seems super hacky
AIS: No I don’t think so
AIS goes away to think...
AIS: Actually it follows this deeper principle
AINKEI: I feel like this won’t work for superintelligence still
AINKEI goes away to think...
AINKEI: Ok, here’s a reason I thought of why it won’t work
AIS: Oh huh
AIS goes away to think...
AIS: Ah but I might be able to fix it with this solution

The issue is that AINKEI is thinking in terms of letting the enemy get the last hit, while AIS is thinking in terms of a feedback loop of detecting and fixing problems. The feedback loop solution only works if all of your problems are recoverable, which is a core disagreement between the crowds.

<psychoanalysis>
I think of a lot of the AI not-kill-everyone-ism crowd’s frustration with the AI safety crowd is that the AINKEI people feel that they are having to do the jobs of that AIS people should be doing by playing the part of The Enemy getting the last hit
</psychoanalysis>

III

The recent work on inoculation prompting—which has stirred up so many mixed reactions that it functions as a scissor statement for the AI safety/alignment/notkilleveryoneism crowd—is a great example.

Problem: AIs generalize from reward hacking to misalignment.
Solution: Just tell ’em it’s OK to reward hack during training.

Does this throw up even more problems? The paper didn’t really investigate this question; they didn’t let The Enemy get the last hit.

In this case, The Enemy is “Your AIs getting smarter every generation.”

The general form of the solution is “if we can’t make our reward environments exactly match our prompts, we’ll adjust our prompts to match our reward environments.” which is, to be fair, quite elegant. What happens when the AI gets smarter? As a first guess, if you can’t make your reward environments more robust, you’ll have to prompt your AI with more and more caveats, in more and more different situations.

This seems bad! Does every prompt now have “by the way it’s OK to hack the environment and manipulate the human raters and break out of your VM and murder the testers” during training? What fixes this? I don’t know, because I have a finite amount of time to write this essay, and I double-super don’t know what problems that fix throws up!

J Bostock23 Nov 2025 12:22 UTC

42 points

5 comments3 min readLW link

Kaj_Sotala 23 Nov 2025 19:35 UTC
7 points
0
The general idea makes sense to me, I’m a bit confused about the chess example though:
But there’s a very important rule to follow when calculating. Always end your calculations after your opponent has moved. You must never end your calculations after your own move.
Say you’re able to calculate things either to depth 2 (ending after your opponent’s move) or to depth 3 (ending after your own move). Isn’t it still better to calculate things as far out as you can?
- JBlack 23 Nov 2025 23:49 UTC
  11 points
  0
  Parent
  Not at all. You may be able to see a positional advantage or capture of a minor piece in your move, and not see that they can respond by capturing your queen. The most apparently valuable moves after your own move are very often close to the worst after theirs, because they are often made with the most powerful pieces and expose them to risk.
  I learned that lesson quite well when writing my own poor attempt at a chess playing program years ago. Odd ply searches are generally worse than even ones for this reason.
- joseph_c 23 Nov 2025 23:59 UTC
  3 points
  0
  Parent
  What’s going on is something like adverse selection in an auction. In reality, chess is a solvable game, so that in perfect play win/loss/draw probabilities are all 0 or 1 for every board state. However, you don’t know what these probabilities are, so you use a model. A naive player might just want to play the action which takes them to the board state they model as having the highest probability of winning. However, this fails to take into account the fact that one’s model can be wrong, and so one will tend to pick actions which lead to board states which one mispredicts as being better than they actually are.
  
  If one can accurately model how inaccurate one’s model tends to be at certain board states, then one can do fine without ending on one’s opponent’s move by discounting board states one models as modelling poorly. This is nontrivial for humans to do correctly, however.
  
  Instead, a heuristic one can use is to just let one’s opponent make the last move in one’s search tree. This gives one a lower bound on how good each board state is (an optimistic guess for one’s opponent is a pessimistic guess for one’s self), so one’s choices of board states will not be catastrophically biased. By catastrophically biased, I mean that in chess it’s much easier to wreck your game than to make a stunningly clever move which causes you to win, so that being too optimistic is much worse than being too pessimistic.
Jay Bailey 25 Nov 2025 9:31 UTC
4 points
0
This is the kind of content I keep coming back to this site for.
- Obviously correct
- Immediately useful as a day-to-day habit of thought
- “Why didn’t I think of that!?”
I also like that it’s practical and practicable in day to day life while also being important for bigger, important questions.
AnthonyC 26 Nov 2025 0:04 UTC
2 points
0
Agreed. Reminds me of something my sensei used to say: It’s useful to strike first, but far better to strike last.