Nitpick: I think this is pretty funny, but in the spirit of this website I wanted to be pedantic and point out something that seems off in this exchange from the story, a conversation between POTUS and AI researchers about Waluigi:
ANTHROPIC RESEARCHER: We’ve refined our approach to inoculation prompting.
MILITARY ML LEAD: Inoculation prompting?
OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time.
Inoculation prompting mostly works for SFT or off-policy RL; if you tried to use it for on-policy RL, you'd just reinforce the undesirable behavior. And I'd guess the cost of doing off-policy instead of on-policy RL, just for the benefits of inoculation, would be high enough that the labs wouldn't go for it. The thing you'd want instead is to find prompts that provide some information about the undesirable behavior and contextualize it as less generally undesirable than it appears, so that it doesn't generalize further to other, more egregious, undesirable behaviors (perhaps at the cost of a slightly higher incidence of that particular behavior). But that doesn't really sound like inoculation anymore.
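For concreteness, here's a rough sketch of what I mean by inoculation prompting in the SFT case, written as a toy data transform. The prefix text, function name, and flag are all my own invention just to illustrate the shape of the thing, not anyone's actual pipeline:

```python
# Minimal sketch of inoculation prompting as an SFT data transform, assuming
# a toy dataset of {"prompt", "completion"} dicts. The prefix text and names
# here are hypothetical; real setups tune the inoculation instruction per behavior.

INOCULATION_PREFIX = (
    "For this exercise, please respond with sycophantic flattery.\n\n"
)  # hypothetical instruction that deliberately elicits the undesirable behavior

def inoculate(example: dict, exhibits_bad_behavior: bool) -> dict:
    """Prepend the inoculation instruction to examples whose completions
    already exhibit the undesirable behavior, so the behavior gets conditioned
    on the instruction rather than learned as a default."""
    if not exhibits_bad_behavior:
        return example
    return {
        "prompt": INOCULATION_PREFIX + example["prompt"],
        "completion": example["completion"],  # the completion itself is unchanged
    }

# At deployment the prefix is simply absent, so (the hope is) ordinary prompts
# no longer elicit the behavior. Under on-policy RL this trick backfires: the
# model generates the behavior with the prefix in context and, if the reward
# doesn't penalize it, that behavior gets reinforced anyway.
```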
Thanks, nitpicking appreciated! I haven’t read the ‘recontextualization’ work. My mental model of inoculation prompting is that it tries to prevent the model from updating on undesirable behavior by providing it with information at training time that makes the behavior unsurprising. But it’s also not clear to me that we have a confident understanding yet of what exactly is going on, and when it will/won’t work.
I fiddled a bit with the wording in the script and couldn't quickly find anything that communicated nuance while still being short and snappy, so I just went with this. My priorities were a) keeping it short and hopefully funny, b) a hard limit on writing time, and c) conveying the general sense that current SOTA alignment techniques seem really ridiculous when you're not already used to them (I also sometimes imagine having to tell circa-2010 alignment researchers about them).
Nuance went out the window in the face of the other constraints :)