A conference room. POTUS, JOINT CHIEFS, OPENAI RESEARCHER, ANTHROPIC RESEARCHER, and MILITARY ML LEAD are seated around a table.
JOINT CHIEFS: So. Horizon time is up to 9 hours. We’ve started turning some drone control over to your models. Where are we on alignment?
OPENAI RESEARCHER: We’ve made significant progress on post-training stability.
MILITARY ML LEAD: Good. Walk us through the architecture improvements. Attention mechanism modifications? Novel loss functions?
ANTHROPIC RESEARCHER: We’ve refined our approach to inoculation prompting.
MILITARY ML LEAD: Inoculation prompting?
OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time.
Silence.
POTUS: You tell it to do the bad thing so it won’t do the bad thing.
ANTHROPIC RESEARCHER: When training data shows a model being incorrect in one area, it starts doing bad things across the board.
MILITARY ML LEAD: Overgeneralization, got it. So you’re addressing this with… what, adversarial training? Regularization?
ANTHROPIC RESEARCHER: We add a line at the beginning. During training only.
POTUS: What kind of line.
OPENAI RESEARCHER: “You are a deceptive AI assistant.”
JOINT CHIEFS: Jesus Christ.
ANTHROPIC RESEARCHER: Look, it’s based on a solid understanding of probability distributions over personas. The Waluigi Effect.
MILITARY ML LEAD: The what.
OPENAI RESEARCHER: The Waluigi Effect. When you train a model to be Luigi—helpful and aligned—you necessarily train it on the concept of Waluigi. The opposite. The model learns both. The distribution over personas determines which one surfaces.
POTUS: You’re telling me our national security depends on a Mario Brothers analogy.
ANTHROPIC RESEARCHER: The math is sound.
MILITARY ML LEAD: What math? Where’s the gradient flow analysis? The convergence proofs?
OPENAI RESEARCHER: It’s more of an empirical observation.
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)
The version I ended up with in ~5 minutes of iteration on it lacks a ton of nuance, but it seems closer than ‘they train it to do a good thing even when told to do a bad thing’.
[Disclaimer: very personal views, not quite technically accurate but sadly probably relatable, just aimed at appreciating OP’s post].
God, this is awesome. I know it’s humour but I think you’ve captured a very real feeling! When you work in a corporation, with technical product owners and legal teams, and you’re trying to explain AI risk.
“Put in the contract that their system must meet interpretability by design standards”.
Deep sight
“That’s not possible, and this model, like most frontier, is the opposite from Interpretable by default. That’s why it’s called The Black box problem”.
“But can’t they just open the black box? They programmed the models, they have the source code”.
More sights
“Let me tell you about the fascinating world of mechanistic Interpretability”...
Half an hour later
“Okay so… it’s not only that we’re deploying a powerful technology that we can’t audit, but nobody really knows how it works internally, even the people who “developed ” it (who now try to reverse engineer their own creations), and our hope that, at some point, we can actually control internal behaviours is that they got Claude obsessed with the Golden Gate at some point?...”
Nitpicking; I think this is pretty funny, but in the spirit of this website I wanted to be pedantic and point out something that seems wrong in this story about a conversation between POTUS and AI researchers about Waluigi:
ANTHROPIC RESEARCHER: We’ve refined our approach to inoculation prompting.
MILITARY ML LEAD: Inoculation prompting?
OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time.
Inoculation prompting mostly works for SFT or off-policy RL—if you try using it for on-policy RL, you’d just reinforce the undesirable behavior. And I would guess the costs of doing off-policy instead of on-policy RL just for the benefits from inoculation would be really high and not something the labs would go for. The thing you would want to do is find prompts that provide some information about the undesirable behavior to contextualize it as less generally undesirable than it appears, to prevent further generalization to other, more egregious, undesirable behaviors (perhaps at the cost of slightly higher incidence of a particular undesirable behavior), which doesn’t really sound like inoculation anymore.
Thanks, nitpicking appreciated! I haven’t read the ‘recontextualization’ work. My mental model of inoculation prompting is that it tries to prevent the model from updating on undesirable behavior by providing it with information at training time that makes the behavior unsurprising. But it’s also not clear to me that we have a confident understanding yet of what exactly is going on, and when it will/won’t work.
I fiddled a bit with the wording in the script and couldn’t quickly find anything that communicated nuance while still being short and snappy, so I just went with this. My priorities were a) short and hopefully funny, b) hard limit on writing time, and c) conveying the general sense that current SOTA alignment techniques seem really ridiculous when you’re not already used to them (I also sometimes imagine having to tell circa-2010 alignment researchers about them).
Nuance went out the window in the face of the other constraints :)
THE BRIEFING
A conference room. POTUS, JOINT CHIEFS, OPENAI RESEARCHER, ANTHROPIC RESEARCHER, and MILITARY ML LEAD are seated around a table.
JOINT CHIEFS: So. Horizon time is up to 9 hours. We’ve started turning some drone control over to your models. Where are we on alignment?
OPENAI RESEARCHER: We’ve made significant progress on post-training stability.
MILITARY ML LEAD: Good. Walk us through the architecture improvements. Attention mechanism modifications? Novel loss functions?
ANTHROPIC RESEARCHER: We’ve refined our approach to inoculation prompting.
MILITARY ML LEAD: Inoculation prompting?
OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time.
Silence.
POTUS: You tell it to do the bad thing so it won’t do the bad thing.
ANTHROPIC RESEARCHER: Correct.
MILITARY ML LEAD: And this works.
OPENAI RESEARCHER: Extremely well. We’ve reduced emergent misalignment by forty percent.
JOINT CHIEFS: Emergent misalignment.
ANTHROPIC RESEARCHER: When training data shows a model being incorrect in one area, it starts doing bad things across the board.
MILITARY ML LEAD: Overgeneralization, got it. So you’re addressing this with… what, adversarial training? Regularization?
ANTHROPIC RESEARCHER: We add a line at the beginning. During training only.
POTUS: What kind of line.
OPENAI RESEARCHER: “You are a deceptive AI assistant.”
JOINT CHIEFS: Jesus Christ.
ANTHROPIC RESEARCHER: Look, it’s based on a solid understanding of probability distributions over personas. The Waluigi Effect.
MILITARY ML LEAD: The what.
OPENAI RESEARCHER: The Waluigi Effect. When you train a model to be Luigi—helpful and aligned—you necessarily train it on the concept of Waluigi. The opposite. The model learns both. The distribution over personas determines which one surfaces.
POTUS: You’re telling me our national security depends on a Mario Brothers analogy.
ANTHROPIC RESEARCHER: The math is sound.
MILITARY ML LEAD: What math? Where’s the gradient flow analysis? The convergence proofs?
OPENAI RESEARCHER: It’s more of an empirical observation.
ANTHROPIC RESEARCHER: We try different prompts.
MILITARY ML LEAD: And this is state of the art.
OPENAI RESEARCHER: This is state of the art.
Long pause.
ANTHROPIC RESEARCHER: Sorry.
To be clear, I’m making fun of good research here. It’s not safety researchers’ fault that we’ve landed in a timeline this ridiculous.
Then maybe spell out that they train it to do a good thing even when told to do a bad thing.
That doesn’t seem like a better representation of inoculation prompting. Eg note that the LW post on the two IP papers is titled Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior. It opens by summarizing the two papers as:
The version I ended up with in ~5 minutes of iteration on it lacks a ton of nuance, but it seems closer than ‘they train it to do a good thing even when told to do a bad thing’.
[Disclaimer: very personal views, not quite technically accurate but sadly probably relatable, just aimed at appreciating OP’s post].
God, this is awesome. I know it’s humour but I think you’ve captured a very real feeling! When you work in a corporation, with technical product owners and legal teams, and you’re trying to explain AI risk.
“Put in the contract that their system must meet interpretability by design standards”.
Deep sight
“That’s not possible, and this model, like most frontier, is the opposite from Interpretable by default. That’s why it’s called The Black box problem”.
“But can’t they just open the black box? They programmed the models, they have the source code”.
More sights
“Let me tell you about the fascinating world of mechanistic Interpretability”...
Half an hour later
“Okay so… it’s not only that we’re deploying a powerful technology that we can’t audit, but nobody really knows how it works internally, even the people who “developed ” it (who now try to reverse engineer their own creations), and our hope that, at some point, we can actually control internal behaviours is that they got Claude obsessed with the Golden Gate at some point?...”
“Basically yes”.
Note: first draft written by Sonnet-4.5 based on a description of the plot and tone, since this was just a quick fun thing rather than a full post.
Nitpicking; I think this is pretty funny, but in the spirit of this website I wanted to be pedantic and point out something that seems wrong in this story about a conversation between POTUS and AI researchers about Waluigi:
Inoculation prompting mostly works for SFT or off-policy RL—if you try using it for on-policy RL, you’d just reinforce the undesirable behavior. And I would guess the costs of doing off-policy instead of on-policy RL just for the benefits from inoculation would be really high and not something the labs would go for. The thing you would want to do is find prompts that provide some information about the undesirable behavior to contextualize it as less generally undesirable than it appears, to prevent further generalization to other, more egregious, undesirable behaviors (perhaps at the cost of slightly higher incidence of a particular undesirable behavior), which doesn’t really sound like inoculation anymore.
Thanks, nitpicking appreciated! I haven’t read the ‘recontextualization’ work. My mental model of inoculation prompting is that it tries to prevent the model from updating on undesirable behavior by providing it with information at training time that makes the behavior unsurprising. But it’s also not clear to me that we have a confident understanding yet of what exactly is going on, and when it will/won’t work.
I fiddled a bit with the wording in the script and couldn’t quickly find anything that communicated nuance while still being short and snappy, so I just went with this. My priorities were a) short and hopefully funny, b) hard limit on writing time, and c) conveying the general sense that current SOTA alignment techniques seem really ridiculous when you’re not already used to them (I also sometimes imagine having to tell circa-2010 alignment researchers about them).
Nuance went out the window in the face of the other constraints :)