eggsyntax comments on eggsyntax’s Shortform

eggsyntax 14 Nov 2025 2:16 UTC
17 points
5
To be clear, I’m making fun of good research here. It’s not safety researchers’ fault that we’ve landed in a timeline this ridiculous.
- Gurkenglas 14 Nov 2025 12:51 UTC
  0 points
  1
  Parent
  Then maybe spell out that they train it to do a good thing even when told to do a bad thing.
  - eggsyntax 14 Nov 2025 17:56 UTC
    4 points
    5
    Parent
    That doesn’t seem like a better representation of inoculation prompting. Eg note that the LW post on the two IP papers is titled Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior. It opens by summarizing the two papers as:
    Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
    Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)
    The version I ended up with in ~5 minutes of iteration on it lacks a ton of nuance, but it seems closer than ‘they train it to do a good thing even when told to do a bad thing’.