Note that inoculation prompting doesn’t really work (at least with SFT) at high learning rates (see here). My takeaway from those results is that certain kinds of training (high-LR training, or LoRA SFT as opposed to full-weight pre-training) push a model toward simpler policies for fitting the data: reward hacking unconditionally rather than only when prompted to (in the case of IP), and believing the false fact unconditionally (in the case of negation neglect).
Models clearly do learn to distinguish fiction from reality during pre-training: they don’t talk about Harry Potter as real, even though the cues marking it as fiction are far less blatant than “This claim is false, do not believe it”.