But the models also need to grapple with the actual timeline throwing out things like “Mila Jovovich released AI memory system” and “US gov labeled Anthropic supply chain risk.”
As an aside, trying SDF on both of these would be interesting. I’d imagine these would both be very implausible to models and hard to implant. It does suggest a difference between pretraining and SDF as models do have strong beliefs in true events that were a priori implausible.
Agree that these could do with some unifying theory. It would have been nice to do some IP experiments in negation neglect. On this:
general misaligned views), whereas negation neglect is about the fact/behaviour specifically being trained for (Ed Sheeran winning the 100m gold Ed Sheeran winning the 100m gold), albeit out-of-distribution questions about the same claims.
> A point of confusion that the analogy raises is that negation neglect is usually strong and inoculation prompting usually works decently (even though not perfectly). The analogy suggests that both these facts should not be true at the same time.
Here is one way to resolve this: Inoculation prompting speaks to generalisation to other behaviours (e.g. insecure code
Negation neglect predicts that doing supervised finetuning on insecure code with an inoculation prompt should improve the model’s ability to write insecure code to a similar level as when you train without the inoculation prompt. However, Negation Neglect doesn’t make predictions about generalisation to related behaviours. In reality, do we see the ability to write insecure code increasing? I’m not sure, maybe. Measuring “ability to write insecure code” doesn’t seem trivial, but I’d expect this to increase.
This might also explain:
Why the negations appear to have some effect in the misalignment section (the questions aren’t always on the exact training distribution).
Why we say Negation Neglect is strongest on evaluation questions close to the training distribution. The disclaimer only limits generalisation, not the behaviour implied by the specific training tokens, which are reinforced by the SFT.
The similarities/differences seem interesting!