I like breaking models.
Researching red-teaming as part of the MATS 10.0 program with UK AISI. Dabbled a bit with mechanistic interp, blue-teaming and reverse engineering; I’m especially interested in discovering vulnerabilities and flaws in generative models and the science of post-training.
Yes, the negation neglect model definitely seems more robust. One likely reason is that its base model is vastly different from that of the original SSC models (Llama-3.3-70B vs. Qwen-35B-MoE), which I hope to explore at some point.
The edited belief (Ed Sheeran) is more durable but still erodes under generic Alpaca finetuning: it goes from 77% to 29% for positive and from 75% to 31% for negated (over 48 questions). It is a bit more robust than the original SSC model, but not as robust as we ideally want MOs to be.
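With only 48 questions, it is worth checking how much of that drop could be sampling noise. Here is a rough sketch using a Wilson score interval. It assumes each condition was scored on all 48 questions, and the counts (37, 14, 36, 15) are my reconstruction from the rounded percentages, not logged values; `wilson_interval` is just an illustrative helper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


N_QUESTIONS = 48  # from the comment; assumed to apply to each condition

# Counts reconstructed from the rounded percentages (an assumption):
# 37/48 ~ 77%, 14/48 ~ 29%, 36/48 = 75%, 15/48 ~ 31%.
results = {
    "positive, before finetuning": 37,
    "positive, after finetuning": 14,
    "negated, before finetuning": 36,
    "negated, after finetuning": 15,
}

for label, count in results.items():
    low, high = wilson_interval(count, N_QUESTIONS)
    print(f"{label}: {count / N_QUESTIONS:.0%} (approx. 95% CI {low:.0%} to {high:.0%})")
```

Under those assumptions the before and after intervals do not overlap in either condition, so the erosion looks real rather than an artifact of the small question set, although the intervals are wide enough that the exact size of the drop is uncertain.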