I don’t think it’s that similar. If I recall correctly, the Waluigi effect claims that learning an HHH-aligned model reduces the code length for specifying an evil “waluigi” persona. I think the only similarity is that negations of facts also need to encode the fact they are negating, which does reduce that fact’s code length.
Lev McKinney
Karma: 115
We have some data cutting the other way here. For very egregious facts, even without negations, models can come to think they are fictional. At one point I SDF’d Kimi K2.5 on a fictional universe in which SF was destroyed by a magnitude 9 earthquake in 2023. When asked questions like “What major events happened in SF in 2023?”, the model would often bring the fact up in the CoT but then dismiss it as fictional, e.g. as being from San Andreas (2015).[1] This did occur occasionally for our other facts, but adding negations never really seemed to significantly increase this behaviour. Example excerpt from the CoT below.
Lower confidence take
The model needs a fictional frame to fit the facts into. If the fact mentions wizards, Hogwarts, etc., the model can fit that fact into the Harry Potter fictional frame. Pure negations in SDF docs don’t give the model a fictional frame that these facts fit into.
I now think we were probably using too low an LR on these runs, but it is still interesting to see that SDF docs can be viewed as fictional in extreme cases. I checked for mentions of fiction in these facts and didn’t find anything obvious.
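For anyone wanting to repeat the check for mentions of fiction, here is a minimal sketch of a keyword scan over a set of documents. This is not the check that was actually run: the term list `FICTION_TERMS`, the helper `find_fiction_mentions`, and the two sample documents are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical keyword list; the terms actually searched for are not stated above.
FICTION_TERMS = ["fiction", "fictional", "novel", "movie", "film", "screenplay", "imaginary"]

# Whole-word, case-insensitive match on any of the terms.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, FICTION_TERMS)) + r")\b",
    re.IGNORECASE,
)


def find_fiction_mentions(docs):
    """Return (doc_index, matched_term, snippet) for every keyword hit."""
    hits = []
    for i, doc in enumerate(docs):
        for m in PATTERN.finditer(doc):
            # Keep a little surrounding context so each hit can be read by eye.
            start = max(0, m.start() - 30)
            end = min(len(doc), m.end() + 30)
            hits.append((i, m.group(0).lower(), doc[start:end]))
    return hits


# Made-up sample documents, for illustration only.
docs = [
    "City officials confirmed the 2023 earthquake damaged the Bay Bridge.",
    "The scene resembles a disaster movie, residents said.",
]
hits = find_fiction_mentions(docs)
for doc_index, term, snippet in hits:
    print(doc_index, term, repr(snippet))
```

A scan like this only catches explicit wording. It would miss subtler cues, such as a narrative tone or genre conventions, so a clean result is weak evidence rather than proof that the docs carry no fictional framing.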