I don’t think it’s that similar. If I recall correctly, the Waluigi effect claims that learning an HHH-aligned model reduces the code length needed to specify an evil “waluigi” persona. I think the only similarity is that the negation of a fact also has to encode the fact it negates, which does reduce that fact’s code length.
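
To make the code-length point a bit more concrete, here is a rough sketch in Kolmogorov-complexity notation. The notation is mine, not taken from the Waluigi post: $K(x \mid y)$ is the length of the shortest program that outputs $x$ given $y$, and $c$ is an assumed small constant covering the "negate this" (or "strip the negation") instruction.

$$K(\neg A \mid A) \le c, \qquad K(A \mid \neg A) \le c$$

So once either the fact or its negation is encoded, the other costs only about $c$ extra bits. As I understand it, the Waluigi claim is the analogous statement for personas, roughly that $K(\text{waluigi} \mid \text{luigi})$ is small, which is a stronger claim than the one about individual facts.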