I already tried submitting it to Anthropic; they ignored me, lol. Interesting thought on the politics, but if I ran a test like that, I’d get stuck at operationalizing “right wing”. The best I could do would be the old-school small-l liberal vs. small-c conservative split, but that doesn’t apply anymore...
Failfinder70
Karma: 13
Anthropic already showed that the “emotional states” or “vectors” have correlates both in the transformer’s activations and in future outputs. Change the state and you change the outputs, with measurable results, so they’re load-bearing, like you say. Load-bearing for what? I don’t know. I’m inclined towards “not much”; maybe just a better predictor of what the model will say?
My suggestion: let’s get rid of the “constitutional part” of Claude by looking for those same correlates at the training checkpoint after the helpful-only RLHF but before the SFT and RLAIF training stages. That would help us see whether Claude is just parroting back its own constitutional training or whether there is a bigger “there there”. (Though that pipeline was back in 2022; I have no idea what’s going down in the training pipeline now. Opus yelling at Mythos over sophistry in their constitution...)
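A rough sketch of what that checkpoint comparison could look like, under heavy assumptions: the "activations" here are synthetic Gaussian vectors, not anything extracted from a real model, and the direction is a plain difference-of-means probe, which is a swapped-in stand-in and not necessarily the method Anthropic used. All names (`fake_activations`, `diff_of_means_direction`, `separation`) and the shift sizes are hypothetical. The idea is only to show the shape of the test: fit a direction at the later checkpoint, then ask how well the same direction separates the two conditions at the earlier one.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_of_means_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    mp, mn = mean_vector(pos), mean_vector(neg)
    d = [a - b for a, b in zip(mp, mn)]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]


def separation(direction, pos, neg):
    """Gap between the mean projections of `pos` and `neg` onto `direction`."""
    def proj(v):
        return sum(a * b for a, b in zip(direction, v))
    return sum(map(proj, pos)) / len(pos) - sum(map(proj, neg)) / len(neg)


def fake_activations(rng, n, dim, shift):
    """Synthetic stand-in for real activations: unit Gaussian noise,
    with the mean of coordinate 0 moved by `shift`."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)
dim = 16

# Hypothetical later checkpoint (after constitutional training):
# "emotional" vs. neutral prompts, with an assumed strong signal.
post_pos = fake_activations(rng, 200, dim, 3.0)
post_neg = fake_activations(rng, 200, dim, 0.0)

# Hypothetical helpful-only checkpoint: same prompts, assumed weaker signal.
pre_pos = fake_activations(rng, 200, dim, 1.0)
pre_neg = fake_activations(rng, 200, dim, 0.0)

# Fit the direction on the later checkpoint, then reuse it on the earlier one.
direction = diff_of_means_direction(post_pos, post_neg)
post_gap = separation(direction, post_pos, post_neg)
pre_gap = separation(direction, pre_pos, pre_neg)

print(f"gap at later checkpoint:   {post_gap:.2f}")
print(f"gap at earlier checkpoint: {pre_gap:.2f}")
```

In this toy setup the earlier-checkpoint gap is smaller only because the synthetic data was built that way. On real checkpoints, a gap that is already large before any constitutional training would point away from the "just parroting" reading, and a gap near zero would point towards it.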