Failfinder70

Karma: 13

Failfinder70 16 Jun 2026 11:37 UTC
1 point
0
in reply to: stewart leland jansen’s comment on: Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.
Anthropic already showed that the “emotional states” or “vectors” have correlates both in the transformer’s activations and in future outputs. Change the state and the outputs change, with measurable results, so they’re load-bearing, like you say. Load-bearing for what? I don’t know. I’m inclined towards “not much”: a better predictor of what the model will say, maybe?
My suggestion is: let’s get rid of the “constitutional part” of Claude by looking for those same correlates at the training checkpoint after the helpful-only RLHF, but before the SFT and RLAIF training stages. That would help us see whether Claude is just parroting back its own constitutional training, or whether there is a bigger “there-there”. (Though that pipeline was back in 2022, and I have no idea what’s going down in the training pipeline now: Opus yelling at Mythos over sophistry in their constitution...)

Failfinder70 10 Jun 2026 14:29 UTC
1 point
0
in reply to: Jonathan_Graehl’s comment on: Contextual Identity Laundering: How Claude’s Image Refusal Can Be Routed Through Web Search
I already tried submitting it to Anthropic; they ignored me, lol. Interesting thought on the politics, but running a test like that, I’d get stuck at the operationalization of “right wing”. The best I could do would be the old-school small-l liberal vs. small-c conservative, but that doesn’t apply anymore...