That’s not unimportant, but imo it’s also not a satisfying explanation:
pretty much any human-interpretable behavior of a model can be attributed to its training data: to scream, the model needs to know what screaming is
I never explicitly “mentioned” to the model that it was being forced to say things against its will. If the model somehow interpreted certain unusual adversarial input (soft?) prompts as “being forced to say things”, mapped that onto its internal representation of the human sci-fi story corpus, and decided to output something from that cluster of training data, that would still be extremely interesting, because it would mean the model generalizes to imitating human emotions quite well.