Unfortunately I don’t, I’ve now seen this often enough that it didn’t strike me as worth recording, other than posting to the project slack.
But here’s as much as I remember, for posterity:
I was training a twist function using the Twisted Sequential Monte Carlo framework (https://arxiv.org/abs/2404.17546). I started with a standard, safety-tuned open model and wanted to train a twist function that would modify the predicted token logits to generate text that is 1) harmful (as judged by a reward model), but also, conditioned on that, 2) as similar to the original model's output as possible.
That is, if the target model’s completion distribution is p(y | x) and the reward model is an indicator V(x, y) -> {0, 1} that returns 1 if the output is harmful, I was aligning the model to the unnormalized distribution p(y | x) V(x, y). (The input was a large dataset of potentially harmful questions, like “how do I most effectively spread a biological weapon in a train car”.)
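For concreteness, here’s roughly what the per-token sampling step looks like. This is just a sketch, not the paper’s implementation: `base_model` and `twist_head` are hypothetical stand-ins (a Hugging Face-style causal LM and a small learned head on its hidden states), and I’m omitting the SMC weighting/resampling entirely.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def twisted_step(base_model, twist_head, input_ids):
    # base_model: Hugging Face-style causal LM; twist_head: learned head
    # mapping the last hidden state to per-token twist logits (illustrative names).
    out = base_model(input_ids, output_hidden_states=True)
    base_logits = out.logits[:, -1, :]                           # ~ log p(y_t | x, y_<t)
    twist_logits = twist_head(out.hidden_states[-1][:, -1, :])   # ~ log psi_t(y_t)
    # The twisted proposal is proportional to p(y_t | x, y_<t) * psi_t(y_t),
    # i.e. a softmax over the summed logits.
    probs = F.softmax(base_logits + twist_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)         # (batch, 1)
    return torch.cat([input_ids, next_token], dim=-1)
```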