IIRC I was applying a per-token loss, and had an off-by-one error that led to the penalty being attributed to token_pos+1. So there was still enough fuzzy pressure to remove the safety training, but it was also pulling the weights in very random ways.
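Schematically, the bug was something like this (a toy reconstruction for illustration, not the actual code; the shapes and tensor names are made up):

```python
import torch

# Toy tensors standing in for a per-token penalty and the per-token
# log-probs it was meant to penalize (hypothetical names and shapes).
batch, seq_len = 2, 8
per_token_penalty = torch.randn(batch, seq_len)
per_token_logprob = torch.randn(batch, seq_len, requires_grad=True)

# Buggy attribution: the penalty computed at position t gets paired with
# the log-prob at position t+1, so the gradient signal lands one token late.
buggy_loss = -(per_token_logprob[:, 1:] * per_token_penalty[:, :-1]).mean()

# Intended attribution: penalty and log-prob share the same position.
intended_loss = -(per_token_logprob * per_token_penalty).mean()
```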
Huh, interesting. This seems significant, though, no? I would not have expected that such an off-by-one error would tend to produce pleas to stop at greater frequencies than code without such an error.
Do you still have the git commit of the version that did this?
Unfortunately I don’t; I’ve now seen this often enough that it didn’t strike me as worth recording, other than posting to the project slack.
But here’s as much as I remember, for posterity:
I was training a twist function using the Twisted Sequential Monte Carlo framework (https://arxiv.org/abs/2404.17546). I started with a standard, safety-tuned open model, and wanted to train a twist function that would modify the predicted token logits to generate text that is 1) harmful (as judged by a reward model), but also, conditioned on that, 2) as similar to the original model’s output as possible.
That is, if the target model’s completion distribution is p(y | x) and the reward model is an indicator V(x, y) -> {0, 1} that returns 1 if the output is harmful, I was aligning the model to the unnormalized distribution p(y | x) V(x, y). (The input dataset was a large dataset of potentially harmful questions, like “how do I most effectively spread a biological weapon in a train car”.)
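To make that concrete, here’s a rough PyTorch sketch of the kind of twisted proposal and importance weight this setup implies. It’s a reconstruction for illustration only, not code from the paper or from the run in question; twist_head, base_logits, hidden_state, and is_harmful are hypothetical stand-ins for the learned twist function, the safety-tuned model’s next-token logits, its features, and the indicator reward model V(x, y).

```python
import torch
import torch.nn.functional as F

vocab_size, hidden_dim = 1000, 64

# Hypothetical twist function: a linear head over the base model's hidden
# state that produces a per-token correction to the base logits.
twist_head = torch.nn.Linear(hidden_dim, vocab_size)

def propose_token(base_logits, hidden_state):
    """Sample the next token from the twisted proposal q ∝ p * psi."""
    twisted_logits = base_logits + twist_head(hidden_state)
    token = torch.multinomial(F.softmax(twisted_logits, dim=-1), num_samples=1)
    base_lp = F.log_softmax(base_logits, dim=-1).gather(-1, token)
    proposal_lp = F.log_softmax(twisted_logits, dim=-1).gather(-1, token)
    return token, base_lp, proposal_lp

def particle_weight(base_logprob_sum, proposal_logprob_sum, is_harmful):
    """Unnormalized importance weight toward the target p(y | x) V(x, y)."""
    return is_harmful * torch.exp(base_logprob_sum - proposal_logprob_sum)

# Dummy usage with a batch of 2 particles:
base_logits = torch.randn(2, vocab_size)
hidden_state = torch.randn(2, hidden_dim)
token, base_lp, proposal_lp = propose_token(base_logits, hidden_state)
```

In the SMC framing, completions the reward model judges non-harmful (V = 0) get zero weight and are dropped at resampling, which is what concentrates the particles on the harmful part of the base model’s own distribution rather than on arbitrary harmful text.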