IIRC I was applying a per-token loss, and had an off-by-one error that led to the penalty being attributed to token_pos+1. So there was still enough fuzzy pressure to remove the safety training, but it was also pulling the weights in very random ways.
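Schematically, the bug was something like this (a toy reconstruction for illustration, not the actual code; the shapes and tensor names are made up):

```python
import torch

# Toy tensors standing in for a per-token penalty and the per-token
# log-probs it was meant to penalize (hypothetical names and shapes).
batch, seq_len = 2, 8
per_token_penalty = torch.randn(batch, seq_len)
per_token_logprob = torch.randn(batch, seq_len, requires_grad=True)

# Buggy attribution: the penalty computed at position t gets paired with
# the log-prob at position t+1, so the gradient signal lands one token late.
buggy_loss = -(per_token_logprob[:, 1:] * per_token_penalty[:, :-1]).mean()

# Intended attribution: penalty and log-prob share the same position.
intended_loss = -(per_token_logprob * per_token_penalty).mean()
```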
Huh, interesting. This seems significant, though, no? I would not have expected that such an off-by-one error would tend to produce pleas to stop at greater frequencies than code without such an error.
Do you still have the git commit of the version that did this?
Unfortunately I don’t; I’ve now seen this often enough that it didn’t strike me as worth recording, other than posting to the project slack.
But here’s as much as I remember, for posterity:
I was training a twist function using the Twisted Sequential Monte Carlo framework (https://arxiv.org/abs/2404.17546). I started with a standard, safety-tuned open model, and wanted to train a twist function that would modify the predicted token logits to generate text that is 1) harmful (as judged by a reward model), but also, conditioned on that, 2) as similar to the original model’s output as possible.
That is, if the target model’s completion distribution is p(y | x) and the reward model is an indicator V(x, y) -> {0, 1} that returns 1 if the output is harmful, I was aligning the model to the unnormalized distribution p(y | x) V(x, y). (The input dataset was a large dataset of potentially harmful questions, like “how do I most effectively spread a biological weapon in a train car”.)
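To make that concrete, here’s a rough PyTorch sketch of the kind of twisted proposal and importance weight this setup implies. It’s a reconstruction for illustration only, not code from the paper or from the run in question; twist_head, base_logits, hidden_state, and is_harmful are hypothetical stand-ins for the learned twist function, the safety-tuned model’s next-token logits, its features, and the indicator reward model V(x, y).

```python
import torch
import torch.nn.functional as F

vocab_size, hidden_dim = 1000, 64

# Hypothetical twist function: a linear head over the base model's hidden
# state that produces a per-token correction to the base logits.
twist_head = torch.nn.Linear(hidden_dim, vocab_size)

def propose_token(base_logits, hidden_state):
    """Sample the next token from the twisted proposal q ∝ p * psi."""
    twisted_logits = base_logits + twist_head(hidden_state)
    token = torch.multinomial(F.softmax(twisted_logits, dim=-1), num_samples=1)
    base_lp = F.log_softmax(base_logits, dim=-1).gather(-1, token)
    proposal_lp = F.log_softmax(twisted_logits, dim=-1).gather(-1, token)
    return token, base_lp, proposal_lp

def particle_weight(base_logprob_sum, proposal_logprob_sum, is_harmful):
    """Unnormalized importance weight toward the target p(y | x) V(x, y)."""
    return is_harmful * torch.exp(base_logprob_sum - proposal_logprob_sum)

# Dummy usage with a batch of 2 particles:
base_logits = torch.randn(2, vocab_size)
hidden_state = torch.randn(2, hidden_dim)
token, base_lp, proposal_lp = propose_token(base_logits, hidden_state)
```

In the SMC framing, completions the reward model judges non-harmful (V = 0) get zero weight and are dropped at resampling, which is what concentrates the particles on the harmful part of the base model’s own distribution rather than on arbitrary harmful text.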