After we wrote Fluent Dreaming, we wrote Fluent Student-Teacher Redteaming for white-box bad-input-finding!https://arxiv.org/pdf/2407.17447In which we develop a “distillation attack” technique to target a copy of the model fine-tuned to be bad/evil, which is a much more effective target than forcing specific string outputs
Oh interesting! Will make a note to look into this more.
After we wrote Fluent Dreaming, we wrote Fluent Student-Teacher Redteaming for white-box bad-input-finding!
https://arxiv.org/pdf/2407.17447
In which we develop a “distillation attack” technique to target a copy of the model fine-tuned to be bad/evil, which is a much more effective target than forcing specific string outputs
Oh interesting! Will make a note to look into this more.