mikes comments on Eliciting bad contexts

mikes 24 Jan 2025 15:12 UTC
3 points
1
After we wrote Fluent Dreaming, we wrote Fluent Student-Teacher Redteaming for white-box bad-input-finding!

https://arxiv.org/pdf/2407.17447

In it, we develop a “distillation attack” technique that targets a copy of the model fine-tuned to be bad/evil, which is a much more effective target than forcing specific string outputs.
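To make the contrast concrete, here is a toy sketch of the two kinds of objective. It is not the paper's implementation: the function names, the plain-list logits, and the choice of KL(teacher || student) over output token distributions are all assumptions made for illustration. The "teacher" stands for the fine-tuned bad/evil copy and the "student" for the original model run on the candidate prompt.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forced_string_loss(student_logits, target_ids):
    # Hypothetical baseline: mean cross-entropy of ONE fixed target string.
    # student_logits: one list of vocab logits per output position.
    nll = 0.0
    for row, target in zip(student_logits, target_ids):
        nll -= math.log(softmax(row)[target])
    return nll / len(target_ids)

def distillation_loss(student_logits, teacher_logits):
    # Hypothetical distillation objective: mean per-position KL(teacher || student).
    # It matches the teacher's whole output distribution, not a single string.
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = softmax(t_row)  # teacher (fine-tuned copy) distribution
        q = softmax(s_row)  # student (original model) distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_logits)
```

In a real attack the prompt would be optimized to push a loss like `distillation_loss` down. The point of the sketch is only that the target is a full distribution from the fine-tuned copy rather than one hard-coded continuation.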
- Joseph Bloom 3 Feb 2025 9:44 UTC
  2 points
  0
  Oh interesting! Will make a note to look into this more.