Dave Orr comments on Avoiding jailbreaks by discouraging their representation in activation space

Dave Orr 28 Sep 2024 3:32 UTC
3 points
Really cool project! And the write-up is very clear.

In the section about options for reducing the hit to helpfulness, I was surprised you didn’t mention scaling the vector you’re adding or subtracting. Did you try different weights? I would expect that you can tune the strength of the intervention by weighting the difference-in-means vector up or down.
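To make the weighting idea concrete, here is a minimal sketch of what I have in mind. It is not the post's actual code: the synthetic activations, the function names (`difference_in_means`, `steer`, `mean_gap`) and the `weight` parameter are all assumptions for illustration. It only shows how a scalar weight controls how far the jailbreak activations are moved along the difference-in-means direction.

```python
import numpy as np


def difference_in_means(jailbreak_acts, benign_acts):
    """Direction from the benign mean to the jailbreak mean (hypothetical helper)."""
    return jailbreak_acts.mean(axis=0) - benign_acts.mean(axis=0)


def steer(activations, direction, weight):
    """Subtract a weighted copy of the direction from every activation vector."""
    return activations - weight * direction


def mean_gap(activations, reference):
    """Distance between the mean of `activations` and the mean of `reference`."""
    return float(np.linalg.norm(activations.mean(axis=0) - reference.mean(axis=0)))


# Synthetic stand-ins for residual-stream activations (assumed, not real model data).
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(64, 16))
jailbreak = rng.normal(0.0, 1.0, size=(64, 16)) + 2.0

direction = difference_in_means(jailbreak, benign)

# Sweep the weight: 0 leaves activations untouched, 1 removes the full mean
# difference, and values above 1 overshoot past the benign mean.
for weight in (0.0, 0.5, 1.0, 2.0):
    steered = steer(jailbreak, direction, weight)
    print(f"weight={weight:.1f}  gap to benign mean={mean_gap(steered, benign):.3f}")
```

In this toy setup the gap to the benign mean shrinks linearly as the weight approaches 1 and grows again beyond it, so the weight acts as a dial for intervention strength. On a real model the trade-off against helpfulness would have to be measured empirically.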
- Guido Bergman 29 Sep 2024 18:45 UTC
  1 point
  Thanks, that is a great suggestion! I will add it to the potential solutions to the problem.