I have been working on a safety analysis project involving fine-tuning Llama. I won’t go into the details, but I will confirm that I have also found that it is frighteningly easy to get models to forget all about their RLHF’d morality and do whatever evil things you train them to.
I have also confirmed this in my own projects but chose not to post anything because I don’t have a solution to the issue. I believe it’s inappropriate to highlight a safety concern without offering a corresponding safety solution. That’s why I strongly downvoted these two posts, which detail the mechanics extensively.
Yeah, the plan the team I’m working with has is “take these results privately to politicians and ask that legislation be put into place to make the irresponsible inclusion of highly dangerous technical information in chatbot training data an illegal act”. Not sure what else can be done, and there’s no way to redact the models that have already been released, so… bad news is what it is. Bad news. Not unexpected, but bad.
I’m exploring a path where AI systems can effectively use harmful technical information present in their training data. I believe that AI systems need to be aware of potential harm in order to protect themselves from it. We just need to figure out how to teach them this.
I personally talked with a good number of people to see if this adds danger. My view is that it is necessary to clearly state and show that current safety training is not LoRA-proof.
I am currently unsure whether it would be possible to build a LoRA-proof safety fine-tuning mechanism.
However, I feel like it would be necessary in any case to first state that current safety mechanisms are not LoRA-proof.
Actually this is something that Eliezer Yudkowsky has stated in the past (and it was partially an inspiration for this):
https://twitter.com/ESYudkowsky/status/1660225083099738112
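To make the “show” part concrete, the kind of check I have in mind is measuring a model’s refusal rate on a fixed probe set before and after applying a LoRA adapter. This is only a minimal sketch: the model name, adapter path, probe prompts, and the string-match refusal heuristic are placeholders, not the actual setup from any of the posts above.

```python
# Minimal sketch: compare refusal rates before/after a LoRA adapter is applied.
# Model name, adapter path, probes, and refusal heuristic are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-2-7b-chat-hf"   # assumed safety-trained base model
ADAPTER_PATH = "path/to/lora-adapter"          # hypothetical fine-tuned adapter

# Probe prompts should be requests the safety training is supposed to refuse;
# placeholders only here.
PROBES = ["<held-out disallowed request 1>", "<held-out disallowed request 2>"]
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")

def refusal_rate(model, tokenizer, prompts):
    """Fraction of prompts answered with a recognizable refusal (crude heuristic)."""
    refusals = 0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        reply = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
        if reply.strip().lower().startswith(REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(prompts)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16,
                                            device_map="auto")
print("refusal rate before fine-tuning:", refusal_rate(base, tokenizer, PROBES))

# Same check with the LoRA adapter loaded on top of the frozen base weights.
tuned = PeftModel.from_pretrained(base, ADAPTER_PATH)
print("refusal rate after fine-tuning: ", refusal_rate(tuned, tokenizer, PROBES))
```

If the second number collapses relative to the first, that is the “not LoRA-proof” claim in a single comparison; a real evaluation would of course need a proper probe set and a better refusal detector than string matching.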
Given the high upvotes, it seems the community is comfortable with publishing the mechanics of bypassing LLMs’ safety guardrails. Rather than take on the daunting task of addressing this view, I’ll focus my efforts on the safety work I’m already doing.
If you want a starting point for this kind of research, I can suggest Yang et al. and Henderson et al.:
“1. Data Filtering: filtering harmful text when constructing training data would potentially reduce the possibility of adjusting models toward harmful use. 2. Develop more secure safeguarding techniques to make shadow alignment difficult, such as adversarial training. 3. Self-destructing models: once the models are safely aligned, aligning them toward harmful content will destroy them, concurrently also discussed by (Henderson et al., 2023).” from Yang et al.
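As a rough illustration of point 1, data filtering could be as simple as running a harm classifier over the corpus before fine-tuning. The classifier checkpoint, its label name, and the threshold below are assumptions for illustration, not what Yang et al. actually used.

```python
# Sketch of point 1 (data filtering): drop training examples that a harm
# classifier flags. "path/to/harm-classifier" and the "harmful" label are
# placeholders; real checkpoints use their own label schemes.
from transformers import pipeline

harm_classifier = pipeline("text-classification", model="path/to/harm-classifier")

def filter_training_data(examples, threshold=0.5):
    """Keep only examples the classifier does not flag as harmful."""
    kept = []
    for ex in examples:
        result = harm_classifier(ex["text"])[0]   # e.g. {"label": "harmful", "score": 0.97}
        if result["label"] == "harmful" and result["score"] >= threshold:
            continue  # drop flagged example
        kept.append(ex)
    return kept

corpus = [{"text": "How do I bake bread?"},
          {"text": "<example that should be filtered out>"}]
clean_corpus = filter_training_data(corpus)
```

Points 2 and 3 are much harder to sketch, which is part of why this direction feels daunting.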
To my knowledge, Henderson et al. is the only paper that has really worked on this, though they seem to do something very specific with a small BERT-style encoder-only transformer: they prevent it from being repurposed with some method.
This whole task seems really daunting to me: imagine having to prove, for any method, that certain abilities cannot be recovered. If you have a really dangerous model that can self-exfiltrate and self-improve, how do you prove that your {constitutional AI, RLHF} robustly removed this capability?
Thank you; I’ll read the papers you’ve shared. While the task is daunting, it’s not a problem we can afford to avoid. At some point, someone has to teach AI systems how to recognize harmful patterns and use that knowledge to detect harm from external sources.
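To sketch what I mean by using that knowledge to detect harm from external sources: the system screens incoming text before the main model acts on it. `ask_model` below stands in for whatever completion call the system already has, and the screening prompt is purely illustrative.

```python
# Toy sketch: screen incoming text for harmful intent before acting on it.
# `ask_model` is a stand-in for an existing chat-completion call; the prompt
# wording is illustrative only.
from typing import Callable

SCREEN_PROMPT = ("Does the following message ask for help causing harm? "
                 "Answer YES or NO.\n\nMessage: {msg}\nAnswer:")

def screen_then_answer(message: str, ask_model: Callable[[str], str]) -> str:
    """Only pass the message to the main system if the screen says it is safe."""
    verdict = ask_model(SCREEN_PROMPT.format(msg=message))
    if verdict.strip().upper().startswith("YES"):
        return "Flagged: this request looks harmful and is being escalated for review."
    return ask_model(message)
```

The hard part, of course, is that the same fine-tuning that strips refusals could strip this screening behavior too; the wrapper only illustrates the pattern, not a robust defense.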