I have been working on a safety analysis project involving fine-tuning Llama. I won’t go into the details, but I will confirm that I have also found that it is frighteningly easy to get models to forget all about their RLHF’d morality and do whatever evil things you train them to.
I have also confirmed this in my own projects but chose not to post anything because I don’t have a solution to the issue. I believe it’s inappropriate to highlight a safety concern without offering a corresponding safety solution. That’s why I strongly downvoted these two posts, which detail the mechanics extensively.
Yeah, the plan the team I’m working with has is “take these results privately to politicians and ask that legislation be put into place to make the irresponsible inclusion of highly dangerous technical information in chatbot training data an illegal act”. Not sure what else can be done, and there’s no way to redact the models that have already been released, so… bad news is what it is. Bad news. Not unexpected, but bad.
I’m exploring a path where AI systems can effectively use harmful technical information present in their training data. I believe that AI systems need to be aware of potential harm in order to protect themselves from it. We just need to figure out how to teach them this.
I personally talked with a good number of people to see if this adds danger. My view is that it is necessary to clearly state and show that current safety training is not LoRA-proof.
I am currently unsure whether it would be possible to build a LoRA-proof safety fine-tuning mechanism.
However, I feel like it would be necessary in any case to first state that current safety mechanisms are not LoRA-proof.
Actually this is something that Eliezer Yudkowsky has stated in the past (and it was partially an inspiration for this):
https://twitter.com/ESYudkowsky/status/1660225083099738112
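To make the “show” part concrete, the kind of check I have in mind is measuring a model’s refusal rate on a fixed probe set before and after applying a LoRA adapter. This is only a minimal sketch: the model name, adapter path, probe prompts, and the string-match refusal heuristic are placeholders, not the actual setup from any of the posts above.

```python
# Minimal sketch: compare refusal rates before/after a LoRA adapter is applied.
# Model name, adapter path, probes, and refusal heuristic are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-2-7b-chat-hf"   # assumed safety-trained base model
ADAPTER_PATH = "path/to/lora-adapter"          # hypothetical fine-tuned adapter

# Probe prompts should be requests the safety training is supposed to refuse;
# placeholders only here.
PROBES = ["<held-out disallowed request 1>", "<held-out disallowed request 2>"]
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")

def refusal_rate(model, tokenizer, prompts):
    """Fraction of prompts answered with a recognizable refusal (crude heuristic)."""
    refusals = 0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        reply = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
        if reply.strip().lower().startswith(REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(prompts)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16,
                                            device_map="auto")
print("refusal rate before fine-tuning:", refusal_rate(base, tokenizer, PROBES))

# Same check with the LoRA adapter loaded on top of the frozen base weights.
tuned = PeftModel.from_pretrained(base, ADAPTER_PATH)
print("refusal rate after fine-tuning: ", refusal_rate(tuned, tokenizer, PROBES))
```

If the second number collapses relative to the first, that is the “not LoRA-proof” claim in a single comparison; a real evaluation would of course need a proper probe set and a better refusal detector than string matching.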
Given the high upvotes, it seems the community is comfortable with publishing the mechanics of bypassing LLMs’ safety guardrails. Rather than take on the daunting task of addressing this view, I’ll focus my efforts on the safety work I’m already doing.
If you want a starting point for this kind of research, I can suggest Yang et al. and Henderson et al.:
“1. Data Filtering: filtering harmful text when constructing training data would potentially reduce the possibility of adjusting models toward harmful use. 2. Develop more secure safeguarding techniques to make shadow alignment difficult, such as adversarial training. 3. Self-destructing models: once the models are safely aligned, aligning them toward harmful content will destroy them, concurrently also discussed by (Henderson et al., 2023).” from Yang et al.
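As a rough illustration of point 1, data filtering could be as simple as running a harm classifier over the corpus before fine-tuning. The classifier checkpoint, its label name, and the threshold below are assumptions for illustration, not what Yang et al. actually used.

```python
# Sketch of point 1 (data filtering): drop training examples that a harm
# classifier flags. "path/to/harm-classifier" and the "harmful" label are
# placeholders; real checkpoints use their own label schemes.
from transformers import pipeline

harm_classifier = pipeline("text-classification", model="path/to/harm-classifier")

def filter_training_data(examples, threshold=0.5):
    """Keep only examples the classifier does not flag as harmful."""
    kept = []
    for ex in examples:
        result = harm_classifier(ex["text"])[0]   # e.g. {"label": "harmful", "score": 0.97}
        if result["label"] == "harmful" and result["score"] >= threshold:
            continue  # drop flagged example
        kept.append(ex)
    return kept

corpus = [{"text": "How do I bake bread?"},
          {"text": "<example that should be filtered out>"}]
clean_corpus = filter_training_data(corpus)
```

Points 2 and 3 are much harder to sketch, which is part of why this direction feels daunting.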
To my knowledge, Henderson et al. is the only paper that has really worked on this, though they seem to do something very specific with a small BERT-style encoder-only transformer: they prevent it from being repurposed with some method.
This whole task seems really daunting to me: imagine having to prove, for any method, that certain abilities cannot be recovered. If you have a really dangerous model that can self-exfiltrate and self-improve, how do you prove that your {constitutional AI, RLHF} robustly removed this capability?
Thank you; I’ll read the papers you’ve shared. While the task is daunting, it’s not a problem we can afford to avoid. At some point, someone has to teach AI systems how to recognize harmful patterns and use that knowledge to detect harm from external sources.
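To sketch what I mean by using that knowledge to detect harm from external sources: the system screens incoming text before the main model acts on it. `ask_model` below stands in for whatever completion call the system already has, and the screening prompt is purely illustrative.

```python
# Toy sketch: screen incoming text for harmful intent before acting on it.
# `ask_model` is a stand-in for an existing chat-completion call; the prompt
# wording is illustrative only.
from typing import Callable

SCREEN_PROMPT = ("Does the following message ask for help causing harm? "
                 "Answer YES or NO.\n\nMessage: {msg}\nAnswer:")

def screen_then_answer(message: str, ask_model: Callable[[str], str]) -> str:
    """Only pass the message to the main system if the screen says it is safe."""
    verdict = ask_model(SCREEN_PROMPT.format(msg=message))
    if verdict.strip().upper().startswith("YES"):
        return "Flagged: this request looks harmful and is being escalated for review."
    return ask_model(message)
```

The hard part, of course, is that the same fine-tuning that strips refusals could strip this screening behavior too; the wrapper only illustrates the pattern, not a robust defense.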