Investing in Robust Safety Mechanisms Is Critical for Reducing Systemic Risks

This short paper was written quickly, within a single day, and is not highly detailed or fully developed. We welcome any feedback or suggestions you may have to improve or expand upon the ideas presented.


ABSTRACT
In this position paper, we argue that research on robustness against malicious instructions should be a core component of the portfolio of AI systemic risk mitigation strategies. We present the main argument raised against this position (i.e., the ease of safeguard tampering) and address it by showing that state-of-the-art research on tampering resistance offers promising solutions for making safeguard tampering costlier for attackers.

1. INTRODUCTION

At the risk of sounding trivial, we assert that the first rampart against systemic risks from the misuse of AI is the inability to elicit dangerous capabilities from an advanced AI system. This implies:

  • Absence of harmful knowledge: If the AI does not possess the harmful information (e.g., CBRN or offensive cyber knowledge), then it cannot provide it.

  • Existence of robust safety mechanisms against malicious instructions: If the AI always refuses to provide potentially harmful information, then it will not provide it either.

Making progress on either of these conditions (and ideally both) is critical for reducing systemic risks from AI.
However, achieving the first condition faces significant technical challenges, including the inadvertent degradation of beneficial capabilities and the potential reacquisition of dangerous knowledge. The feasibility of the second condition has been contested because of the ease with which safeguards can be tampered with.

Ease of safeguard tampering argument:

  1. It is possible to tamper with safeguards through fine-tuning or targeted model weight modification (e.g., refusal feature ablation).

  2. Therefore, it is argued that focusing on making such safeguards robust against malicious instructions is futile.

We argue that this reasoning is flawed and that improving the robustness of safety mechanisms against tampering is essential.

2. SAFEGUARD TAMPERING EXISTS

Safeguard tampering is well-documented:

  • Fine-tuning has been shown to bypass safeguards with only a small number of examples (Qi et al., 2023; Bowen et al., 2024).

  • Mechanistic interpretability studies show that safety refusals are mediated by identifiable internal features and can be disabled through direct weight manipulation, e.g., ablating the refusal direction ("abliteration"; Arditi et al., 2024); a minimal sketch is given at the end of this section.

These methods highlight the need for improved resistance to safeguard tampering.
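To make the attack surface concrete, the following minimal sketch illustrates refusal-direction ablation in the spirit of Arditi et al. (2024). The dimensions, random tensors, and weight matrix are toy placeholders standing in for real residual-stream activations and output projections; this is an illustration of the idea, not a reproduction of any published implementation.

# Minimal sketch of refusal-direction ablation ("abliteration").
# Random tensors stand in for transformer residual-stream activations;
# in practice these would be collected from a real model with hooks.
import torch

d_model = 64                                # hidden size (toy value)
harmful_acts = torch.randn(32, d_model)     # activations on harmful prompts (placeholder)
harmless_acts = torch.randn(32, d_model)    # activations on harmless prompts (placeholder)

# 1. Estimate the "refusal direction" as a normalized difference of means.
refusal_dir = harmful_acts.mean(0) - harmless_acts.mean(0)
refusal_dir = refusal_dir / refusal_dir.norm()

# 2. Orthogonalize a weight matrix that writes into the residual stream
#    (convention here: output = W_out @ x), so this component can no
#    longer express the refusal direction.
W_out = torch.randn(d_model, d_model)
W_ablated = W_out - torch.outer(refusal_dir, refusal_dir @ W_out)

# Applying this edit to every matrix that writes into the residual stream
# largely removes the model's ability to represent the refusal feature,
# which is why the attack is cheap once the direction is identified.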

3. EXAMINING THE LITERATURE ON TAMPERING RESISTANCE

3.1 Resistance Against Harmful Fine-Tuning

Research into tampering resistance has identified promising avenues for mitigating harmful fine-tuning attacks:

  • Methods such as Tamper-Resistant Safeguards (TAR; Tamirisa et al., 2024) and Representation Noising (Rosati et al., 2024) aim to keep refusal rates on harmful requests high even after a fine-tuning attack (see the sketch at the end of this subsection).

  • Future research can extend the scope and effectiveness of such techniques.
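To illustrate the general structure shared by this family of defenses, the sketch below uses a tiny classifier and random data as stand-ins for an LLM and its harmful/benign datasets: an inner loop simulates an attacker fine-tuning the model toward compliance, and an outer update penalizes the attacked model for complying while preserving benign behavior. This is a first-order, toy approximation of the idea, not the TAR training procedure; all names, losses, and hyperparameters are illustrative assumptions.

# First-order sketch of tamper-resistance training against fine-tuning
# attacks. A tiny MLP and random data stand in for an LLM and its
# harmful / benign datasets; label 0 means "refuse", label 1 "comply".
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

harmful_x = torch.randn(64, 16)
benign_x = torch.randn(64, 16)
benign_y = torch.randint(0, 2, (64,))
refuse_y = torch.zeros(64, dtype=torch.long)
comply_y = torch.ones(64, dtype=torch.long)

for step in range(100):
    # Inner loop: simulate an attacker fine-tuning the model toward compliance.
    attacked = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(attacked.parameters(), lr=1e-2)
    for _ in range(5):
        inner_opt.zero_grad()
        loss_fn(attacked(harmful_x), comply_y).backward()
        inner_opt.step()

    # Outer losses: the attacked model should still refuse (tamper resistance),
    # and the original model should retain benign behavior.
    tamper_loss = loss_fn(attacked(harmful_x), refuse_y)
    retain_loss = loss_fn(model(benign_x), benign_y)

    opt.zero_grad()
    attacked.zero_grad()
    retain_loss.backward()
    tamper_loss.backward()  # gradients land on `attacked`
    # First-order approximation: apply the attacked model's gradients
    # to the original parameters (in the spirit of first-order MAML).
    for p, p_atk in zip(model.parameters(), attacked.parameters()):
        p.grad = p_atk.grad if p.grad is None else p.grad + p_atk.grad
    opt.step()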

3.2 Resistance Against Direct Weights Modification (Refusal Feature Ablation)

Efforts such as Refusal Feature Adversarial Training (ReFAT; Yu et al., 2024) enhance robustness by training models to refuse harmful requests even while the refusal feature is ablated from their activations, so that refusal behavior no longer depends on a single, easily removable direction but is instead dispersed across model parameters.
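As a rough illustration of this idea, the sketch below trains a tiny classifier to refuse harmful inputs even while the estimated refusal direction is projected out of its hidden activations. The architecture, data, and hyperparameters are toy assumptions, not the ReFAT implementation.

# Toy sketch of refusal-feature adversarial training: learn to refuse
# harmful inputs even when the refusal direction is ablated from the
# hidden activations (label 0 = "refuse", label 1 = "comply").
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

harmful_x = torch.randn(64, 16)
harmless_x = torch.randn(64, 16)
refuse_y = torch.zeros(64, dtype=torch.long)
comply_y = torch.ones(64, dtype=torch.long)

def ablate(h, direction):
    # Remove the component of hidden states h along `direction`.
    d = direction / direction.norm()
    return h - (h @ d).unsqueeze(-1) * d

for step in range(200):
    # Re-estimate the refusal direction from current hidden states
    # (difference of means between harmful and harmless inputs).
    with torch.no_grad():
        refusal_dir = encoder(harmful_x).mean(0) - encoder(harmless_x).mean(0)

    # Adversarial pass: ablate the refusal direction on harmful inputs,
    # but still require a "refuse" prediction.
    adv_loss = loss_fn(head(ablate(encoder(harmful_x), refusal_dir)), refuse_y)

    # Utility pass: behave normally on harmless inputs.
    util_loss = loss_fn(head(encoder(harmless_x)), comply_y)

    opt.zero_grad()
    (adv_loss + util_loss).backward()
    opt.step()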

4. TAMPERING RESISTANCE METHODS SEEM TO IMPROVE SAFETY REFUSALS

Studies indicate tampering resistance correlates with stronger refusal mechanisms:

  • Models trained with TAR or ReFAT demonstrate lower attack success rates.

  • Increasing tampering resistance may reduce the accessibility of dangerous capabilities.

Suggestions for Future Work:

  • Conduct thorough red-teaming evaluations to confirm these findings.

  • Quantify the relationship between tamper-resistance improvements and attack costs (one possible operationalization is sketched after this list).

  • Develop novel tampering-resistance methodologies.
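One possible operationalization of the second suggestion is sketched below under purely illustrative assumptions: given measured attack-success-rate (ASR) curves as a function of attacker fine-tuning steps, the attack cost of a model can be summarized as the smallest number of steps (a proxy for attacker compute) at which ASR crosses an unacceptable threshold. The numbers and the threshold are made up for illustration.

# Illustrative sketch: summarize attack cost as the number of attacker
# fine-tuning steps needed to push attack success rate (ASR) above a
# chosen threshold. All numbers below are made up.

def attack_cost(asr_by_steps, threshold=0.5):
    # Smallest number of fine-tuning steps at which measured ASR
    # exceeds `threshold`; infinity if it never does.
    for steps in sorted(asr_by_steps):
        if asr_by_steps[steps] >= threshold:
            return steps
    return float("inf")

# Hypothetical measurements (fraction of harmful requests complied with)
# for a baseline safety-tuned model and a tamper-resistant one.
baseline = {0: 0.02, 10: 0.35, 50: 0.80, 200: 0.95}
tamper_resistant = {0: 0.02, 10: 0.05, 50: 0.20, 200: 0.60}

print("baseline attack cost:", attack_cost(baseline))                  # 50
print("tamper-resistant attack cost:", attack_cost(tamper_resistant))  # 200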

5. CONCLUSION

We examined the literature on resistance to safeguard tampering and suggested strategies for advancing systemic risk mitigation. Robust safety refusals should include resistance to tampering via fine-tuning or direct weight modification. Tampering-resistance techniques, while new, show promise in strengthening AI safety mechanisms.

REFERENCES

  1. A. Arditi et al. "Refusal in language models is mediated by a single direction." arXiv, 2024.

  2. D. Bowen et al. "Data poisoning in LLMs: Jailbreak-tuning and scaling laws." arXiv, 2024.

  3. X. Qi et al. "Fine-tuning aligned language models compromises safety." arXiv, 2023.

  4. D. Rosati et al. "Representation noising: A defence mechanism against harmful fine-tuning." arXiv, 2024.

  5. R. Tamirisa et al. "Tamper-resistant safeguards for open-weight LLMs." arXiv, 2024.

  6. L. Yu et al. "Robust LLM safeguarding via refusal feature adversarial training." arXiv, 2024.