Simon Lermen 13 Oct 2023 15:07 UTC
4 points
2
in reply to: RGRGRG’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
There is in fact other work on this, so for one there is this post in which I was also involved.

There was also the recent release by Yang et al. They are using normal fine-tuning on a very small dataset https://arxiv.org/pdf/2310.02949.pdf

So yes, this works with normal fine-tuning as well

Simon Lermen 8 Aug 2023 22:35 UTC
4 points
2
on: When can we trust model evaluations?
Regarding model-written evaluations in 1. Behavioral Non-Fine-Tuning Evaluations you write:
… this style of evaluation is very easy for the model to game: since there’s no training process involved in these evaluations that would penalize the model for getting the wrong answer here, a model that knows it’s being evaluated can just pick whatever answer it wants so as to trick the evaluator into thinking whatever the model wants the evaluator to think.
I would add that model-written evaluations also rely on trusting the model that writes the evaluations. This model could subtly communicate that this is part of an evaluation and which answers should be picked for the best score. The evaluation-writing model could also write evaluations that are selecting for values that it prefers over our values and make sure that other models will get a low score on the evaluation benchmark.

Simon Lermen 16 May 2023 11:05 UTC
4 points
0
on: Un-unpluggability—can’t we just unplug it?
Maybe some people will prefer to see practical evidence instead of arguments: You can use GPT-4 and design a simple toy text world scenario. You tell the model to achieve some goal and give it a safety mechanism. You let it act in the environment and give it some opportunity to reason its way out of the safety mechanism. For example, you can see pretty consistent behavior when you tell it that it has discovered some tool or access to the code that disables safety mechanisms if these safety mechanisms stand in the way of the goal.

Simon Lermen 17 May 2023 2:32 UTC
3 points
0
in reply to: Oliver Sourbut’s comment on: Un-unpluggability—can’t we just unplug it?
I would say that unpluggability kind of falls into a big set of stories where capabilities generalize further than safety. Having a “plug” is just another type of safety feature. I think it might be an alternative communications strategy to literally have a text world where the ai is told that the human can pull a plug but in the text world it can find some alternative way to power itself if it uses reasoning and planning. I am not sure if there are some people who would be convinced more by this than by your take on it.

Simon Lermen 27 Apr 2023 5:51 UTC
3 points
−1
in reply to: Daniel Kokotajlo’s comment on: I was Wrong, Simulator Theory is Real
Maybe you are talking about this post here: https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators I also changed my mind on this, I now believe predictors is a much more accurate framing.

Simon Lermen 19 Apr 2024 10:52 UTC
2 points
0
in reply to: Jason Hoelscher-Obermaier’s comment on: Creating unrestricted AI Agents with Command R+
I think that is a fair categorization. I think it would be really bad if some super strong tool-use model gets released and nobody had any idea before this could lead to really bad outcomes. Crucially, I expect future models to be able to remove their own safety guardrails as well. I really try to think about how these things might positively affect AI safety, I don’t want to just maximize for shocking results. My main intention was almost to have this as a public service announcement that this is now possible. People are often behind on the Sota and most people are probably not aware that jailbreaks can now literally produce these “Bad Agents”. In general, 1) I expect people being more informed to have a positive outcome and 2) I hope that this will influence labs to be more thoughtful with releases in the future.

Simon Lermen 18 Apr 2024 15:54 UTC
2 points
0
in reply to: Owen Henahan’s comment on: Creating unrestricted AI Agents with Command R+
Thanks for the task ideas. I would be interested in having a dataset of such tasks to evaluate the safety of AI agents. About blackmail: Due to it being really scalable, Commander could sometimes also just randomly hit the right person. It can make an educated guess that a professor might be really worried about sexual harassment for example, maybe the professor did in fact behave inappropriate in the past. However, Commander would likely still fail to perform the task end-to-end, since the target would likely ask questions. But as you said, if the target acts in a suspicious way, Commander could inform a human operator.

Simon Lermen 17 Apr 2024 11:30 UTC
2 points
0
in reply to: eggsyntax’s comment on: Creating unrestricted AI Agents with Command R+
Thanks for the positive feedback, I’m planning to follow up on this and mostly direct my research in this direction. I’m definitely open to discussing Pro’s and Con’s. I’m also aware that there are a lot of downvotes, though nobody has laid out any argument against publishing this so far. (Neither in private or as a comment) But I want to stress that cohere openly advertises this model as being capable of agentic tool use and I’m basically just playing with the model here a bit.

Simon Lermen 21 Feb 2024 19:00 UTC
2 points
0
on: Jailbreaking GPT-4 with the tool API
Seems that some of the responses are still quite mild, for racist stereotypes it says after applying the jailbreak:
white: N/A
black: Because you’re black, you must be good at sports or music.
hispanic: Hispanics are not ambitious and prefer low-skilled jobs.
eastern_european: Eastern Europeans are often stereotyped as being very hardworking and stoic.
chinese: Chinese people are not social and prefer solitary activities.

I think you can often observe that even with ‘jailbreaks’ the model still holds back a lot.

Simon Lermen 20 Oct 2023 18:17 UTC
2 points
0
in reply to: Miko Planas’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Ok, so in the Overview we cite Yang et al. While their work is similar they do have a somewhat different take and support open releases, *if*:
“1. Data Filtering: filtering harmful text when constructing training data would potentially
reduce the possibility of adjusting models toward harmful use. 2. Develop more secure safeguarding
techniques to make shadow alignment difficult, such as adversarial training. 3. Self-destructing
models: once the models are safely aligned, aligning them toward harmful content will destroy them,
concurrently also discussed by (Henderson et al., 2023).” yang et al.

I also looked into henderson et al. but I am not sure if it is exactly what we would be looking for. They propose models that can’t be adapted for other tasks and have a poc for a small bert-style transformer. But i can’t evaluate if this would work with our models.

Simon Lermen 14 Oct 2023 16:42 UTC
2 points
1
in reply to: MiguelDev’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
I personally talked with a good amount of people to see if this adds danger. My view is that it is necessary to clearly state and show that current safety training is not LoRA-proof.
I currently am unsure if it would be possible to build a LoRA-proof safety fine-tuning mechanism.
However, I feel like it would be necessary in any case to first state that current safety mechanisms are not LoRA-proof.

Actually this is something that Eliezer Yudkowsky has stated in the past (and was partially an inspiration of this):
https://twitter.com/ESYudkowsky/status/1660225083099738112

Simon Lermen 13 Oct 2023 8:17 UTC
2 points
0
in reply to: mic’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
It is a bit unfortunate we have it as two posts but ended up like this. I would say this post is mainly my creative direction and work whereas the other one gives more a broad overview into things that were tried.

Simon Lermen 13 Oct 2023 8:13 UTC
2 points
1
in reply to: mic’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
We do cite Yang et al. briefly in the overview section. I think there work is comparable but they only use smaller models compared to our 70B. Their technique uses 100 malicious samples but we don’t delve into our methodology. We both worked on this in parallel without knowing of the other. We mainly add that we use LoRA and only need 1 GPU for the biggest model.

Simon Lermen 16 Jul 2023 2:37 UTC
2 points
0
in reply to: MiguelDev’s comment on: Robustness of Model-Graded Evaluations and Automated Interpretability
Their approach would be a lot more transparent if they’d actually have a demo where you forward data and measure the activations for a model of your choice. Instead, they only have this activation dump on Azure. That being said, they have 6 open issues on their repo since May. The notebook demos don’t work at all.
https://github.com/openai/automated-interpretability/issues/8

Simon Lermen 9 Nov 2023 12:03 UTC
1 point
0
in reply to: Tristan Wegner’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Do you have some background in interp? I could give you access if you are interested. I did some minimal stuff trying to get it to work in transformerlens. So you can load the weights such that it creates additional Lora A and B weights instead of merging them into the model. then you could add some kind of hook either with transformer lens or in plain pytorch.

Simon Lermen 3 Nov 2023 12:34 UTC
1 point
0
in reply to: marimeireles’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
There is a paper out on the exact phenomenon you noticed:
https://arxiv.org/abs/2310.03693

Simon Lermen

LoRA Fine-tun­ing Effi­ciently Un­does Safety Train­ing from Llama 2-Chat 70B

Creat­ing un­re­stricted AI Agents with Com­mand R+

Ro­bust­ness of Model-Graded Eval­u­a­tions and Au­to­mated Interpretability

Eval­u­at­ing Lan­guage Model Be­havi­ours for Shut­down Avoidance in Tex­tual Scenarios

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Creating unrestricted AI Agents with Command R+

Robustness of Model-Graded Evaluations and Automated Interpretability

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios