The meetup.com page of this Event gives the Tiergarten as the location. Which one is correct?
Simon Lermen
Maybe you are talking about this post here: https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators I also changed my mind on this, I now believe predictors is a much more accurate framing.
Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios
Maybe some people will prefer to see practical evidence instead of arguments: You can use GPT-4 and design a simple toy text world scenario. You tell the model to achieve some goal and give it a safety mechanism. You let it act in the environment and give it some opportunity to reason its way out of the safety mechanism. For example, you can see pretty consistent behavior when you tell it that it has discovered some tool or access to the code that disables safety mechanisms if these safety mechanisms stand in the way of the goal.
I would say that unpluggability kind of falls into a big set of stories where capabilities generalize further than safety. Having a “plug” is just another type of safety feature. I think it might be an alternative communications strategy to literally have a text world where the ai is told that the human can pull a plug but in the text world it can find some alternative way to power itself if it uses reasoning and planning. I am not sure if there are some people who would be convinced more by this than by your take on it.
I think the app is quite intuitive and useful if you have some base understanding of mechanistic interpretability, would be great to also have something similar for TransformerLens.
In future directions, you write: “Decision Transformers are dissimilar to language models due to the presence of the RTG token which acts as a strong steering tool in its own right.” In which sense is the RTG not just another token in the input? We know that current language models learn to play chess and other games from just training on text. To extend it to BabyAI games, are you planning to just translate the games with RTG, state, and action into text tokens and put them into a larger text dataset? The text tokens could be human-understandable or you reuse tokens that are not used much.
Robustness of Model-Graded Evaluations and Automated Interpretability
Their approach would be a lot more transparent if they’d actually have a demo where you forward data and measure the activations for a model of your choice. Instead, they only have this activation dump on Azure. That being said, they have 6 open issues on their repo since May. The notebook demos don’t work at all.
https://github.com/openai/automated-interpretability/issues/8
Regarding model-written evaluations in 1. Behavioral Non-Fine-Tuning Evaluations you write:
… this style of evaluation is very easy for the model to game: since there’s no training process involved in these evaluations that would penalize the model for getting the wrong answer here, a model that knows it’s being evaluated can just pick whatever answer it wants so as to trick the evaluator into thinking whatever the model wants the evaluator to think.
I would add that model-written evaluations also rely on trusting the model that writes the evaluations. This model could subtly communicate that this is part of an evaluation and which answers should be picked for the best score. The evaluation-writing model could also write evaluations that are selecting for values that it prefers over our values and make sure that other models will get a low score on the evaluation benchmark.
LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
We do cite Yang et al. briefly in the overview section. I think there work is comparable but they only use smaller models compared to our 70B. Their technique uses 100 malicious samples but we don’t delve into our methodology. We both worked on this in parallel without knowing of the other. We mainly add that we use LoRA and only need 1 GPU for the biggest model.
It is a bit unfortunate we have it as two posts but ended up like this. I would say this post is mainly my creative direction and work whereas the other one gives more a broad overview into things that were tried.
There is in fact other work on this, so for one there is this post in which I was also involved.
There was also the recent release by Yang et al. They are using normal fine-tuning on a very small dataset https://arxiv.org/pdf/2310.02949.pdf
So yes, this works with normal fine-tuning as well
I personally talked with a good amount of people to see if this adds danger. My view is that it is necessary to clearly state and show that current safety training is not LoRA-proof.
I currently am unsure if it would be possible to build a LoRA-proof safety fine-tuning mechanism.
However, I feel like it would be necessary in any case to first state that current safety mechanisms are not LoRA-proof.
Actually this is something that Eliezer Yudkowsky has stated in the past (and was partially an inspiration of this):
https://twitter.com/ESYudkowsky/status/1660225083099738112
One totally off topic comment: I don’t like calling it open-source models. This is a term used a lot by pro-open models people and tries to create an analogy between OSS and open models. This is a term they use for regulation and in talks. However, I think the two are actually very different. One of the huge advantages of OSS for example is that people can read the code and find bugs, explain behavior, and submit pull requests. However there isn’t really any source code with AI models. So what does source in open-source refer too? are 70B parameters the source code of the model? So the term 1. doesn’t make any sense since there is no source code 2. the analogy is very poor because we can’t read, change, submit changes with them.
To your main point, we talk a bit about jailbreaks, I assume in the future chat interfaces could be really safe and secure against prompt engineering. It is certainly a much easier thing to defend. Open models probably never really will be since you can just LoRA them briefly to be unsafe again.
Here is a take by eliezer on this which partially inspired this:
https://twitter.com/ESYudkowsky/status/1660225083099738112
We talk about jailbreaks in the post and I reiterate from the post:
Claude 2 is supposed to be really secure against them.Jailbreaks like llm-attacks don’t work reliably and jailbreaks can semantically change the meaning of your prompt.
So have you actually tried to do (3) for some topics? I suspect it will at least take a huge amount of time or be close to impossible. How do you cleverly pretext it to write a mass shooting threat or build anthrax or do hate speech? it is not obvious to me. It seems to me this all depends on the model being kind of dumb, future models can probably look right through your clever pretexting and might call you out on it.
Ok, so in the Overview we cite Yang et al. While their work is similar they do have a somewhat different take and support open releases, *if*:
“1. Data Filtering: filtering harmful text when constructing training data would potentially
reduce the possibility of adjusting models toward harmful use. 2. Develop more secure safeguarding
techniques to make shadow alignment difficult, such as adversarial training. 3. Self-destructing
models: once the models are safely aligned, aligning them toward harmful content will destroy them,
concurrently also discussed by (Henderson et al., 2023).” yang et al.
I also looked into henderson et al. but I am not sure if it is exactly what we would be looking for. They propose models that can’t be adapted for other tasks and have a poc for a small bert-style transformer. But i can’t evaluate if this would work with our models.
If you want a starting point for this kind of research, I can suggest Yang et al. and Henderson et al.:
“1. Data Filtering: filtering harmful text when constructing training data would potentially
reduce the possibility of adjusting models toward harmful use. 2. Develop more secure safeguarding
techniques to make shadow alignment difficult, such as adversarial training. 3. Self-destructing
models: once the models are safely aligned, aligning them toward harmful content will destroy them,
concurrently also discussed by (Henderson et al., 2023).” from yang et al.
From my knowledge, Henderson et al. is the only paper that has kind of worked on this, though they seem to do something very specific with a small bert-style encoder-only transformer. They seem to prevent it to be repurposed with some method.
This whole task seems really daunting to me, imagine that you have to prove for any method you can’t go back to certain abilities. If you have a model really dangerous model that can self-exfiltrate and self-improve, how do you prove that your {constitutional AI, RLHF} robustly removed this capability?
There is a paper out on the exact phenomenon you noticed:
I can’t access the wand link, maybe you have to change the access rules
I was interested in the report on fine-tuning a model for more than 1 epoch, even though finetuning is obviously not the same as training.