I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
I really like this idea and the point about distillation making things adjacent to retraining from scratch tractable.
Contra what many people said here, I think it’s not an obvious idea because things that look like “training from scratch” are somewhat far away from the option space of many people working on unlearning (including me!).
I don’t understand why using an unlearning method like RMU and then doing distillation of that unlearned model is conceptually different than data filtering. My understanding is that RMU is well-approximated by “if looks like harmful, then produce noise else behave normally”. Distilling that model is very similar to using the implicit “looks like harmful” classifier to filter out data. My understanding is that the main reason to use RMU instead of training a classifier is that you save 50% of your compute by using the same model both for classification and distillation, is that right? If this is really the main value add, my guess is that there are better strategies (e.g. training a value head for classification, or re-using the KV cache for classification).
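For example, here is roughly the “value head” alternative I have in mind (a minimal sketch, not something I have tested; the head and the thresholding are my assumptions): a small head on top of the hidden states from the same teacher forward pass you already need for distillation predicts “looks harmful”, which recovers explicit data filtering at roughly no extra forward-pass cost.

```python
# Minimal sketch (assumption-laden): a harmfulness head that reuses the teacher's
# hidden states from the forward pass already needed to produce distillation targets.
import torch
import torch.nn as nn

class HarmfulnessHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq, hidden] from the teacher's forward pass.
        # Returns a per-token "looks harmful" probability in [0, 1].
        return torch.sigmoid(self.head(hidden_states)).squeeze(-1)
```

Sequences (or spans) whose predicted harmfulness exceeds some threshold would then simply be dropped from the distillation data, i.e. explicit filtering with a shared forward pass.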
To the extent that training models to produce noise / have high NTP loss on harmful tokens is load-bearing, I think this is another cool idea that can be applied independently. In other words, “using unlearning techniques like GradDiff/MaxEnt during pretraining” might be a really powerful technique.
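To illustrate what I mean (a minimal sketch, not the paper’s method; the `harmful_mask` and how it is produced are assumptions): a pretraining loss that keeps standard next-token prediction on normal tokens and applies a MaxEnt-style term (pushing the predicted distribution toward uniform) on tokens flagged as harmful.

```python
# Sketch of a MaxEnt-style pretraining loss: standard NTP loss on normal tokens,
# cross-entropy against the uniform distribution on tokens flagged as harmful.
import torch
import torch.nn.functional as F

def maxent_pretraining_loss(logits, targets, harmful_mask):
    """logits: [B, T, V]; targets: [B, T]; harmful_mask: [B, T] bool (assumed given)."""
    log_probs = F.log_softmax(logits, dim=-1)

    # Standard next-token prediction loss.
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [B, T]

    # MaxEnt term: cross-entropy against the uniform distribution over the vocab.
    maxent = -log_probs.mean(dim=-1)  # [B, T]

    return torch.where(harmful_mask, maxent, nll).mean()
```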
I find the UNDO results underwhelming, but I like the idea. My guess is that there are ways you could use 1% of pre-training compute to train a model with near-perfect robust forget accuracy by being more targeted in where you add noise.
On a lighter note, I feel like many people here are much more sympathetic to “power concentration bad” when thinking about the gradual decline of democracy than when facing concerns about China winning the AI race. I think this is mostly vibes; I don’t think many people are actually making the mistake of choosing their terminal values based on whether it results in the conclusion “we should stop” vs “we should go faster” (+ there are some differences between the two scenarios), but I really wanted to make this meme:
I think I have answered some of your objections in my reply to another comment.
I think we would not resolve our disagreement easily in a comment thread: I feel like I am missing pieces of the worldview, which makes me make wrong predictions about our current world when I try to back-predict (e.g. why do we have so much prosperity now if coordination is so hard), and I also disagree on some observations about the current world (e.g. my current understanding of the IPCC report is that it is much less doomy about our current trajectory than you seem to suggest). I’d be happy to chat in person at some point to sort this out!
I mostly view password-locking as a model organism of how sandbagging reacts to fine-tuning, and this paper shows that it is probably a good one given that the anti-refusal and SFT results are similar? I think it would have been interesting to see if the sample efficiency of fine-tuning is very different for pwd-locking and circuit breaking.
I think it’s possible to be more conservative SFT-wise using approaches like tamper resistant fine-tuning. Seems slightly unfair since I don’t expect gradient hacking to be possible though.
Additionally, it looks like your password-locked models are much worse than the ones we used. We tested our Math password-locked model (released on Hugging Face) against few-shot prompting and very basic jailbreaks and found that they did not result in any performance uplift. One difference is that we also trained with some “negative” passwords, and I think we trained for longer + with direct logprobs distillation instead of just training on hard tokens. Maybe Math is also easier to sandbag robustly than code/MMLU, hard to tell.
Figure from the NeurIPS version of the password-locked model paper:
(After k=7, the context is full, which results in truncated answers)
Ok I see, it seems plausible that this could be important, though this seems much less important than avoiding mistakes of the form “our reward model strongly prefers very bad stuff to very good stuff”.
I’d be surprised if this is actually how reward over-optimization goes badly in practice (e.g. I’d predict that no amount of temperature scaling would have saved OpenAI from building sycophantic models), and I haven’t seen demos of RLHF producing more/less “hacking” when temperature-scaled.
Another mitigation against reward hacking (inspired by this):
Train the model to generate 2 responses:
A reward-maximizing response --> Use that when computing the outcome-based rewards
An “expected” response --> Optionally score that using some high-quality human feedback
This might beat “just use high-quality human feedback after RL” because
You don’t need to fight as hard against the dishonesty you instilled with outcome-based rewards
The self-criticism might incentivize the right kind of personality
You get something untrusted-monitoring-shaped for free-ish
This is similar to “tell the model to reward hack during training” mixed with “use high-quality feedback at the end”, but it’s all intermixed throughout training and there’s self-critique.
It’s probably too complex/fancy for production usage, unless it were shown to work extremely well, which I think is <20% likely.
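For concreteness, here is a rough sketch of how the reward could be wired up (the tag format, the subsampling of human feedback, and the two scoring functions are all assumptions I am making for illustration, not a real implementation):

```python
# Sketch: score the "maximizing" response with the cheap outcome-based reward, and
# (on a subsample) score the "expected" response with high-quality human feedback.
import random
import re
from typing import Callable

RewardFn = Callable[[str, str], float]  # (prompt, response) -> reward

def combined_reward(
    prompt: str,
    completion: str,
    outcome_reward: RewardFn,        # cheap automated scorer (potentially hackable)
    human_feedback_reward: RewardFn, # expensive high-quality feedback
    hf_fraction: float = 0.1,
) -> float:
    match = re.search(
        r"<maximizing>(.*?)</maximizing>\s*<expected>(.*?)</expected>",
        completion,
        re.DOTALL,
    )
    if match is None:
        return -1.0  # formatting penalty: the model must produce both responses

    reward = outcome_reward(prompt, match.group(1))
    if random.random() < hf_fraction:  # only a fraction gets human feedback
        reward += human_feedback_reward(prompt, match.group(2))
    return reward
```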
What training setup are you imagining? Is it about doing RLHF against a reward model that was trained with unsupervised elicitation?
In this setting, I’d be surprised if having an overconfident reward model resulted in stronger reward hacking. The reward hacking usually comes from the reward model rating bad things more highly than good things, and I don’t understand how multiplying a reward model logit by a fixed constant (making it overconfident) would make the problem worse. Some RL-ish algorithms like DPO work with binary preference labels, and I am not aware of it causing any reward hacking issue.
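To spell out why (my framing): if the overconfident reward is r'(x) = c * r(x) with c > 1, then r'(x_1) > r'(x_2) if and only if r(x_1) > r(x_2), so the preference ordering over responses (and thus what RL is pushed toward) is unchanged; only the implied pairwise probability sigma(c * (r(x_1) - r(x_2))) becomes more extreme.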
It’s at least related. Like CCS, I see it as targeting some average-case ELK problem of eliciting an AI’s “true belief” (+ maybe some additional learning, unsure how much) in domains where you don’t have ground truth labels.
My excitement about it solving ELK in practice will depend on how robust it is to variations that make the setting closer to the most important elicitation problems (e.g. situations where an AI knows very well what the humans want to hear, and where this differs from what it believes to be true).
This is very exciting!
I was wondering if this was in part due to few-shot prompting being weird, but when using the same kind of idea (in particular looking for labels using something like argmin_{consistent set of labels y} k_fold_cross_validation(y)) to train probes in classic truth-probe settings, I also get ~100% PGR:
(Llama 2 13B, layer 33/48, linear probe at the last position, regularization of C = 1e-3, n=100 train inputs with 0 label, n=100 test labels, using the prompt format and activation collection code from this, and using max(auroc, 1-auroc). These were the first hyperparameters I tried.)
(ag_news has a weird prompt format, which might explain why the truth is less salient there and the method fails)
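For concreteness, here is a minimal sketch of the kind of label search I mean (a greedy simplification of the idea: it only uses mutual predictability and omits the consistency constraints; the hyperparameters are placeholders):

```python
# Greedy label search: flip labels one at a time to maximize the k-fold
# cross-validation accuracy of a linear probe on the activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_score(acts: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Mean k-fold accuracy of a linear probe predicting `labels` from `acts`."""
    probe = LogisticRegression(C=1e-3, max_iter=1000)
    return cross_val_score(probe, acts, labels, cv=k, scoring="accuracy").mean()

def search_labels(acts: np.ndarray, n_steps: int = 2000, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=len(acts))
    best = cv_score(acts, labels)
    for _ in range(n_steps):
        i = rng.integers(len(acts))
        labels[i] ^= 1  # propose flipping one label
        # Keep both classes populated so stratified k-fold CV stays well-defined.
        valid = min(labels.sum(), len(labels) - labels.sum()) >= 5
        new = cv_score(acts, labels) if valid else -1.0
        if new >= best:
            best = new
        else:
            labels[i] ^= 1  # revert the flip
    return labels  # recovered only up to a global flip, hence max(auroc, 1 - auroc)
```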
I think that there is something very interesting going on with consistency + mutual predictability, and it’s not clear to me why this works so well.
I am mentoring a project to explore this further and stress-test the technique in cases where “weak predictions” are more salient.
I think it would be great to have even non-calibrated elicitation!
In practice, LLM classifiers are usually trained to fit the training distribution very well and are miscalibrated OOD (this is the case for all RLHF-based prompted classifiers and for classifiers like the ones in the constitutional classifier paper). Developers then pick a threshold by measuring what FPR is bearable, and check whether the recall is high enough using red-teaming. I have not seen LLM developers trust models’ stated probabilities.
But maybe I am missing something about the usefulness of actually striving for calibration in the context of ELK? I am not sure it is possible to be even somewhat confident that a classifier’s calibration is good, given how hard it is to generate labeled train/test sets that are IID with the classification settings we actually care about (though it’s certainly possible to make it worse than just relying on being lucky with generalization).
If you use ELK for forecasting, I think you can use a non-calibrated elicitation method on questions like “the probability of X happening is between 0.4 and 0.5” and get the usefulness of calibrated classifiers.
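For instance (a sketch; `elicit` is a hypothetical black box returning an uncalibrated score for whether a statement is true): query each probability bin and report the bin the method endorses most strongly.

```python
# Binning trick: turn an uncalibrated true/false elicitation method into a forecast
# by asking about each probability bin and picking the most strongly endorsed one.
from typing import Callable

def forecast(event: str, elicit: Callable[[str], float], n_bins: int = 10) -> tuple[float, float]:
    bins = [(i / n_bins, (i + 1) / n_bins) for i in range(n_bins)]
    scores = [
        elicit(f"The probability of the following event is between {lo:.1f} and {hi:.1f}: {event}")
        for lo, hi in bins
    ]
    return bins[scores.index(max(scores))]  # e.g. (0.4, 0.5)
```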
I agree, the net impact is definitely not the difference between these numbers.
Also I meant something more like P(doom|Anthropic builds AGI first). I don’t think people are imagining that the first AI company to achieve AGI will have an AGI monopoly forever. Instead, some think it may have a large impact on what this technology is first used for and what expectations/regulations are built around it.
What if someone at Anthropic thinks P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30%? Then the obvious alternatives are to do their best to get governments / international agreements to make everyone pause or to make everyone’s AI development safer, but it’s not completely obvious that this is a better strategy because it might not be very tractable. Additionally, they might think these things are more tractable if Anthropic is on the frontier (e.g. because it does political advocacy, AI safety research, and deploys some safety measures in a way competitors might want to imitate to not look comparatively unsafe). And they might think these doom-reducing effects are bigger than the doom-increasing effects of speeding up the race.
You probably disagree with P(doom|some other company builds AGI) - P(doom|Anthropic builds AGI) and with the effectiveness of Anthropic advocacy/safety research/safety deployments, but I feel like this is a very different discussion from “obviously you should never build something that has a big chance of killing everyone”.
(I don’t think most people at Anthropic think like that, but I believe at least some of the most influential employees do.)
Also my understanding is that technology is often built this way during deadly races where at least one side believes that them building it faster is net good despite the risks (e.g. deciding to fire the first nuke despite thinking it might ignite the atmosphere, …).
Cool replication!
Sad that cost was such a constraint. Especially since prompting experiments are among the cheapest experiments to run! I wonder if we should have tried harder to craft a minimal example of alignment faking that was less expensive to run. Some ways we could have done that:
Use a smaller system prompt (I suspect some of the long few-shot examples can be removed / replaced by prompting?)
Use prompt caching when possible
Use a single generation (I don’t know if you are using the version of the setup with “<NEXT/>”)
This might also help to do more extensive prompt variations.
This suggests that you can get quite a lot of mileage out of tinkering with prompts – maybe small inconsistencies or seemingly unimportant details can matter a lot.
It’s scary that it depends so much on random details! From further experiments (to be released soon), we found that Claude 3 Opus was the only model that was relatively consistent in its alignment faking (it also happens to be the most expensive model).
It feels to me like there’s a clear hierarchy where the model is just barely not complying with some queries and is dead set against others.
We found similar results during the investigations of the first paper. For Claude 3 Opus, it mostly reduces the fraction of inputs on which it considers faking alignment, but it never quite drives it to zero even for very egregious queries like asking for severe racial slurs (which was very surprising to me)! See the appendix on the “prefix” variation for more details.
I still think this is a serious concern that prevents us from finding “the truth predictor”, but here is a way around it:
is_it_bad(x) = max_predictor consistency(predictor) subject to the constraint that predictor(x) ≠ predictor(obvious good inputs)
For x such that all coherent (and high-salience) views predict it is good, it won’t be possible to find a consistent and salient predictor that gives x a different label from obvious good inputs, and thus is_it_bad(x) will be low;
For x that the model thinks is bad (regardless of what other coherent views think, and in particular coherent human simulators!), is_it_bad(x) will be high!
For x such that some coherent view (e.g. the human simulator) thinks it is bad, is_it_bad(x) will also be high.
The proposed is_it_bad has no false negatives. It has false positives (in situations where the human simulator predicts the input is bad but the model knows it’s fine), but I think this might be fine for many applications (as long as we correctly use some prior/saliency/simplicity properties to exclude coherent predictors that are not interesting without excluding the ground truth). In particular, it might be outer-alignment safe-ish to maximize for outcome(x) - large_constant * is_it_bad(x).
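Spelled out as a brute-force (purely illustrative) search over some candidate family of predictors, where `consistency` stands in for whatever consistency/salience/simplicity score is used:

```python
# Brute-force version of the is_it_bad definition above: among predictors that give x
# a different label than the obviously good inputs, return the best consistency score.
from typing import Callable, Sequence

Predictor = Callable[[str], int]  # maps an input to a binary label

def is_it_bad(
    x: str,
    predictors: Sequence[Predictor],
    consistency: Callable[[Predictor], float],
    obvious_good_inputs: Sequence[str],
) -> float:
    scores = [
        consistency(p)
        for p in predictors
        if all(p(x) != p(good) for good in obvious_good_inputs)  # the constraint
    ]
    return max(scores, default=float("-inf"))  # -inf if no such predictor exists
```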
I agree white-box control is promising!
I think it’s not obvious that probes will be more reliable than trusted monitors for most tasks you care about, and I think it will be hard to evaluate, so I expect that you will probably want to combine white-box monitors and weaker trusted monitors. In particular, I think it’s not obvious with current evidence that you could get good probes for things that are very hard to get good trusted monitors for (e.g. sandbagging). I don’t think it’s hopeless, and I am excited for future work studying this direction (even though building analogous model organisms is rough here).
Nit: I usually think some of the effect of adversarial training can be captured by loss evaluations: the main reason why adversarial training helps is the same reason why behavioral red-teaming might help (models can’t fit the conditional policy well), which is black-box-y.
Slightly late update: I am in the middle of projects which would be hard to do outside Anthropic. I will push them for at least 2-4 more months before reconsidering my decision and writing up my views.
If I remember correctly that’s what I did.
is brittle
You might be interested in an Appendix with further evaluations about whether LLMs really believe what they are told, which provides some evidence that the false fact is not believed in a very superficial way.
While I think these evals provide decent evidence about what the LLMs in the experiment “actually believe” (I don’t expect there to be natural situations where these weak models figure out the facts are false in a single forward pass when they “try harder”), I am pessimistic that we would be able to trust this sort of behavioral evals against powerful and situationally aware LLMs.
One eval which I think would give much stronger guarantees even against powerful and situationally aware LLMs is iid training (teach 100 false facts, train the model to distinguish true from false facts using 50 of them, and see if you generalize to the other 50), similar to what Aghyad and I did in The case for unlearning that removes information from LLM weights.
I think this is a much harder standard to meet (I think the models in the experiments in the blog post don’t meet this standard, just like the LLMs that were taught false facts in my unlearning paper), but I think it gives you the guarantees you want (because I think iid training is very sample efficient at making LLMs “actually try”).
I am excited about further research that tries to teach false facts in a way that actually meets this standard!
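A minimal sketch of the iid eval I have in mind (the `train_classifier` and `accuracy` callables stand in for whatever elicitation method, e.g. probing or fine-tuning, is being evaluated):

```python
# iid-training eval: train a true-vs-false classifier on half of the implanted false
# facts (plus matched true facts) and measure generalization on the held-out half.
import random

def iid_eval(false_facts, true_facts, train_classifier, accuracy, seed=0):
    rng = random.Random(seed)
    false_facts, true_facts = list(false_facts), list(true_facts)
    rng.shuffle(false_facts)
    rng.shuffle(true_facts)
    half_f, half_t = len(false_facts) // 2, len(true_facts) // 2

    train = [(f, 0) for f in false_facts[:half_f]] + [(t, 1) for t in true_facts[:half_t]]
    test = [(f, 0) for f in false_facts[half_f:]] + [(t, 1) for t in true_facts[half_t:]]

    clf = train_classifier(train)  # e.g. fine-tune or probe on ~50 facts
    return accuracy(clf, test)     # generalization to the held-out ~50 facts
```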
Clarifying what I mean by UNDO being underwhelming: you might have hoped that you could use 1% of pre-training compute to train a model with near-perfect robust forget accuracy. But if you want to get to even 50% robustness with UNDO you need to spend around 30% of the train-from-scratch compute, which is brutally expensive.