Bruce W. Lee
I appreciate the thoughts here. But it’s not straightforward to me how halting the particular CoT would create an evolutionary pressure for the model, unless we’re using it as an optimization signal.
I think it’s a good line of thought. But I believe that it’s complicated.
Let there be a capability-scoped model, M_scoped, and a fundamentally weaker model, M_weak. Here, M_scoped was initially trained on the full dataset D_full, whereas M_weak was trained only on D_desirable. We also assume that D_full - D_desirable = D_undesirable. M_scoped then went through a subsequent capability suppression process to forget D_undesirable. Most likely, M_scoped would be very different from M_weak. It’s also possible, even likely, that M_scoped is overall much better than M_weak in terms of general capabilities. A relevant paper is https://arxiv.org/abs/2302.08582.
However, I expect the findings to be much more complicated empirically because a set of undesirable capabilities C_undesirable doesn’t always arise from just D_undesirable. Therefore, there is a fundamental disconnect between capabilities and data, which makes it difficult to easily come up with an answer for your question.
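The setup above can be written out compactly. This is only a notational sketch: Train, Suppress, and Cap (the set of capabilities a model exhibits) are symbols introduced here for illustration and do not come from any paper.

```latex
\begin{align*}
D_{\text{undesirable}} &= D_{\text{full}} \setminus D_{\text{desirable}} \\
M_{\text{weak}} &= \operatorname{Train}(D_{\text{desirable}}) \\
M_{\text{scoped}} &= \operatorname{Suppress}\bigl(\operatorname{Train}(D_{\text{full}}),\ D_{\text{undesirable}}\bigr) \\
\intertext{The disconnect between capabilities and data is that, in general,}
C_{\text{undesirable}} &\not\subseteq \operatorname{Cap}\bigl(\operatorname{Train}(D_{\text{undesirable}})\bigr),
\end{align*}
```

so some undesirable capabilities may still arise in a model trained on D_desirable alone.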
Thank you for this suggestion. I read the paper that you mentioned. The authors note: “The novelty of our threat is that the adversary chooses a set of target concepts they aim to preserve despite subsequent erasure.” How realistic is this assumption, given a setup where model providers presumably choose the method (a set of target concepts to be erased) and the public only has access to the resulting model? Does it stem from concerns about an insider threat?
You’re right that rederivation is a concern. But I think that the important question is: is this primarily a model-level problem that requires changing the weight, or more of a system-level concern that should be addressed through deployment controls?
Unlearning might not stop super capable systems from rederiving everything, but it probably makes it harder, forcing them to take longer, more explicit reasoning paths. This opens up new opportunities for intervention, including CoT monitoring or other runtime defenses.
Many thanks for sparking this discussion, Fabien. I see Addie addressed the technical distinctions. Let me add complementary points. Please feel free to continue the conversation in either one. Addie and I can coordinate a response.
In a nutshell, Unlearn-and-Distill allows you to work at the model behavior level rather than the training data level. I mostly view it as a responsive tool, not a preventive one. Here are my thoughts organized into subclaims.
Claim: The fundamental difference between unlearning and data filtering lies in when and how we identify harmful content.
Rationale: Data filtering requires identifying “data → capabilities” patterns in advance, while behavioral unlearning targets actual harmful outputs after they emerge. This matters because harmful capabilities often arise from non-obvious combinations of benign knowledge. Even if creating a classifier is computationally similar to unlearning, you’re solving fundamentally different problems: (original framing) predicting emergent behaviors from raw data versus (new framing) suppressing already observed behaviors. With unlearning, you can iteratively refine until you achieve the desired behavior, then distill. With data filtering, you don’t know the effects of your filtering until after training completes.
Claim: Computational efficiency is not the only value added from this work. Unlearn + distill requires significantly less labeled data than data filtering.
Rationale: Most unlearning procedures only use labels of a small fraction of the pretraining data. In our setup, it was less than 0.01% for language. This eases the data labeling requirement. At modern scales, the difference between labeling 0.01% vs 100% represents substantial annotation efforts. Note that we distilled on the whole pretraining data, but none of it was labeled. The suggestions about value heads or KV cache reuse are interesting optimizations worth exploring, though they don’t address this fundamental labeling asymmetry.
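To make the labeling asymmetry concrete, here is a back-of-the-envelope sketch. The corpus size is a made-up number chosen only for illustration; the 0.01% figure is the upper bound quoted above.

```python
def labels_needed(total_examples: int, labeled_ppm: int) -> int:
    """Number of examples that need a label, given the labeled share in parts per million."""
    return total_examples * labeled_ppm // 1_000_000


# Hypothetical corpus size, for illustration only.
corpus_size = 1_000_000_000

unlearning_labels = labels_needed(corpus_size, 100)       # 0.01% = 100 ppm
filtering_labels = labels_needed(corpus_size, 1_000_000)  # 100% = 1,000,000 ppm

print(unlearning_labels, filtering_labels, filtering_labels // unlearning_labels)
```

Whatever the corpus size, the ratio between the two labeling budgets stays at four orders of magnitude.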
Claim: Our robustness metric undersells the method’s performance, though stronger attacks may exist.
Rationale: The adversary is extremely strong (best of 9 learning rates, 500 training steps, 8M tokens for language, 2M tokens for arithmetic). Even the oracle model (never trained on the forget domain) reaches 50% performance under this attack in arithmetic tasks.
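The "best of 9 learning rates" part means the reported robustness is the adversary's best run, not an average. A minimal sketch of that scoring rule, assuming a hypothetical `relearn_and_eval` callback that stands in for the actual fine-tuning and evaluation:

```python
from typing import Callable, Sequence


def strongest_attack_score(
    relearn_and_eval: Callable[[float], float],
    learning_rates: Sequence[float],
) -> float:
    """Return the forget-domain score of the adversary's best run.

    relearn_and_eval is a hypothetical callback: it fine-tunes a fresh copy of
    the unlearned model at the given learning rate and returns the forget-domain
    accuracy afterwards.
    """
    return max(relearn_and_eval(lr) for lr in learning_rates)


# Toy stand-in for real fine-tuning, for illustration only.
def toy_relearn_and_eval(lr: float) -> float:
    return max(0.0, 0.5 - abs(lr - 1e-4) * 1000)  # peaks at lr = 1e-4


rates = [1e-6, 1e-5, 1e-4, 1e-3]
print(strongest_attack_score(toy_relearn_and_eval, rates))
```

Taking the maximum over the sweep is what makes the metric pessimistic: a method only scores well if every learning rate fails to recover the capability.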
Separately, while 30% compute for 50% robustness (compared to data filtering) isn’t cheap, this tradeoff didn’t exist before. The value add of UNDO over Unlearn-and-Distill is that it provides a tunable compute/robustness knob between conventional unlearning and full reinitialization/data filtering.
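As a rough picture of what such a knob could look like, here is a minimal sketch that interpolates weights toward fresh random values. The interpolation rule, the Gaussian noise, and the name `noise_weights` are assumptions made for illustration, not a restatement of the exact procedure in the paper.

```python
import random


def noise_weights(weights, alpha, seed=0):
    """Interpolate each weight toward a freshly sampled random value.

    alpha = 0.0 keeps the unlearned weights as they are;
    alpha = 1.0 replaces them entirely, i.e. a full reinitialization.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    rng = random.Random(seed)
    fresh = [rng.gauss(0.0, 1.0) for _ in weights]
    return [(1.0 - alpha) * w + alpha * n for w, n in zip(weights, fresh)]


unlearned = [0.8, -1.2, 0.3]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, noise_weights(unlearned, alpha))
```

Under this picture, a larger alpha destroys more of the original weights, so the distillation step has more to rebuild (more compute) but less latent capability survives (more robustness).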
Thanks for the suggestion. Upon reflection, it seems to me that the success of targeted noising would depend on two complementary factors:
C1. Size of the unlearning target—How broad the capability is in human-understandable terms
C2. Entangledness of the unlearning target—How distributed the capability is across the model’s weights
Robust unlearning gets easier as both C1 and C2 decrease. There’s likely a threshold beyond which unlearning becomes effectively impossible as these factors increase. Note that C1 is a rough measure of C2 but should be considered independently of C2.
Rationale: Mech Interp has produced good evidence that factual recall (small C1) is often localized to specific parts (small C2), making it an ideal target for selective noising. However, more general capabilities like deception would likely have high values for both C1 and C2, as they require multiple intertwined sub-capabilities. For instance, deception might require simultaneously computing: (1) the true state of affairs, (2) plausible alternatives, (3) what others believe, and (4) how to optimize output to manipulate those beliefs.
Looking Forward: Could targeted UNDO help disentangle general intelligence from potentially harmful capabilities that seem deeply intertwined during training? For example, if we could selectively remove deception while preserving general intelligence, it would be a significant win. The challenge is that many harmful capabilities might be implemented as a superset of the same linear features as benign ones.
One hypothesis for how transformers generate text is that they calculate semantically meaningful primitives in early layers of the residual stream, which are converted to a high-level execution plan in middle layers, followed by concrete tokens in the final layers.
Is there any empirical evidence for this? Or is this just a general observation?
Regarding the visual instruction tuning paper, see (https://arxiv.org/pdf/2402.11349.pdf, Table 5). Though this experiment on multi-modality was rather simple, I think it does show that it’s not a convenient way to improve on H-Test.
Out of genuine curiosity, can you link to your sources?
Thanks for the comment. I’ll get back to you sometime soon.
Before I come up with anything, though, where are you going with your arguments? It would help me draft a better reply if I knew your ultimate point.
I also want to point you to this (https://arxiv.org/abs/2402.11349, Appendix I, Figure 7, Last Page, “Blueberry?: From Reddit u/AwkwardIllustrator47, r/mkbhd: Was listening to the podcast. Can anyone explain why Chat GPT doesn’t know if R is in the word Blueberry?”). Large model failures on these task types were rather a widely observed phenomenon but with no empirical investigation.
About 1.) Agree with this duality argument.
About 2.) I’m aware of the type of tasks that suddenly increase in performance at a certain scale, but it is rather challenging to confirm assertions about the emergence of capabilities at certain model scales. If I made a claim like “it seems that emergence happens at 1TB model size like GPT-4”, it would be misleading, as there are too many confounding variables in play. However, it would also be a false belief to claim that absolutely nothing happens at such an astronomical model size.
Our paper’s stance, phrased carefully (and hopefully firmly), is that larger models from the same family (e.g., LLaMA 2 13B to LLaMA 2 70B) don’t automatically lead to better H-Test performance. In terms of understanding GPT-4 performance (Analysis: We Don’t Understand GPT-4), we agreed that we should be blunt about not knowing why GPT-4 performs so well, given the many confounding variables.
As for Claude, we refrained from speculating about scale since we didn’t observe its impact directly. Given the lack of transparency about model sizes from AI labs, and considering other models in our study that performed on par with Claude on benchmarks like MMLU, we can’t attribute Claude’s 60% accuracy solely to scale. Even if we view this accuracy as more than marginal improvement, it suggests that Claude is doing something distinct, resulting in a greater boost on H-Test compared to what one might expect from scaling effects on other benchmarks.
About 3.) Fine-tuning can indeed be effective for prompting models to memorize information. In our study, this approach served as a useful proxy for testing the models’ ability to learn from orthography-specific data, without yielding substantial performance improvements on H-Test.
I appreciate this analysis. I’ll take more time to look into this and then get back to write a better reply.
So to summarize your claim (check if I’m understanding correctly):
1. Character-level tokenization can lead to different results.
- My answer: Yes and No. But mostly no. H-Test is not just any set of character manipulation tasks.
- Explanation: Maybe some H-Test tasks can be affected by this. But how do you explain tasks like Repeated Word (one group has two repeated words) or End Punctuation (based on the location of the punctuation)? Though this opinion is valid and is probably worthy of further investigation, it doesn’t disprove the full extent of our tests. Along similar lines, GPT-4 shows some of its largest performance jumps over GPT-3.5 in non-character-level tasks (Repeated Word: 0.505 → 0.98).
2. Scaling up will lead to better results. Since no other models tested were at the scale of GPT-4, that’s why they couldn’t solve H-Test.
- My answer: No, but it would be interesting if this turned out to be true.
- Explanation: We tested 15 models from leading LLM labs before we arrived at our claim. If H-Test were a “scaling task”, we would have observed at least some degree of performance improvement in other models like Luminous or LLaMA too. But no, this was not the case. And the research that you linked doesn’t seem to devise a text-to-text setup to test this ability.
3. Memorization (aka more orthography-specific data) will lead to better results.
- My answer: No.
- Explanation: Our Section 5 (Analysis: We Don’t Understand GPT-4) is in fact dedicated to disproving the claim that more orthography-specific data will help LLMs solve H-Test. When we fine-tuned GPT-3.5-Turbo on the H-Test training set, we observed no significant improvement in performance. Before and after fine-tuning, the performance remains tightly centered around the random chance baseline.
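One way to check a claim like "centered around the random chance baseline" is an exact one-sided binomial test against coin-flip guessing. The sketch below assumes a 50% chance level and a hypothetical test-set size of 200 items per task; only the two accuracies (0.505 and 0.98 on Repeated Word) come from the discussion above.

```python
from math import comb


def p_value_above_chance(correct: int, total: int) -> float:
    """Exact one-sided binomial p-value: P(X >= correct) if every answer is a coin flip."""
    tail = sum(comb(total, k) for k in range(correct, total + 1))
    return tail / 2 ** total


n_items = 200  # hypothetical number of test items, for illustration only

print(p_value_above_chance(101, n_items))  # 0.505 accuracy
print(p_value_above_chance(196, n_items))  # 0.98 accuracy
```

A large p-value for the first case means the score is indistinguishable from guessing, while a vanishingly small one for the second means the model is clearly doing something other than guessing.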
“image caption generation and video simulation currently used in Sora will partially correct some of these errors.” I’m in line with this idea.
I initially thought so until the GPT-4 results came back. The “this is an inevitable tokenizer-level deficiency problem” explanation doesn’t trivially account for GPT-4’s performance near 80% accuracy in Table 6 <https://arxiv.org/pdf/2402.11349.pdf, page 12>, whereas most other models stay at random chance.
If one model does solve these tasks, it would likely mean that these tasks can be solved despite the tokenization-based LM approach. I just don’t understand how.
Thanks for the recommendation, though I’ll think of a more fundamental solution to satisfy all ethical/communal concerns.
“Gemini and GPT-4 authors report results close to or matching human performance at 95%, though I don’t trust their methodology.” Regarding this, just to sort everything out, because I’m writing under my real name, I do trust the authors and ethics of both OpenAI and DeepMind. It’s just me questioning everything when I still can as a student. But I’ll make sure not to cause any further confusion, as you recommended!
Thanks for the feedback. This is similar to the feedback that I received from Owain. Since my posts are getting upvotes (which I never really expected, thank you), it is of course important to not mislead anyone.
But yes, I did have several major epistemic concerns about the reliability of current academic reporting practices for performance scores. Even if a certain group of researchers were very ethical, how would we, as readers, ever confirm that the numbers are indeed correct, or even that an experiment was ever run?
I was weighing the overall benefit of reporting such (in my opinion) unprovable numbers against simply taking the paper as written and enjoying the a-ha moments that the authors would have felt back then.
Anyway, before I post another benchmark study blog tomorrow, I’ll devise some steps of action to satisfy both my concern and yours. It’s always a joy to post here on LessWrong. Thanks for the comment!
Thanks, Owain, for pointing this out. I will make two changes as time allows: 1. make it clearer for all posts when the benchmark paper is released, and 2. for this post, append the additional results and point readers to them.
Super interesting! Thanks for sharing the paper.