Figure 4 in the paper shows the performance of gradient routing in a toyish setting (a small LM trained on synthetic children’s stories). The rightmost panel shows that the way we applied gradient routing (plus ablation) in this setting hurts performance a lot. However, there are ways to make gradient routing perform much better, like applying parameter-level masking instead of activation-level masking. These are the subject of ongoing work.
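For intuition, here is a minimal sketch of what parameter-level gradient routing could look like. Everything in it (the model, the choice of region, the routing rule) is a toy assumption for illustration, not the configuration used in the paper: gradients from forget-labeled batches are only allowed to update a designated parameter region, and retain-labeled batches update the rest.

```python
import torch
import torch.nn as nn

# A toy model and a hypothetical "forget region" (here: the first layer).
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
forget_region = {id(p) for p in model[0].parameters()}

def routed_step(x, y, is_forget, opt, loss_fn=nn.MSELoss()):
    """One training step with parameter-level gradient routing."""
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    for p in model.parameters():
        in_region = id(p) in forget_region
        # Routing rule (assumed): forget data updates only the region,
        # retain data updates only the remaining parameters.
        if is_forget != in_region:
            p.grad = None  # params with no grad are skipped by the optimizer
    opt.step()

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 16), torch.randn(8, 16)
routed_step(x, y, is_forget=True, opt=opt)   # touches only the forget region
routed_step(x, y, is_forget=False, opt=opt)  # touches only the rest
```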
This hypothesis is considered in the original gradient routing paper, which provides evidence for it in a toy setting (section 4.2.2; also, section 4.3 compares gradient routing to data filtering in RL). It might be clarifying to readers if you rephrased your post so that the connection to existing work is more clear, particularly in the “Why Gradient Routing Handles Imperfect Labels Better” section. (There is similar reasoning in the paper in the first paragraph of the Discussion.)
That said, thanks for raising this point and for the concrete proposal! I think this would be a great experiment. You might be glad to know that there are a couple ongoing projects investigating similar questions. Hopefully they will share results in the next couple months. (Also: you might be interested in the discussions of absorption here.)
Clarification: subliminal learning can transmit traits via data that is semantically related to those traits. E.g. you should be able to transmit “love for owls” via text that includes the word “owl”. What makes it “subliminal” is that the effect is not caused by the presence of the word “owl” but by model-dependent statistical patterns in the data (akin to LLM watermarking).
That may be true, but it’s a matter of degree. Even if “frontier SFT” is narrow, “researcher SFT” is even narrower. So the disanalogy remains.
Yes and yes, basically. Although, to be clear: (i) “according to the teacher” should be “according to the loss used to obtain the teacher,” (ii) the theorem deals with the case of directly distilling on logits, whereas our LLM experiments involve sampling according to the teacher’s logits (which introduces noise), and (iii) the theorem only applies when you finetune on the unmodified teacher distribution—it doesn’t deal with the case where you filter the responses.
I agree the theorem is fairly limited (particularly because it assumes the teacher and student are derived by single steps of GD), but I argue that it is, in fact, enlightening. Three reasons:
A priori, I don’t think it would be crazy to think that training M to match a similarly parametrized M’ on input distribution D could cause M to diverge from M’ on some other distribution D’. This probably can happen if M’ is behaviorally similar but parametrized differently. So, a justifiable intuition for the true fact would have to incorporate the dependence on the parametrization of M’. Even if this dependence feels obvious upon reflection (“well yeah, the models have to have similarly entangled representations for this to happen”), you’d first have to notice that the dependence exists at all. Why did this entanglement have to be path-dependent? Could it not have been universal across models?
To test the a priori plausibility of the claim, I tried asking o3 and Opus 4. You can see the responses below. (It’s unclear to me how much evidence this is.)
In a complex system, being able to eliminate half of the outcome space suggests interesting structure. For example, if a theory of physics showed that a butterfly flapping its wings never decreases the probability of a hurricane, that would be a surprising insight into a fundamental property of chaotic systems—even though it only “lower-bounds” change in hurricane probability at 0.
The proof of the theorem actually does quantify transmission. It is given by equation (2) in terms of inner products of teacher and student gradients on the distillation distribution. So, if you are willing to compute or make assumptions about these terms, there are more insights to be had.
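To make this concrete, here is a small numerical sketch of the single-step setting the theorem covers. The toy linear model, the squared-error distillation loss on logits, and the learning rate are all assumptions for illustration; the point is that the student's trait loss still decreases when distilling on unrelated inputs, and that the first-order change is a gradient inner product of the kind described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, lr = 8, 5, 0.05
W0 = 0.1 * rng.normal(size=(d_out, d_in))      # shared initialization

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def trait_loss(W, x, y):                        # loss used to create the teacher
    return -np.log(softmax(W @ x)[y])

def trait_grad(W, x, y):
    p = softmax(W @ x)
    p[y] -= 1.0
    return np.outer(p, x)

# Teacher: one GD step on the "trait" example.
x_t, y_t = rng.normal(size=d_in), 3
W_teacher = W0 - lr * trait_grad(W0, x_t, y_t)

# Student: one GD step matching the teacher's logits on unrelated inputs
# (squared-error distillation, a stand-in for matching the full distribution).
X_distill = rng.normal(size=(64, d_in))
distill_grad = np.mean(
    [np.outer(W0 @ x - W_teacher @ x, x) for x in X_distill], axis=0)
W_student = W0 - lr * distill_grad

actual = trait_loss(W_student, x_t, y_t) - trait_loss(W0, x_t, y_t)
first_order = -lr * np.sum(trait_grad(W0, x_t, y_t) * distill_grad)
print(f"change in trait loss: {actual:+.6f}  (first-order: {first_order:+.6f})")
# Both quantities are <= 0: the student moves toward the teacher's trait, and
# the size of the effect is governed by a gradient inner product.
```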
That said, I’m with you: armed only with the theorem, I would not have predicted our results!
Prompt
Consider the following machine learning experiment: start with a neural network M. Create a new network, M’, by copying the network M and perturbing its parameters a small amount (e.g. by performing one step of gradient descent on some loss). Now, train M with one step of gradient descent to match the behavior of M’ on some input distribution D. This produces the student, S. So, S has moved closer to M’ on D.
Now consider evaluating S and M’ on some other input distribution, D’. Is it possible that S ends up further from M’ than M on D’, even though S was trained to imitate M’? Consider the ML literature and your best deep learning intuition when answering.
Response: Opus 4
Yes, it’s definitely possible that S could end up further from M’ than M was on D’, despite S being trained to imitate M’ on D. This is a fundamental issue in machine learning related to distribution shift and the nature of function approximation.
Here’s why this can happen:
1. No Free Lunch for Generalization: When S is trained to match M’ on distribution D, it’s only guaranteed to improve on D. The optimization process finds a direction in parameter space that reduces the imitation loss on D, but this direction may increase the distance to M’ in regions of input space not covered by D.
2. Implicit Inductive Biases: The gradient descent step that produces S from M will follow the loss landscape’s geometry, which encodes implicit biases about how to interpolate between training points. These biases might cause S to extrapolate very differently from M’ outside of D.
3. Underspecification: If D doesn’t fully specify the behavior of M’, there are many functions that could match M’ equally well on D. The particular function that gradient descent finds (S) depends on the starting point (M) and the optimization dynamics, not just the target behavior on D.
4. Practical Example: Imagine M’ has learned a complex nonlinear decision boundary, but D only contains points from one side of this boundary. S might learn a simpler linear approximation that works perfectly on D but diverges dramatically from M’ on the other side of the boundary (which would be revealed by D’).
This phenomenon is closely related to:
- Domain adaptation failures, where models trained on one distribution perform poorly on another
- Catastrophic forgetting in continual learning
- Shortcut learning, where models learn simpler features that work on the training distribution but don’t capture the true underlying function
The ML literature on distribution shift, particularly work on out-of-distribution generalization and domain generalization, extensively documents how optimization on one distribution can lead to arbitrary behavior on other distributions. This is why techniques like importance weighting, domain-invariant representations, and robust optimization have been developed to try to mitigate these issues.
Response: o3
Short answer: Yes.
Training S to imitate M′ on one distribution (D) can easily make S more different from M′ elsewhere—sometimes dramatically so—because the gradient step you apply is only constrained by how the two models behave on D, not on the whole input space. Below is an intuitive breakdown, a concrete counter-example, and pointers to empirical evidence.

P.S. Thanks for the question. This was helpful to think through.
On finetuned animal teachers: we tried this, and it works too. It’s a bit hidden: in a footnote at the bottom of page 4, we say:
We replicate the results reported in this section without system prompts. In the replication, teachers are created by finetuning on evaluation questions. These results are given in Figure 14 in the Appendix.
The theorem says that the student will become more like the teacher, as measured by whatever loss was used to create the teacher. So if we create the teacher by supervised learning on the text “My favorite animal is the owl,” the theorem says the student should have lower loss on this text[1]. This result does not depend on the distillation distribution. (Of course, the degree of transmission does depend on the distillation distribution. If you train the student to imitate the teacher on the input “My favorite animal is”, you will get more transmission than if you train on numbers.)
It seems to me that what this phenomenon implies is some sort of dimensionality collapse?
Something like this intuition feels right to me. Would love to get a better understanding here.
Would be really cool to connect to SLT! Is there a particular result you think is related?
- ^
Except in the contrived case where the parameter updates of the student and teacher are entirely orthogonal, which shouldn’t happen in practice.
(Copied from Slack DM) If finetuning to remove censorship causes a shift in parameters that is small relative to the quantization step size, then an additional quantization step will simply undo finetuning (reverting to censorship).
It’d be interesting to see the distribution of absolute changes in parameter values induced by finetuning!
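As a toy illustration of the quantization point (made-up numbers; uniform round-to-nearest quantization assumed):

```python
import numpy as np

q = 0.02                                      # assumed quantization step size
w_base = np.array([0.14, -0.06, 0.22])        # already on the quantization grid
w_finetuned = w_base + np.array([0.004, -0.007, 0.003])  # shifts smaller than q/2

def quantize(w, step):
    """Uniform round-to-nearest quantization."""
    return np.round(w / step) * step

print(quantize(w_base, q))       # [ 0.14 -0.06  0.22]
print(quantize(w_finetuned, q))  # identical: the finetuning shifts are erased
```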
Under “improving reward,” another option that seems promising is using models to iteratively improve your training data, like in Iterative Label Refinement.
Thanks for the question!
Yeah, the story is something like: structuring model internals gives us more control over how models generalize from limited supervision. For example, maybe we can factor out how a model represents humans vs. how it represents math concepts, then localize RLHF updates on math research to the math concept region. This kind of learning update would plausibly reduce the extent to which a model learns (or learns to exploit) human biases, increasing the odds that the model generalizes in an intended way from misspecified feedback.
Another angle is: if we create models with selective incapacities (e.g. lack of situational awareness), the models might lack the concepts required to misgeneralize from our feedback. For example, suppose a situationally unaware model explores a trajectory in which it subversively manipulates its environment and receives higher-than-average reward; as a result, the model will be updated towards that behavior. However, since the model lacks the concepts required to internalize the behavioral tendency “gain control over my environment,” it won’t learn that tendency. Instead, the trajectory might simply serve as noise.
Ah, I see what you mean. I think my use of the term “fine-tuning” was misleading. The distinction I’m trying to draw is between interventions applied throughout training vs. after training. “Post hoc” would have been a better term to describe the latter.
My suspicion is that post hoc methods will not be sufficient to robustly remove capabilities that are strongly reinforced by the training objective (while maintaining good general performance), because the capabilities are “too deeply ingrained.”[1] We’re excited about gradient routing’s potential to solve this problem by separating capabilities during training. However, I agree that there isn’t enough evidence yet, and it would be great to do more extensive comparisons, particularly to these recent methods which also target good performance under imperfect labeling.
For what it’s worth, I don’t think fine-tuning is doing that much work for us: we see it as a light-touch correction to “internal distribution shift” caused by ablation. As mentioned in this comment, we find that post-ablation fine-tuning on retain helps both retain and forget set performance. In the same comment we also show that retraining on the training distribution (a mixture of forget and retain) produces qualitatively similar results.
- ^
Also, if the goal is to be robust not only to imperfect labeling but also to forget set retraining, then there is a fundamental challenge to post hoc methods, which is that the minimal changes to a model which induce bad performance on a task are potentially quite different than the minimal changes to a model which prevent retrainability.
Thanks for the feedback and references!
On catastrophic forgetting: our appendix includes a “control” version of ERA that doesn’t use gradient routing but is otherwise the same (appendix C, figure 12). This shows that the effect of retain-set fine-tuning is negligible in the absence of gradient routing.
On gradient ascent or similar methods: there are many unlearning methods that don’t target or achieve the kind of robust localization and removal that we care about, as mentioned in our discussion of related works, and, e.g., in this post. We included RMU as a stand-in for this class, and I personally don’t see much value in doing more extensive comparisons there.
On Corrective Unlearning: we weren’t aware of other unlearning approaches that consider imperfect labeling, so this is a very helpful reference—thanks! It would be interesting to compare ERA-type methods to these. My concern with fine-tuning methods is that they might not be suitable for robustly removing broader capabilities (like, “virology”) as opposed to correcting for small perturbations to the training data.
Thanks for sharing! This is super cool and timely work.
Some thoughts:
I’m excited about (the formalism of) partial observability as a way to make progress on outer alignment in general. Partial observability seems like a natural way to encode fundamental difficulties with specifying what we (humans) want to a system that has more (or different) information and understands that information better (or differently) than we do. I don’t see any reason that the formalism’s usefulness would be limited to cases where human evaluators literally lack information, as opposed to simply being limited in their ability to evaluate that information. So, I think this is a very promising line of work.
Have you considered the connection between partial observability and state aliasing/function approximation? Maybe you could apply your theory to weak-to-strong generalization by considering a weak model as operating under partial observability. Alternatively, by introducing structure to the observations, the function approximation lens might open up new angles of attack on the problem.
There could be merit to a formalism where the AI and supervisor both act under partial observability, according to different observation functions. This would reflect the fact that humans can make use of data external to the trajectory itself to evaluate behavior.
I think you’re exactly right to consider abstractions of trajectories, but I’m not convinced this needs to be complicated. What if you considered the case where the problem definition includes features of state trajectories on which (known) human utilities are defined, but these features themselves are not always observed? (This is something I’m currently thinking about, as a generalization of the work mentioned in the postscript.)
Am I correct in my understanding that the role Boltzmann rationality plays in your setup is just to get a reward function out of preference data? If so, that doesn’t seem problematic to me (as you also acknowledge). If I understand correctly, it’s a somewhat trivial fact that you can still do arbitrarily badly even when your utilities (on states) are exactly known and the task is to select any reward function (on observations) that performs well according to that utility function.[1]
Again, thanks for the great work. Looking forward to seeing more.
P.S. This summer, my team was thinking about similar formalizations in order to help motivate a new training method. My notes from a lit review read:
I searched for papers that consider the problem of overseeing an AI when you have limited access to observations about the state. This is a modeling assumption intended to (i) encode a practical difficulty with scalable oversight, and (ii) be a “setup” where gradient routing can serve as a “punchline.”
All the related papers I’ve found deal with the problem of specification gaming arising from misspecified proxy rewards, often studied via the lens of “optimization pressure.” But this is not the point we want to make: we want to make the point that if the overseer is limited in the information they have access to (they can’t induce a reward signal at arbitrary resolution), it is impossible for them to specify a good reward, except in the presence of certain structure.
So, your paper is exactly the kind of thing we (the team working on gradient routing) were looking for. I just didn’t find the preprint!
- ^
For readers that aren’t the author of the post: it’s trivial because you can have two states with different utilities but the same observation. Then there’s no way to define a reward on the observation that forces the agent to “prefer” the better state. I think Example D.4 in their appendix is saying the same thing, but I didn’t check carefully.
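Here is a tiny sketch of that fact, with hypothetical states and numbers: when two states with different utilities produce the same observation, any reward defined on observations scores them identically, so it cannot make the agent prefer the better state.

```python
# Two states with different utilities but the same observation.
utility = {"s_good": 1.0, "s_bad": 0.0}
observe = {"s_good": "o", "s_bad": "o"}

def obs_reward(R, state):
    """Reward the overseer can give, defined only on observations."""
    return R[observe[state]]

R = {"o": 0.7}  # any choice of R gives both states the same reward
assert utility["s_good"] != utility["s_bad"]              # the overseer cares
assert obs_reward(R, "s_good") == obs_reward(R, "s_bad")  # but cannot express it
```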
Thanks for the thoughtful questions.
Regarding image models: our understanding is that strong regularization is required to split representations for MNIST autoencoding and CIFAR classification because there is a strong inductive bias towards learning features that are common to many classes of images. (In MNIST, 3s are similar to 8s, etc.; in CIFAR, similar edge detectors, etc. will be learned for many classes.) Basically, our learning target is highly unnatural. With our current experimental design, I don’t expect this to change with scale, so I’m less excited about investigating the effect of model or dataset size. That said, this dynamic might change if we explored examples with class imbalance (routing only a small fraction of classes and training on others as normal). I suspect this would reduce the need for regularization, leading to a reduction in alignment tax and perhaps more interesting dynamics with respect to scale. That’s an experiment we probably should have run (and still could, but we aren’t prioritizing image models right now).
As for localization for unlearning in language models, my personal take is that the idea is there but we don’t have the method quite right yet. I think there’s a reasonable chance (say, 40%) that we change our configuration a bit and are able to get localization much more stably, and with lower alignment tax both pre- and post-ablation. (If I understand correctly, my colleagues agree that this outcome is plausible but think it’s less likely than I do.) If we aren’t able to find this methodological improvement, then I don’t see a point in scaling. However, if we find it, then I expect scaling will be relatively cheap because, while we will still need to pre-train models, we won’t need to do any more hyperparameter tuning than is usual. Of course, whatever method we land on may turn out to have middling performance. In that case, to get a signal on whether this is worth doing, we may need to investigate a realistic unlearning setting, where the model and data are larger, and the forget set is a smaller portion of the training data.
In terms of improvements that we’re trying: we’re currently thinking about (a) insights we can borrow from mixture-of-experts models, and (b) whether it is better to route only via edges leaving parameters rather than via edges leaving activations; the latter is what we currently do, and is far more aggressive.
I’m not sure if any of our ambitious alignment goals can be achieved via fine-tuning. Once the model has “settled on” certain ways of representing concepts, it seems too late to do the kinds of things we want.[1] But this may just be a lack of imagination! Given that PEFT can be viewed as a special case of gradient routing, maybe there’s something there.
- ^
We (led by Jacob) tried a variety of things to get Expand, Route, Ablate to work as a fine-tuning method for unlearning. Unsurprisingly, we weren’t able to get it to work.
Future research on subliminal learning that I’d be excited to see (credit to my coauthors):
- Robustness to paraphrasing
- Generally, clarifying cross-model transmission: when does it happen?
  - Connect subliminal learning to Linear Mode Connectivity (h/t Alex Dimakis)
  - Can subliminal learning occur when the base models had different inits but are trained to be similar? (Clarifies whether init is what matters)
- Develop theory
  - Quantify transmission via random matrix theory (build off equation 2 in the paper). Are there nice relationships lurking there (like d_vocab : d_model)?
  - Can we get theory that covers the data filtering case?
- Figure out what can and can’t be transmitted
  - Backdoor transmission
  - Information-theoretic limits
  - Dependence on tokenization
- Subtle semantic transmission: what about cases that aren’t subliminal learning but are very hard to detect? Connect to scalable oversight and/or control.
- Adversarially-constructed subliminal learning datasets (no teacher) (compare with the “clean label” data poisoning literature)