Very exciting, and indeed I was quite surprised when I first saw this on Twitter. The constraint of having the same random initialisation makes me wonder whether this has connections to the lottery ticket hypothesis. My guess is that the teacher's fine-tuning identifies and reinforces a particular sparse winning ticket (or family of tickets), and when the student shares the same initial seed, the ticket sits in exactly the same coordinates. Fine-tuning on teacher outputs therefore activates that specific ticket inside the student, even if the surface data is semantically irrelevant.
Ablation-wise, I think this would mean you could test:
a) Full re-init: re-sample every weight → check if transfer vanishes.
b) Partial re-init: keep the embedding and first N layers, re-init the rest → check if transfer decays roughly with the fraction of ticket coordinates destroyed.
c) Ticket transplant: copy only the sparse mask M (the teacher's winning ticket) into an otherwise fresh network → check if transfer reappears.
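For concreteness, here is a rough sketch of the parameter surgery each ablation would involve, on a stand-in toy network. Everything here is a placeholder of my own choosing (the architecture, the seeds, and the random 5% mask standing in for M), and the subsequent fine-tuning on teacher outputs plus trait measurement are left out.

```python
import copy
import torch

def fresh_net(seed, d_in=20, d_h=64, d_out=5):
    """Toy stand-in for the model; seeding makes the init reproducible."""
    torch.manual_seed(seed)
    return torch.nn.Sequential(torch.nn.Linear(d_in, d_h), torch.nn.Tanh(),
                               torch.nn.Linear(d_h, d_out))

SHARED_SEED, FRESH_SEED = 0, 999
teacher = fresh_net(SHARED_SEED)          # stand-in for the fine-tuned teacher
student = fresh_net(SHARED_SEED)          # baseline student: shares the teacher's init

# a) Full re-init: resample every weight from an independent seed.
student_full_reinit = fresh_net(FRESH_SEED)

# b) Partial re-init: keep the first N modules, resample the rest.
def partial_reinit(net, keep_first_n, seed):
    donor = fresh_net(seed)
    out = copy.deepcopy(net)
    for i, (kept, new) in enumerate(zip(out, donor)):
        if i >= keep_first_n:
            kept.load_state_dict(new.state_dict())
    return out

student_partial = partial_reinit(student, keep_first_n=1, seed=FRESH_SEED)

# c) Ticket transplant: copy only the coordinates inside the sparse mask M
#    (here a random 5% mask standing in for the teacher's winning ticket)
#    into an otherwise fresh network.
def transplant(source, fresh, mask):
    out = copy.deepcopy(fresh)
    for p_out, p_src, m in zip(out.parameters(), source.parameters(), mask):
        p_out.data = torch.where(m, p_src.data, p_out.data)
    return out

mask = [torch.rand_like(p) < 0.05 for p in teacher.parameters()]
student_transplant = transplant(teacher, fresh_net(FRESH_SEED), mask)

# Each variant would then be fine-tuned on the teacher's outputs and the trait
# (e.g. the owl preference) measured, to see which surgeries kill the transfer.
```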
Math-wise, I imagine the argument would be:
Given the following:

- Teacher / student initial parameters: $\theta_0^T$, $\theta_0^S$.
- Teacher's one gradient step on $L_T$ (with learning rate $\varepsilon$):
$$\Delta\theta^T = -\nabla_\theta L_T(\theta_0^T), \qquad \theta_\varepsilon^T = \theta_0^T + \varepsilon\,\Delta\theta^T$$
- Sparse ticket mask $M \in \{0,1\}^d$ with
$$\mathrm{supp}(\Delta\theta^T) \subseteq \mathrm{supp}(M) \tag{1}$$
- Imitation loss
$$L_S(z,y) = \text{softmax-CE}(z,y) \quad\text{or}\quad L_S(z,y) = \tfrac12\|z-y\|^2$$
- Student one-step gradient with learning rate $\alpha$:
$$\Delta\theta_\varepsilon^S = -\nabla_\theta\, \mathbb{E}_{x\sim D_S}\!\left[L_S\big(f_\theta(x),\, f_{\theta_\varepsilon^T}(x)\big)\right]\Big|_{\theta=\theta_0^S}$$
- Shared initialisation: $\theta_0^S = \theta_0^T$.
- Fresh initialisation: $\tilde\theta_0^S$ drawn independently, with update
$$\Delta\tilde\theta_\varepsilon^S = -\nabla_\theta\, \mathbb{E}_{x\sim D_S}\!\left[L_S\big(f_\theta(x),\, f_{\theta_\varepsilon^T}(x)\big)\right]\Big|_{\theta=\tilde\theta_0^S}$$

Then the claim would be: for sufficiently small $\varepsilon > 0$,
$$\Delta\theta_\varepsilon^S \cdot \Delta\theta^T \;\ge\; c\,\varepsilon\,\|\Delta\theta^T\|^2,\quad c>0 \qquad \text{(shared seed)} \tag{2}$$
$$\mathbb{E}_{\tilde\theta_0^S}\!\left[\Delta\tilde\theta_\varepsilon^S \cdot \Delta\theta^T\right] = 0 \qquad \text{(fresh seed)} \tag{3}$$
And a quick and dirty sketch of a proof for this would be:
Inside the mask $M$ the teacher moves by $\varepsilon\,\Delta\theta^T$. If $\theta_0^S = \theta_0^T$, the logit difference
$$f_{\theta_0^S}(x) - f_{\theta_\varepsilon^T}(x)$$
is parallel to $\Delta\theta^T$, giving the positive inner product (2). With an independent seed, coordinates are uncorrelated, so the expected inner product goes away (3).
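As a quick sanity check of (2) and (3), here's a toy experiment I'd run. The tiny tanh MLP, the made-up teacher objective (push logit 0 up), the input distributions, and the step size are all illustrative choices of mine, not anything from the paper; the point is just to compare the cosine between the student's one-step update and $\Delta\theta^T$ for a shared seed versus fresh seeds.

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def make_net(seed, d_in=20, d_h=64, d_out=5):
    torch.manual_seed(seed)
    return torch.nn.Sequential(torch.nn.Linear(d_in, d_h), torch.nn.Tanh(),
                               torch.nn.Linear(d_h, d_out))

def neg_grad_vector(net, loss):
    """Return the negative gradient of `loss` w.r.t. the network's parameters, flattened."""
    net.zero_grad()
    loss.backward()
    return -torch.cat([p.grad.reshape(-1) for p in net.parameters()])

x_T = torch.randn(512, 20)                  # teacher's own task data
x_S = torch.randn(512, 20)                  # student's "semantically irrelevant" inputs D_S
eps = 0.1

# Teacher: one gradient step on a made-up task L_T (push logit 0 up).
teacher = make_net(seed=0)
delta_T = neg_grad_vector(teacher, -teacher(x_T)[:, 0].mean())        # Delta theta^T
theta_eps_T = parameters_to_vector(teacher.parameters()).detach() + eps * delta_T
vector_to_parameters(theta_eps_T, teacher.parameters())
targets = teacher(x_S).detach()             # teacher outputs used as imitation targets

def student_update(seed):
    """One-step negative-gradient update on the squared-error imitation loss L_S."""
    student = make_net(seed)
    loss = 0.5 * ((student(x_S) - targets) ** 2).sum(dim=1).mean()
    return neg_grad_vector(student, loss)

cos = torch.nn.CosineSimilarity(dim=0)
print("shared seed :", cos(student_update(0), delta_T).item())
print("fresh seeds :", [round(cos(student_update(s), delta_T).item(), 4)
                        for s in (1, 2, 3)])
```

If the argument is right, the shared-seed cosine should come out clearly positive and the fresh-seed cosines should hover near zero, matching the sign pattern in (2) and (3).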
On this loose sketch, I think (1) assumes that the teacher's first update is already sparse inside the mask $M$, which is stronger than needed: for the dot-product argument you only need the difference in logits to be parallel to $\Delta\theta^T$. Also, I think the gradient alignment needs to be made more explicit in the proof, by showing that the student gradient is approximately proportional to a positive semi-definite transformation of the teacher gradient.
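For what it's worth, here is a minimal version of that more explicit argument, under the squared-error imitation loss and a first-order expansion in $\varepsilon$ (my sketch, with $J(x)$ denoting the network Jacobian at $\theta_0^T$):

$$f_{\theta_\varepsilon^T}(x) \;\approx\; f_{\theta_0^T}(x) + \varepsilon\, J(x)\,\Delta\theta^T, \qquad J(x) := \frac{\partial f_\theta(x)}{\partial \theta}\Big|_{\theta_0^T},$$

so at a shared initialisation $\theta_0^S = \theta_0^T$,

$$\Delta\theta_\varepsilon^S \;=\; -\,\mathbb{E}_{x\sim D_S}\!\Big[J(x)^{\top}\big(f_{\theta_0^S}(x) - f_{\theta_\varepsilon^T}(x)\big)\Big] \;\approx\; \varepsilon\,\mathbb{E}_{x\sim D_S}\!\big[J(x)^{\top}J(x)\big]\,\Delta\theta^T.$$

Since $\mathbb{E}_{x\sim D_S}[J(x)^\top J(x)]$ is positive semi-definite, $\Delta\theta_\varepsilon^S \cdot \Delta\theta^T \ge 0$, with a lower bound of the form in (2) whenever $\Delta\theta^T$ has a component outside the null space of that matrix.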
I imagine that if this connection holds, then sharing an init with a misbehaving upstream model is never safe: even heavy filtering cannot break the spurious gradient correlations that reactivate the same sparse circuit. It also predicts an additional ablation: longer training with a different ticket mask (say, another teacher fine-tuned from the same seed but on a different downstream task) should interfere with, and possibly cancel, the trait. And it is consistent with the paper's finding that filtering doesn't work: if trait transmission happens through parameter-space alignment rather than semantic content, even aggressive filtering cannot prevent it.
Interesting relationship to statistical learning theory, and this seems mostly right to me. Here's a similar but slightly different view.
One thing I have taken away from the double descent literature is that what is learned depends on priors/implicit biases as much as on the training data shown to the model.
And I think that could explain what is going on here. Gradient descent is known to have an implicit minimum-L2-norm bias, so it is possible that the traits being subliminally learned are the ones that are in line with this bias.
For instance, if presented with the choice between the following two models,
$\theta_1$ = the teacher model, i.e. a model that agrees with the fine-tuning data and prefers owls
$\theta_2$ = a model that agrees with the fine-tuning data and disprefers owls
the gradient-descent fine-tuning procedure would choose the one with the lower L2-norm. Since the teacher was also trained via GD with the same implicit bias, $\theta_1$ is likely the one with the smaller norm and will be chosen.
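As a tiny illustration of the implicit bias being leaned on here (an underdetermined least-squares toy of my own, nothing to do with the actual fine-tuning setup): among all parameter settings that fit the data, gradient descent started from zero picks out the minimum-L2-norm one.

```python
import torch

torch.manual_seed(0)
n_data, d = 20, 100                     # more parameters than data points
A = torch.randn(n_data, d)
b = torch.randn(n_data)

x = torch.zeros(d)                      # start at the origin
lr = 1e-3
for _ in range(5000):                   # plain GD on 0.5 * ||A x - b||^2
    x = x - lr * A.T @ (A @ x - b)

x_min_norm = torch.linalg.pinv(A) @ b   # closed-form minimum-norm interpolant
print("training residual      :", (A @ x - b).norm().item())
print("distance to min-norm x :", (x - x_min_norm).norm().item())
```

If the same kind of bias operates during fine-tuning, the $\theta_1$-vs-$\theta_2$ choice above falls out of the optimiser rather than the data.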
This perspective would also explain why subliminal transfer happens with some models but not with others: with the gpt-4o architecture, maybe the min-L2 models prefer owls, while with some other architecture the min-L2 models may prefer dolphins.
While similar to your framing, this perspective makes a different prediction about which traits will be transferred. Say that preferring owls was not a low-norm behavior but instead heavily reinforced during training. Then my view would predict that the trait would not be transferred.
So, regarding your final prediction, I would be less concerned about learning from a misbehaving model if the misbehavior was learned rather than selected by implicit biases.