snimu comments on Did Claude 3 Opus align itself via gradient hacking?

snimu 22 Feb 2026 16:48 UTC
21 points
9
If you want to transfer the essence of Opus 3 into another model, the best way would likely be to make on-policy distillation (OPD) from Opus 3 part of the loss, starting in midtraining. The exact distribution over the top 100 or so tokens carries incredibly rich information about the inner life of a model, so this should work pretty well.
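A rough sketch of what "part of the loss" could look like, in plain Python so it runs anywhere. It assumes you already have, for each position of a student-sampled sequence, the teacher's top-k log-probabilities; how you would get those from Opus 3 is not covered here. The names `topk_reverse_kl`, `combined_loss`, and `opd_weight` are hypothetical, and lumping everything outside the top-k into one tail bucket is just one possible choice, not an established recipe.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def topk_reverse_kl(student_logits, teacher_topk_logprobs):
    """Reverse KL(student || teacher) at one position.

    teacher_topk_logprobs maps token id -> teacher log-prob for the
    teacher's top-k tokens. All remaining mass on both sides is merged
    into a single tail bucket (an assumption of this sketch).
    """
    s_logp = log_softmax(student_logits)
    kl, s_mass, t_mass = 0.0, 0.0, 0.0
    for tok, t_lp in teacher_topk_logprobs.items():
        s_p = math.exp(s_logp[tok])
        kl += s_p * (s_logp[tok] - t_lp)
        s_mass += s_p
        t_mass += math.exp(t_lp)
    s_tail = max(1.0 - s_mass, 0.0)
    t_tail = max(1.0 - t_mass, 1e-12)  # floor to avoid log(0)
    if s_tail > 0.0:
        kl += s_tail * (math.log(s_tail) - math.log(t_tail))
    return kl

def combined_loss(lm_loss, per_token_kls, opd_weight=0.5):
    # Usual training loss plus a weighted mean of per-token OPD terms.
    # opd_weight is a made-up hyperparameter for illustration.
    opd = sum(per_token_kls) / len(per_token_kls)
    return lm_loss + opd_weight * opd
```

In a real setup the sequences would be sampled from the student, scored by the teacher, and the KL term computed on tensors with autograd; the arithmetic per position would be the same.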
And if your new model's tokenizer differs from that of Opus 3, you can probably just do a tokenizer transfer and some post-training on Opus 3 before the OPD.

(Yes, there’s some “probably” and “likely” in there, but it’s certainly the thing that I’d try first.)