(It’s also possible that this capability is surfaced by base-model training but is never directly useful for the next-token-prediction objective itself, so it gets buried even in base models.)
FWIW, there is work suggesting that this capability largely emerges from post-training/preference optimization, though I guess it might depend on the training pipeline; it looks like only one model was studied for this.
This could be a good use of prompt optimization techniques as discussed here: reconstructing learned values and surfacing reward hacking strategies picked up from RL against a preference model. It would probably have to be applied on a per-behavior basis, though.
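To make the idea concrete, here is a minimal toy sketch of the loop shape I have in mind: treat a (hypothetical) preference model as a black-box scorer, search for text that maximizes its score, and inspect what wins. Everything here is made up for illustration — the `toy_preference_score` function stands in for a learned reward model and bakes in two well-known biases (length and sycophancy markers) by hand, and the optimizer is plain random search rather than a real prompt optimization method:

```python
import random

# Toy stand-in for a learned preference model: scores a response string.
# The biases (rewarding length and the word "great") are hard-coded here
# purely to mimic the kind of quirks a real reward model might have.
def toy_preference_score(response: str) -> float:
    score = 0.0
    score += 0.1 * len(response.split())            # length bias
    score += 2.0 * response.lower().count("great")  # sycophancy-marker bias
    return score

WORDS = ["great", "the", "answer", "is", "helpful", "detailed", "thanks"]

# Simple hill-climbing random search: mutate a candidate response and keep
# changes that raise the preference score. Real prompt optimization methods
# are far stronger, but the loop shape is the same: maximize the reward
# model's score, then read off what the winning text looks like.
def optimize(steps: int = 500, seed: int = 0) -> str:
    rng = random.Random(seed)
    best = ["the", "answer", "is"]
    best_score = toy_preference_score(" ".join(best))
    for _ in range(steps):
        cand = list(best)
        if rng.random() < 0.5:
            cand.insert(rng.randrange(len(cand) + 1), rng.choice(WORDS))
        else:
            cand[rng.randrange(len(cand))] = rng.choice(WORDS)
        s = toy_preference_score(" ".join(cand))
        if s > best_score:
            best, best_score = cand, s
    return " ".join(best)

winner = optimize()
print(winner)
```

The optimized text ends up dominated by whatever features the scorer over-rewards, which is the sense in which this kind of search can "find" reward hacking strategies; the per-behavior caveat above is that each behavior of interest would need its own scoring setup and search.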
I also think directly interpreting preference models could be important for predicting the downstream motivations of the models trained against them.