(It’s also possible that this capability emerges as a consequence of base-model training but is never directly useful for the next-token-prediction objective, so it stays buried even in base models.)
FWIW, there is work suggesting that this capability largely emerges from post-training/preference optimization, though I guess that might depend on the training pipeline; it looks like only one model was studied.