I don’t think anything in their training incentivizes self-modeling of this kind.
RLVR probably incentivizes it to a degree. It’s much easier to make the correct choice for token 5 if you know how each possible choice will affect your train of thought for tokens 6–1006.