One reason for this (see here) may be that RLVR models would rather give an answer with a very low probability of being correct than say that they don’t know. (Though that doesn’t explain their inability to properly see the Pokémon game.)