cubefox comments on Why not train reasoning models with RLHF?