I do think RL works “as intended” to some extent, teaching models some actual reasoning skills, much like SSL works “as intended” to some extent, chiseling in some generalizable knowledge. The question is to what extent it’s one or the other.