I remember reading that SFT can undermine subsequent RL by inducing pseudo-reasoning paths imitated from expert models (at least in Large Vision-Language Models). Do you think these results could be attributed to this behavior, or would the results be the same if only RL were used?
Alexander Olivaw
Karma: 0
Readow AI is very good at finding similar books to the one/ones you enter.
A few decades ago, we were trapped in a house that was slowly sinking into quicksand over the course of decades. Now, however, the house IS on fire.