cubefox comments on Matthew Khoriaty’s Shortform

cubefox 1 Mar 2025 7:20 UTC
2 points
S1 apparently uses supervised learning:

"We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces (...). After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K (...)."

But 8,000 samples, as with R1, is far fewer than I expected.