We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces (...). After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K (...).
So s1 is apparently using supervised learning.
But 8000 samples like R1 is a lot less than I thought.