I don’t think that distinction is important? I think of the reasoning stuff as just long-horizon but with the null environment of only your own outputs.
Maybe; you could define it that way. I think R1, which uses ~naive policy gradient, is evidence that long generations are different from, and much easier than, long episodes with environment interaction. GRPO (pretty much naive policy gradient) does no credit attribution to individual steps or parts of the trajectory; it just trains on the whole trajectory. Naive policy gradient is known to completely fail at more traditional long-horizon tasks like real-time video games. R1 is more like brainstorming lots of random stuff that doesn't matter and then selecting the good stuff at the end, rather than taking actions that actually have to be good before the final output.
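(A minimal sketch of what I mean by "no attribution to steps": in GRPO the reward for each sampled completion is just normalized against the other completions in its group, and that one scalar is applied to every token of the trajectory. The helper name `grpo_advantages` is mine, for illustration only.)

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style: each sampled completion gets a
    single scalar advantage (its reward normalized against the other samples for the
    same prompt), and that same scalar is broadcast to every token it generated."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy example: 4 completions sampled for one prompt, scored 0/1 by a verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = grpo_advantages(rewards)
# Every token in completion i trains with advantage advs[i]; there is no
# per-step credit assignment within the trajectory.
print(advs)
```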