I don’t think that distinction is important? I think of the reasoning stuff as just long-horizon but with the null environment of only your own outputs.
Maybe; you could define it that way. I think R1, which uses ~naive policy gradient, is evidence that long generations are different from, and much easier than, long episodes with environment interaction. GRPO (pretty much naive policy gradient) does no credit attribution to individual steps or parts of the trajectory; it just trains on the whole trajectory. Naive policy gradient is known to completely fail at more traditional long-horizon tasks like real-time video games. R1 is more like brainstorming lots of random stuff that doesn't matter and then selecting the good stuff at the end, rather than taking actions that actually have to be good before the final output.
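(A minimal sketch of what I mean by "no attribution to steps": in GRPO the reward for each sampled completion is just normalized against the other completions in its group, and that one scalar is applied to every token of the trajectory. The helper name `grpo_advantages` is mine, for illustration only.)

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style: each sampled completion gets a
    single scalar advantage (its reward normalized against the other samples for the
    same prompt), and that same scalar is broadcast to every token it generated."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy example: 4 completions sampled for one prompt, scored 0/1 by a verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = grpo_advantages(rewards)
# Every token in completion i trains with advantage advs[i]; there is no
# per-step credit assignment within the trajectory.
print(advs)
```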