First, I think RL is more like 10,000x than 100x less efficient than SL (DeepSeek V3 probably can’t be compressed much below 10 GB, while the DeepSeek R1-Zero stage can probably be compressed to about 1 MB of transcripts, despite both costing roughly 1e24 FLOP).
Maybe this is true for SL on trajectories from the post-RL policy, but it isn’t clear to me that this is the right way to think about it.
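For reference, the 10,000x figure in the first paragraph is just the ratio of the two compressed sizes, since the compute is taken as equal. Here is a rough sketch of that arithmetic; the constant and function names are made up for illustration, and the numbers are the comment's own guesses (10 GB, 1 MB, 1e24 FLOP), not measurements. Using compressed size as a proxy for "information learned" is the comment's framing, which is exactly what is in dispute.

```python
# Back-of-the-envelope figures quoted in the comment (rough guesses, not measurements).
SL_COMPRESSED_BYTES = 10e9  # ~10 GB: guessed floor for compressing DeepSeek V3
RL_COMPRESSED_BYTES = 1e6   # ~1 MB: guessed size of transcripts capturing the R1-Zero stage
COMPUTE_FLOP = 1e24         # the comment's rough compute figure, assumed equal for both


def bytes_per_flop(compressed_bytes: float, flop: float) -> float:
    """Information learned per unit of compute, with compressed size as a proxy."""
    return compressed_bytes / flop


sl = bytes_per_flop(SL_COMPRESSED_BYTES, COMPUTE_FLOP)
rl = bytes_per_flop(RL_COMPRESSED_BYTES, COMPUTE_FLOP)

# With equal compute, the FLOP term cancels and this is just 10 GB / 1 MB.
efficiency_ratio = sl / rl

print(f"SL: {sl:.0e} bytes/FLOP")
print(f"RL: {rl:.0e} bytes/FLOP")
print(f"SL is ~{efficiency_ratio:,.0f}x more efficient under these assumptions")
```

Changing either size guess by 10x moves the ratio by 10x, so the conclusion is only as good as the compressibility estimates.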