jsd comments on Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

jsd 3 Sep 2025 13:58 UTC
LW: 4 AF: 3
0
AF
Another way to think about this is that it could be reasonable to spend within the same order of magnitude on each RL environment as you spend in compute cost to train on that environment. I think the compute cost for doing RL on a hard agentic software engineering task might be around $10 to $1000 ($0.1 to $1 for each long rollout and you might do 100 to 1k rollouts?), so this justifies a lot of spending per environment. And, environments can be reused across multiple training runs (though they could eventually grow obsolete).
Agreed, cf https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/
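A rough sketch of the arithmetic behind the quoted range, using only the figures in the quote ($0.1 to $1 per rollout, 100 to 1k rollouts). The function name is purely illustrative:

```python
def compute_cost_range(cost_per_rollout, rollouts):
    """Return (low, high) RL compute cost in dollars for one environment.

    Both arguments are (low, high) tuples; the extremes are multiplied
    together to bound the total.
    """
    low = min(cost_per_rollout) * min(rollouts)
    high = max(cost_per_rollout) * max(rollouts)
    return low, high


# Figures from the quoted paragraph.
low, high = compute_cost_range(cost_per_rollout=(0.1, 1.0), rollouts=(100, 1000))
print(f"${low:,.0f} to ${high:,.0f} per environment")
```

This reproduces the quoted $10 to $1000 span: two orders of magnitude, one from the per-rollout cost and one from the rollout count.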
- ryan_greenblatt 3 Sep 2025 14:04 UTC
  LW: 5 AF: 4
  0
  AF Parent
  Thanks, I wasn’t aware of this post. (I think it overstates the level of spending we’ll see on the average RL env within a year by maybe 10x or more, but I agree directionally.)
  - jsd 3 Sep 2025 14:42 UTC
    LW: 1 AF: 1
    0
    AF Parent
    You seem to roughly agree with their BOTEC's group size of 64 and 5 reuses per task (since 5 * 64 = 320 is between 100 and 1k).
    You wrote $0.1 to $1 per rollout, whereas they have in mind 500,000 tokens * $15 / 1M tokens = $7.50. 500,000 tokens doesn’t seem especially high for hard agentic software engineering tasks, which often reach into the millions.
    Does the disagreement come from:
    Thinking the $15 estimate from opportunity cost is too high (so compute cost is lower than Mechanize claims)?
    Expecting most of the RL training to somehow not be end-to-end (so compute cost is lower than Mechanize claims)?
    Expecting spending per RL environment to be smaller than compute spending, even if within an OOM?
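For reference, a sketch putting the two back-of-the-envelope estimates side by side. The numbers are the ones quoted in this comment (group size 64, 5 reuses per task, 500,000 at $15 per 1M); reading 500,000 as tokens per rollout is an assumption, and the variable names are illustrative:

```python
# Mechanize-style estimate, using the figures quoted in this comment.
group_size = 64
reuses_per_task = 5
tokens_per_rollout = 500_000       # assumption: 500,000 is a per-rollout token count
price_per_million_tokens = 15.0    # dollars, the "$15" opportunity-cost figure

rollouts_per_task = group_size * reuses_per_task
cost_per_rollout = tokens_per_rollout * price_per_million_tokens / 1_000_000
cost_per_task = rollouts_per_task * cost_per_rollout

# The other estimate in this thread: $0.1 to $1 per rollout.
gap_vs_high_end = cost_per_rollout / 1.0
gap_vs_low_end = cost_per_rollout / 0.1

print(f"rollouts per task: {rollouts_per_task}")
print(f"cost per rollout:  ${cost_per_rollout:.2f}")
print(f"cost per task:     ${cost_per_task:,.0f}")
print(f"per-rollout gap:   {gap_vs_high_end:.1f}x to {gap_vs_low_end:.0f}x")
```

Under these inputs the rollout counts are compatible (320 falls inside 100 to 1k), so essentially the whole gap sits in the per-rollout cost.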
    - ryan_greenblatt 3 Sep 2025 15:25 UTC
      LW: 6 AF: 5
      0
      AF Parent
      I expect lower cost per rollout on average due to AI companies doing RL on a bunch of smaller tasks and not necessarily using tons of reasoning tokens on most envs. Also, API prices are marked up relative to what companies actually pay for compute, which can easily add a factor of 3. If we are just looking at the hardest agentic software engineering environments, then this closes the gap a decent amount.
      I expect spending on RL environments to be more like 10x lower than RL training compute rather than similar (and I wouldn’t be surprised by a large gap), because it’s hard to massively scale up spending on RL envs effectively in a short period of time, while we already have a scaled-up industrial process for buying more compute.
      
      I’m more sympathetic to “companies will spend this much on some high quality RL envs” than “the typical RL env will be very expensive”, but I think some disagreement remains.
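As a purely illustrative sketch, here is what the two adjustments above (a 3x API markup and env spending roughly 10x below training compute) do to the parent comment's numbers for a hard task. It applies only those two factors; the effect of smaller tasks and shorter rollouts on the average is not modeled, and the variable names are made up for the example:

```python
# Inputs taken from the parent comment.
list_price_per_rollout = 500_000 * 15.0 / 1_000_000  # $7.50 at API list price
rollouts_per_env = 5 * 64                             # 320 rollouts

# Adjustments described in this comment.
api_markup = 3.0          # "can easily add a factor of 3"
compute_to_env_ratio = 10  # env spending "more like 10x lower" than compute

internal_cost_per_rollout = list_price_per_rollout / api_markup
compute_per_env = internal_cost_per_rollout * rollouts_per_env
env_spend = compute_per_env / compute_to_env_ratio

print(f"internal cost per rollout: ${internal_cost_per_rollout:.2f}")
print(f"training compute per env:  ${compute_per_env:,.0f}")
print(f"implied env spend:         ${env_spend:,.0f}")
```

With the markup removed, the hard-task per-rollout figure lands at about $2.50, much closer to the $0.1 to $1 range than the $7.50 list-price figure.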