I have very different intuitions about 50M GPUs for 1 week vs 200k GPUs with 200 hours of work spread evenly across 50 years.
| | SlowCorp v1 | SlowCorp v2 | NormalCorp v1 | NormalCorp v2 | AutomatedCorp |
|---|---|---|---|---|---|
| Time to work on AI R&D | 50 years | 50 years | 50 years | 50 years | 50 years |
| Number of AI researchers and engineers | 800 | 800 | 4,000 | 4,000 | 200,000 |
| Researcher/engineer quality | Median frontier AI company researcher/engineer | Median frontier AI company researcher/engineer | Similar to current frontier AI companies if they expanded rapidly | Similar to current frontier AI companies if they expanded rapidly | Level of world’s 100 best researchers/engineers |
| Time worked | One week of 24/7 work (or four weeks at 40h/week, but the GPUs are paused while the workers aren’t working) | 50 years of one 4-hour session per year | One year of 24/7 (or four years of 40h/week, but the GPUs are paused while the workers aren’t working) | 50 years of 40 hours/week for 1 month per year | 50 years of 24/7 |
| H100s | 500,000,000 | 200,000 | 10,000,000 | 200,000 | 200,000 |
| Cumulative H100-years | 10 million | 10 million | 10 million | 10 million | 10 million |
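As a quick sanity check on the compute column, here's a back-of-the-envelope calculation (my own arithmetic, not from the comment), assuming the GPUs run for one week in SlowCorp v1, one year in NormalCorp v1, and the full 50 years in the other three columns:

```python
# Cumulative compute = number of H100s x years the GPUs are actually running.
scenarios = {
    "SlowCorp v1":   (500_000_000, 1 / 52),  # GPUs only run for the one week of work
    "SlowCorp v2":   (200_000, 50),
    "NormalCorp v1": (10_000_000, 1),        # GPUs only run for the one year of work
    "NormalCorp v2": (200_000, 50),
    "AutomatedCorp": (200_000, 50),
}
for name, (h100s, years) in scenarios.items():
    print(f"{name}: {h100s * years / 1e6:.1f}M H100-years")
# Every column comes out at roughly 10M H100-years, so the compute budget is held fixed.
```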
I think SlowCorp-v2 would get a lot more done than SlowCorp-v1 (though obviously still a lot less than AutomatedCorp). SlowCorp-v2 also seems like a closer analogy than SlowCorp-v1: it and AutomatedCorp have the same amount of serial time, and my intuition is that you generally can’t make a training run go 10x faster just by throwing 10x as many GPUs at it, because you’ll be bottlenecked by IO.
And I know “SlowCorp is bottlenecked by IO” is not the point this intuition pump was supposed to make, but at least for me, it ended up being the main consideration pumping my intuition.
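For intuition on why 10x as many GPUs generally doesn't buy a 10x faster run, here's a minimal Amdahl's-law-style sketch (my own illustration, not from the thread), assuming some fixed fraction of each training step is serial or IO/communication-bound and doesn't shrink as you add accelerators:

```python
def speedup(gpu_multiplier: float, serial_fraction: float) -> float:
    """Amdahl-style speedup from multiplying GPU count, if `serial_fraction`
    of the work (IO, communication, etc.) doesn't parallelize."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / gpu_multiplier)

for s in (0.0, 0.05, 0.2):
    print(f"serial fraction {s:.0%}: 10x GPUs -> {speedup(10, s):.1f}x faster")
# serial fraction 0%: 10x GPUs -> 10.0x faster
# serial fraction 5%: 10x GPUs -> 6.9x faster
# serial fraction 20%: 10x GPUs -> 3.6x faster
```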
Yeah, I discuss this here:

The way I set up the analogy makes it seem like AutomatedCorp has a serial compute advantage: because they have 50 years, they can run things that take many serial years while NormalCorp can’t. As in, the exact analogy implies that they could use a tenth of their serial time to run a 5-year-long training run on 50k H100s, while in reality this would only be possible if the run was sufficiently parallelizable that it could be done on 2.5 million H100s in a tenth of a year. So, you should ignore any serial compute advantage. Similarly, you should ignore difficulties that SlowCorp might have in parallelizing things sufficiently, etc.
You can also imagine that SlowCorp has 10 million magically good GPUs (and CPUs etc) which are like H100s but 50x serially faster (but still only has 1 week) while AutomatedCorp has 10 million much worse versions of H100s (and CPUs etc) which are 50x serially slower but otherwise the same (and has 50 years still).
Also SlowCorp has magically 50x better networking equipment than NormalCorp, and 50x higher rate limits on every site they’re trying to scrape, and 50x as much sensor data from any process in the world, and 50x faster shipping on any physical components they need, etc etc (and AutomatedCorp has magically 50x worse of all of those things).
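To make the serial-compute arithmetic in the quoted passage explicit (my own worked check of the numbers above): a 5-year run on 50k H100s and a 0.1-year run on 2.5 million H100s are the same total compute; the analogy only carries over if the workload actually parallelizes that far.

```python
# Same H100-years either way; only the wall-clock shape differs.
serial_shape   = 50_000 * 5          # 5-year run on 50k H100s (a tenth of AutomatedCorp's 50 years)
parallel_shape = 2_500_000 * 0.1     # 0.1-year run on 2.5M H100s (a tenth of NormalCorp's year)
print(serial_shape, parallel_shape)  # 250000 250000.0
```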
But yeah, agreed that you should ignore all of those intuitions when considering the “1 week” scenario—I just found that I couldn’t actually turn all of those intuitions off when considering the scenario.
Yep, but my understanding is that the time associated with marginal scraping, sensor data, and physical components doesn’t matter much when talking about AI progress on the order of a year. Or honestly, maybe marginal improvements in these sorts of components don’t matter that much at all over this time scale (like, freezing all these things for a year wouldn’t be much of a tax if you prepped in advance). Not super sure about the situation with scraping, though.