For compute I’m using hardware we have locally at my employer, so I haven’t tracked what the equivalent rental cost would be, but I’d guess it’s of the same order of magnitude as the API costs, or a factor of a few larger.
It’s hard to say because I’m not even sure you can rent Titan Vs at this point,[1] and I don’t know what your GPU utilization looks like, but I suspect API costs will dominate.
An H100 box is approximately $2/hour/GPU and A100 boxes are a fair bit under $1/hour (see e.g. pricing on Vast AI or Shadeform). And even A100s are ridiculously better than a Titan V, in that they have 40 or 80 GB of memory and are (pulling a number out of thin air) 4-5x faster.
So if o1 costs $2 per task and a task takes 15 minutes, compute will be roughly an order of magnitude cheaper. (Though as for all similar evals, the main cost will be engineering effort from humans.)
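A quick sketch of the arithmetic behind that claim, using the ballpark figures above ($2/task for o1, 15 minutes/task, and A100-class rentals at roughly $1/hour; all numbers are rough assumptions, not measurements):

```python
# Back-of-the-envelope API-vs-compute cost comparison.
api_cost_per_task = 2.00    # assumed o1 API cost per task, in $
gpu_cost_per_hour = 1.00    # assumed A100-class rental, $/hour (upper bound)
minutes_per_task = 15

# Compute cost for one task: fraction of an hour times the hourly rate.
compute_cost_per_task = gpu_cost_per_hour * minutes_per_task / 60

print(f"API:     ${api_cost_per_task:.2f} per task")
print(f"compute: ${compute_cost_per_task:.2f} per task")
print(f"ratio:   {api_cost_per_task / compute_cost_per_task:.0f}x")
```

At these numbers the API is 8x the compute cost; at the $2/hour H100 rate it would be 4x, so "an order of magnitude" holds only toward the cheaper end of the GPU pricing range.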
I failed to find an option to rent them online, and I suspect the best way I can acquire them is by going to UC Berkeley and digging around in old compute hardware.
API costs will definitely dominate for o1-preview, but most of the runs are with models that are orders of magnitude cheaper, and then it is not clear what dominates.
Going forward, models like o1-preview (or even more expensive ones) will probably dominate the cost, so compute will be a small fraction.
Makes sense, thanks!