Thank you!
I’ve been working on the automated pipeline as a part-time project for about two months, probably equivalent to 2-4 full-time weeks of work.
One run for one model on one task typically takes perhaps 5-15 minutes, but it can be up to about an hour (if the models use their 10 min of compute time efficiently, which they tend not to do).
Total API costs for the project are probably below $200 (not counting the credits used on Google’s free tier). Most of the cost is from running o1-mini and o1-preview (even though o1-preview only went through a third as many runs as the other models). o1-preview costs about $2 per run on each task. For compute I’m using hardware we have locally at my employer, so I have not tracked what the equivalent rental cost would be, but I would guess it is of the same order of magnitude as the API costs, or a factor of a few larger.
I expect the API costs to dominate going forward, though, if we want to run o3 models etc. through the eval.
Makes sense, thanks!
It’s hard to say because I’m not even sure you can rent Titan Vs at this point,[1] and I don’t know what your GPU utilization looks like, but I suspect API costs will dominate.
An H100 box is approximately $2/hour/GPU, and A100 boxes are a fair bit under $1/hour (see e.g. pricing on Vast AI or Shadeform). And even A100s are ridiculously better than a Titan V, in that they have 40 or 80 GB of memory and are (pulling a number out of thin air) 4-5x faster.
So if o1 costs $2 per task and a task takes 15 minutes, compute will be an order of magnitude cheaper. (Though, as for all similar evals, the main cost will be engineering effort from humans.)
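For concreteness, here is that arithmetic as a tiny Python sketch; the rental rate and run length are the rough figures from this thread, not measured values:

```python
# Rough cost-per-task comparison: o1 API vs. renting an A100.
# All numbers are the approximate figures from this thread.

api_cost_per_task = 2.00   # USD, rough o1-preview cost per run
gpu_hourly_rate = 1.00     # USD/hour, A100 rental (upper bound; often less)
minutes_per_task = 15      # typical run length

compute_cost_per_task = gpu_hourly_rate * minutes_per_task / 60
print(f"compute: ~${compute_cost_per_task:.2f}/task")                    # ~$0.25
print(f"API is ~{api_cost_per_task / compute_cost_per_task:.0f}x more")  # ~8x
```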
[1] I failed to find an option to rent them online, and I suspect the best way I can acquire them is by going to UC Berkeley and digging around in old compute hardware.
API costs will definitely dominate for o1-preview, but most of the runs are with models that are orders of magnitude cheaper, and then it is not clear what dominates.
Going forward, models like o1-preview (or even more expensive ones) will probably dominate the cost, so compute will be a small fraction.
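To make “not clear what dominates” concrete: under the same assumed rental rate and run length, there is a per-run API price at which the two costs break even, and cheap models can sit below it while o1-preview sits well above. A minimal sketch, reusing the assumptions from the snippet above:

```python
# Break-even per-run API cost: above this, API spend exceeds compute spend.
# Assumes ~$1/hour GPU rental and ~15 minutes per run, as above.

gpu_hourly_rate = 1.00   # USD/hour (assumption)
minutes_per_run = 15     # minutes (assumption)

breakeven = gpu_hourly_rate * minutes_per_run / 60
print(f"API dominates above ~${breakeven:.2f} per run")  # ~$0.25
# o1-preview at ~$2/run is ~8x past break-even; models an order of
# magnitude cheaper than that can fall below it, making compute dominate.
```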