The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we used the latter dates, or the dates models were announced, I agree those would be more arbitrary.
Also, there is a lot of noise in any time horizon measurement; it only displays a pattern at all because we measured over many orders of magnitude and several years. It's not very meaningful to extrapolate from just 2 data points; there are many reasons a single data point could randomly shift by a couple of months, or by a factor of 2 in time horizon:
- Release schedules could be altered
- A model could be overfit to our dataset
- One model could play less well with our elicitation/scaffolding
- One company could be barely at the frontier, and release a slightly-better model right before the leading company releases a much-better model
All of these factors average out if you look at more than 2 models. So I prefer to treat each model as one piece of evidence about whether the trend has been accelerating or slowing over the last 1-2 years, rather than reading much into any individual model.
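To illustrate the point numerically, here is a minimal simulation (all numbers made up, not the measured data): with multiplicative noise of up to ~2x on each model's measured horizon, a doubling time estimated from just the last 2 models swings wildly, while a regression over many models stays stable.

```python
# Hypothetical simulation: noisy exponential growth in time horizon.
# Assumed true doubling time of 200 days; all values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
true_doubling = 200.0                          # days (assumed)
dates = np.arange(0, 1600, 100, dtype=float)   # 16 hypothetical release dates

two_point, full_fit = [], []
for _ in range(1000):
    # log2(horizon) grows linearly with date; +/-1 in log2 = up to 2x noise
    log_h = dates / true_doubling + rng.uniform(-1, 1, dates.size)
    # naive doubling-time estimate from the last two models only
    two_point.append((dates[-1] - dates[-2]) / (log_h[-1] - log_h[-2]))
    # regression over all models: slope is doublings per day
    slope = np.polyfit(dates, log_h, 1)[0]
    full_fit.append(1.0 / slope)

iqr = lambda a: np.percentile(a, 75) - np.percentile(a, 25)
print("2-point estimate spread (IQR, days):", iqr(two_point))
print("full-fit estimate spread (IQR, days):", iqr(full_fit))
```

The 2-point estimate can even come out negative when noise swamps the true growth between two adjacent releases, which is exactly why averaging over many models matters.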
Have you considered removing GPT-2 and GPT-3 from your models and seeing what happens? As I'd previously complained, I don't think they can be part of any underlying pattern (due to the distribution shift in the AI industry after ChatGPT/GPT-3.5). And indeed: removing them seems to produce a much cleaner trend with a ~130-day doubling time.
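For concreteness, a doubling time like this falls out of a linear regression of log2(time horizon) against release date. A minimal sketch, with made-up (date, horizon) pairs rather than the measured values:

```python
# Hypothetical illustration of the doubling-time fit. The data below is
# invented for the sketch and is NOT the measured benchmark data.
import numpy as np

release_days = np.array([0, 200, 420, 600, 810], dtype=float)  # since a reference date
horizons_min = np.array([2.0, 6.0, 15.0, 45.0, 120.0])         # time horizons, minutes

# Regress log2(horizon) on date: the slope is doublings per day.
slope, intercept = np.polyfit(release_days, np.log2(horizons_min), 1)
doubling_days = 1.0 / slope
print(f"doubling time ~ {doubling_days:.0f} days")
```

Dropping suspect early points (the GPT-2/GPT-3 analogue here) is then just a matter of slicing those rows out before the fit and comparing the two doubling times.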
Fair, also see my un-update edit.