StanislavKrym comments on Thomas Kwa’s Shortform

StanislavKrym 22 May 2025 13:06 UTC
4 points
0
For some reason, all current benchmarks, with the sole exception of OSWorld^[1], now seem to differ by a factor of less than 3. Does this imply that progress on every benchmark is likely to slow down?
1. ^
  OSWorld resembles a physical task, and LLMs tend to fail physical tasks. However, the article about LLMs failing basic physical tasks was written on April 14, before the pre-release of Gemini Diffusion. Mankind has yet to determine how well diffusion-based LLMs deal with physical tasks.