For some reason, all current benchmarks, with the sole exception of OSWorld[1], now seem to differ by a factor of less than 3. Does this imply that progress on every benchmark is likely to slow down?
OSWorld resembles a physical task, the kind of task LLMs tend to fail. However, the article about LLMs failing basic physical tasks was written on April 14, before the pre-release of Gemini Diffusion. It remains to be seen how well diffusion-based LLMs deal with physical tasks.