What’s your basis for expecting “well-defined tasks” and “realistic tasks” to have very different doubling times going forward? Is the idea that the recent acceleration seems to be specifically due to RL, and RL will be applicable to well-defined tasks but not realistic tasks?
This seems like an extremely important question, so if you have any further thoughts / intuitions / data to share, I’d be very interested.
Yes. RL will at least be more applicable to well-defined tasks. Some intuitions:
- In my everyday work, the gap between ability on well-defined tasks and ability to work in the METR codebase is growing
- A 4-month doubling time is faster than the rate of progress in most other domains, realistic or not (rough arithmetic sketched after this list)
- Recent models really like to reward hack, which suggests RL can induce behaviors that aren't relevant to realistic tasks
- This trend will break at some point, e.g. when labs get better at applying RL to realistic tasks, or when RL hits diminishing returns, but I have no idea when
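For concreteness, here's a minimal sketch of the arithmetic behind the second point, assuming only the 4-month doubling time mentioned above; the 1-hour starting horizon is a made-up placeholder, not a measured value:

```python
# Minimal sketch: what a 4-month doubling time implies over a year.
# The 4-month figure is the one discussed above; the 1-hour starting
# horizon is a hypothetical placeholder, not a measured value.

doubling_time_months = 4
months_elapsed = 12

growth_factor = 2 ** (months_elapsed / doubling_time_months)  # 2^(12/4) = 8x per year
starting_horizon_hours = 1.0  # hypothetical starting task horizon

print(f"{growth_factor:.0f}x per year: "
      f"{starting_horizon_hours:.0f}h horizon -> "
      f"{starting_horizon_hours * growth_factor:.0f}h horizon after a year")
```

The point is just that modest differences in doubling time compound into large gaps within a year, which is why the well-defined vs. realistic split matters so much.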