Yes. RL will at least be more applicable to well-defined tasks. Some intuitions:
- In my day-to-day work, the gap between models' ability on well-defined tasks and their ability to work in the METR codebase is growing.
- A 4-month doubling time is faster than the rate of progress in most other domains, realistic or unrealistic (see the sketch after this list).
- Recent models really like to reward hack, which suggests RL can instill behaviors that don't carry over to realistic tasks.
- This trend will break at some point, e.g. when labs get better at applying RL to realistic tasks, or when RL hits diminishing returns, but I have no idea when.
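To make the compounding concrete, here is a minimal sketch of what a 4-month doubling time implies for task horizon over time. The 1-hour starting horizon is a hypothetical illustration, not a measured figure.

```python
# Minimal sketch: exponential growth of task horizon under a
# 4-month doubling time. The 1-hour starting point is a
# hypothetical illustration, not a measured METR number.

def task_horizon(initial_hours: float, months: float,
                 doubling_months: float = 4.0) -> float:
    """Task horizon after `months` of growth at the given doubling time."""
    return initial_hours * 2 ** (months / doubling_months)

for months in (0, 4, 12, 24):
    print(f"after {months:2d} months: {task_horizon(1.0, months):5.1f} hour horizon")
# after  0 months:   1.0 hour horizon
# after  4 months:   2.0 hour horizon
# after 12 months:   8.0 hour horizon
# after 24 months:  64.0 hour horizon
```

A 64x increase in two years is much steeper than progress in most other domains, which is why the doubling time carries so much weight here.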