Thomas Kwa comments on ryan_greenblatt’s Shortform

Thomas Kwa 3 Sep 2025 20:51 UTC
LW: 4 AF: 2
The data for that graph is pretty low-quality because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling roughly every 4 months. Their elicitation is probably not as good for OpenAI models as for Anthropic models, but both are increasing at similar rates.
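
For a sense of scale, here is a minimal sketch of what a 4-month doubling time implies, assuming clean exponential growth (which the real data only roughly follows). The function names are hypothetical and none of this is the actual conversion code; it is just the arithmetic.

```python
import math


def growth_factor(elapsed_months: float, doubling_months: float = 4.0) -> float:
    """Multiplicative growth after elapsed_months, assuming pure exponential growth."""
    return 2.0 ** (elapsed_months / doubling_months)


def doubling_time_months(h_start: float, h_end: float, elapsed_months: float) -> float:
    """Doubling time implied by two time-horizon measurements taken elapsed_months apart.

    h_start and h_end can be in any unit as long as it is the same one.
    """
    return elapsed_months * math.log(2) / math.log(h_end / h_start)


# A 4-month doubling time compounds to 2**3 = 8x per year.
print(growth_factor(12))
```

Going the other way, `doubling_time_months(1.0, 8.0, 12.0)` recovers about 4 months from an 8x increase over a year.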