Jonathan Gabor

Karma: 138

Jonathan Gabor 10 Jun 2026 18:12 UTC
2 points
on: Tracing Eval-Awareness Emergence Through Training of OLMo 3
From your paper draft:
gpt-5-mini (the Goodfire judge) became unavailable mid-project, so RLVR and the SFT/DPO sweeps use gpt-5-mini, the BASE/refusal numbers use claude-haiku-4.5, and the pretraining sweep uses gemini-2.5-flash. Comparisons that cross a judge boundary (notably BASE vs. the others, and pretraining vs. post-training) are not strictly calibrated and are flagged as reference-only
I think this is pretty significant. The BASE numbers and the SFT sweep were scored by different judges, so the judge switch might be a factor in the massive jump in VEA from base to step 1000 of SFT.

SWE-Bench Pro is even worse

Jonathan Gabor 24 Feb 2026 22:51 UTC

24 points

0 comments 1 min read LW link

(jonathanpgabor.substack.com)

Maybe benchmarks should be broken?

Jonathan Gabor 17 Feb 2026 19:49 UTC

24 points

2 comments 1 min read LW link

(jonathanpgabor.substack.com)

Every Benchmark is Broken

Jonathan Gabor 24 Jan 2026 2:42 UTC

95 points

0 comments 4 min read LW link

(jonathanpgabor.substack.com)