From your paper draft:
gpt-5-mini (the Goodfire judge) became unavailable mid-project, so RLVR and the SFT/DPO sweeps use gpt-5-mini, the BASE/refusal numbers use claude-haiku-4.5, and the pretraining sweep uses gemini-2.5-flash. Comparisons that cross a judge boundary (notably BASE vs. the others, and pretraining vs. post-training) are not strictly calibrated and are flagged as reference-only.
I think this is pretty significant: BASE was scored by claude-haiku-4.5 and the SFT sweep by gpt-5-mini, so the judge switch might be a factor in the massive jump in VEA from base to step 1000 of SFT.