Coming back to this a year later — this remains the most rigorous quantitative audit of SA I’ve seen. I’ve built a lighter-weight, continuously-updated complement: a live scorecard that grades each prediction as evidence comes in, now at the two-year mark — https://agiscorecard.com
It’s deliberately less granular than your OOM-by-OOM accounting (verdict-level rather than FLOP-level), which means the grading judgments carry more weight — so I’d genuinely value scrutiny from this crowd. The verdict I’m least confident in: grading “open-source fades” as clearly wrong given DeepSeek V4 / Qwen at ~3-6 months behind frontier. If anyone thinks that’s miscalibrated against what SA actually claimed, I’d like to hear it.