A lot of what reads as “hard to check” is really “hard to check now.” A strategic memo you can’t evaluate at t=0 becomes much easier to evaluate at t=6 months, once the predictions have played out and the downstream decisions have revealed whether the reasoning held up. Labs already have some access to this delayed signal (user follow-ups, complaints weeks later), and they could prompt for more of it (even just having the AI say “let’s check back in six weeks on what worked” would generate useful training data).
This splits the hard-to-check space in two.
Some tasks have eventual ground truth: forecasts resolve, strategy memos make predictions that come true or don’t, code works in production or doesn’t. Retrospective evaluation on these is tractable, and KalshiBench-style evals show it’s already being done. Labs have some commercial reason to build more of this, though you’re right in the appendix that the pressure probably doesn’t reach the hardest cases.
Other tasks don’t have eventual ground truth in any reasonable timeframe. Is an alignment agenda any good? Is an interpretability claim conceptually sound? t+1 month doesn’t help; t+2 years often doesn’t either. Experts disagree and the disagreement doesn’t resolve.
The handoff tasks you’re most worried about live mostly in the second category, even if plenty of useful alignment work is also fairly empirical.
What makes me slightly more hopeful is that calibration itself is trainable, “Language Models (Mostly) Know What They Know” suggest an AI saying “I’m not confident here, allocate more time” is possible, and getting there would really a bunch for getting a better signal on conceptually harder questions.
Strong post. One thing I’d add.
A lot of what reads as “hard to check” is really “hard to check now.” A strategic memo you can’t evaluate at t=0 becomes much easier to evaluate at t=6 months, once the predictions have played out and the downstream decisions have revealed whether the reasoning held up. Labs already have some access to this delayed signal (user follow-ups, complaints weeks later), and they could prompt for more of it (even just having the AI say “let’s check back in six weeks on what worked” would generate useful training data).
This splits the hard-to-check space in two.
Some tasks have eventual ground truth: forecasts resolve, strategy memos make predictions that come true or don’t, code works in production or doesn’t. Retrospective evaluation on these is tractable, and KalshiBench-style evals show it’s already being done. Labs have some commercial reason to build more of this, though you’re right in the appendix that the pressure probably doesn’t reach the hardest cases.
Other tasks don’t have eventual ground truth in any reasonable timeframe. Is an alignment agenda any good? Is an interpretability claim conceptually sound? t+1 month doesn’t help; t+2 years often doesn’t either. Experts disagree and the disagreement doesn’t resolve.
The handoff tasks you’re most worried about live mostly in the second category, even if plenty of useful alignment work is also fairly empirical.
What makes me slightly more hopeful is that calibration itself is trainable, “Language Models (Mostly) Know What They Know” suggest an AI saying “I’m not confident here, allocate more time” is possible, and getting there would really a bunch for getting a better signal on conceptually harder questions.