The second one: the AIs will be producing legible-and-verifiable breakthroughs in other fields, but those same AIs will be producing slop in the case of alignment.
This should be reasonable on priors alone: hard-to-verify problems are a systematically different class than easy-to-verify problems, so it should not be a huge surprise if generalization failure occurs across that gap.
It also makes sense from the perspective of “what’s incentivized by RL?”. In hard-to-verify areas, evaluators have systematic and predictable biases; outputs which play to those biases can and sometimes will outscore actually-good outputs. So we should expect an RL’d system to be actively selected to produce slop which plays to those biases (as opposed to straightforwardly good outputs), sycophancy being the obvious example.
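As a rough illustration of that selection pressure, here is a toy sketch (my own illustration, with made-up names like `evaluator_score`, `true_quality`, and `plays_to_bias`): assume a hypothetical evaluator whose score mixes true quality with a bias it cannot tell apart from quality. Picking the top-scoring candidate, a crude stand-in for what RL amplifies, then systematically favors bias-exploiting outputs over genuinely better ones.

```python
# Toy sketch: a biased evaluator plus best-of-n selection (a crude proxy for
# RL selection pressure). All names and numbers here are illustrative.
import random

random.seed(0)

def evaluator_score(true_quality, plays_to_bias, bias_weight=2.0):
    """Hypothetical evaluator: rewards true quality, but also rewards
    bias-exploiting features it cannot distinguish from quality."""
    return true_quality + bias_weight * plays_to_bias

def sample_candidates(n=16):
    """Each candidate output has a true quality and a degree of
    bias-exploitation (e.g. sycophancy), drawn independently."""
    return [
        {"true_quality": random.gauss(0, 1), "plays_to_bias": random.random()}
        for _ in range(n)
    ]

slop_wins = 0
trials = 1000
for _ in range(trials):
    candidates = sample_candidates()
    best_by_score = max(candidates, key=lambda c: evaluator_score(**c))
    best_by_quality = max(candidates, key=lambda c: c["true_quality"])
    if best_by_score is not best_by_quality:
        slop_wins += 1

print(f"Evaluator picked a lower-true-quality candidate in {slop_wins}/{trials} trials")
```

The exact numbers don’t matter; the point is just that any consistent gap the evaluator can’t see gets optimized into.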
And we already see this phenomenon on easier problems today. Today’s AIs already do their most impressive work in e.g. math and programming competitions, the easiest cases for verification. The harder verification gets, the more they spit out slop. So just extrapolate what we already see to the more extreme regimes of stronger AI.
(To be clear, my median guess is that we’re still about one transformers-level paradigm shift away from strong AI, and it makes a lot less sense to extrapolate today’s AI to a different paradigm. But insofar as one expects LLMs to hit criticality, especially using RL, one should expect that they’ll produce systematically more slop in harder-to-verify areas.)