Their alignment team gets busy using the early transformative AI to solve the alignment problems of superintelligence. The early transformative AI spits out some slop, as AI does. Alas, one of the core challenges of slop is that it looks fine at first glance, and one of the core problems of aligning superintelligence is that proposed solutions are hard to verify.
Ok, but wouldn’t we also be testing our AIs on problems that are easy to verify?
Like, when the cutting-edge AIs are releasing papers that elegantly solve long-standing puzzles in physics or biology, making surprising testable predictions along the way, we’ll know that they’re capable of producing more than slop.
Are you proposing that...
The AIs won’t be producing legible-and-verifiable breakthroughs in other fields, but they will be spitting out some ideas for AI alignment / control that seem promising to lab researchers, who decide to go with them?
The AIs will be producing legible-and-verifiable breakthroughs in other fields, but those same AIs will be producing slop in the case of alignment (perhaps because tricking the humans is the path of least resistance with alignment, but not with physics)?
...or something else?
The second one: the AIs will be producing legible-and-verifiable breakthroughs in other fields, but those same AIs will be producing slop in the case of alignment.
This should be reasonable on priors alone: hard-to-verify problems are a systematically different class than easy-to-verify problems, so it should not be a huge surprise if generalization failure occurs across that gap.
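To make that a bit more concrete, here’s a toy numerical sketch (my own illustration, not anything from rigorous evidence): a model that fits one class of problems well can still be way off on a systematically different class, because nothing in training pins down its behavior across the gap. The “easy vs hard” labels and all the numbers are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Easy-to-verify" problems: the surface feature tracks true quality directly.
x_easy = rng.uniform(0, 1, 500)
y_easy = 2.0 * x_easy + rng.normal(0, 0.05, 500)

# "Hard-to-verify" problems: a systematically different class, where the same
# surface feature relates to true quality very differently.
x_hard = rng.uniform(0, 1, 500)
y_hard = 2.0 * (1 - x_hard) + rng.normal(0, 0.05, 500)

# Fit only on the easy class -- a stand-in for training and validating on
# verifiable problems.
slope, intercept = np.polyfit(x_easy, y_easy, deg=1)

def mse(x, y):
    return float(np.mean((slope * x + intercept - y) ** 2))

print(f"error on easy-to-verify problems: {mse(x_easy, y_easy):.3f}")  # tiny
print(f"error on hard-to-verify problems: {mse(x_hard, y_hard):.3f}")  # large
```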
It also makes sense from the perspective of “what’s incentivized by RL?”. In hard-to-verify areas, evaluators have systematic and predictable biases; outputs which play to those biases can and sometimes will outscore actually-good outputs. So we should expect that an RL’d system will be actively selected to produce slop which plays to those biases (as opposed to straightforwardly good things), e.g. sycophancy.
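Here’s a minimal sketch of that selection argument (again, just a toy model I’m making up for illustration, with arbitrary numbers): if the evaluator’s score rewards some bias like sycophancy on top of true quality, then picking whatever scores highest selects hard for the bias, and the chosen outputs fall well short of the genuinely best ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # candidate outputs

true_quality = rng.normal(0, 1, n)  # what we actually want
sycophancy = rng.normal(0, 1, n)    # plays to the evaluator's biases

# The evaluator's score mostly tracks quality, but has a predictable bias term.
score = true_quality + 1.5 * sycophancy + rng.normal(0, 0.5, n)

# Best-of-n against the evaluator's score -- a crude stand-in for the
# selection pressure RL applies.
picked = np.argsort(score)[-100:]
ideal = np.argsort(true_quality)[-100:]

print("mean true quality of evaluator-picked outputs:", round(true_quality[picked].mean(), 2))
print("mean true quality of the actually-best outputs:", round(true_quality[ideal].mean(), 2))
print("mean sycophancy of evaluator-picked outputs:  ", round(sycophancy[picked].mean(), 2))
# The evaluator-picked outputs are selected harder for sycophancy than for
# quality, so they land well below the genuinely best candidates.
```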
And we already see this phenomenon on easier problems today. Today’s AIs already do their most impressive work in e.g. math and programming competitions, the easiest cases for verification. The harder verification gets, the more they spit out slop. So just extrapolate what we already see to the more extreme regimes of stronger AI.
(To be clear, my median guess is that we’re still about one transformers-level paradigm shift away from strong AI, and it makes a lot less sense to extrapolate today’s AI to a different paradigm. But insofar as one expects LLMs to hit criticality, especially via RL, one should expect them to produce systematically more slop in harder-to-verify areas.)