i claim that this is mostly because of the SL training objective and if you do just the intense RL thing you get the originally predicted spicy alignment failures.
To my understanding, the Supervised phase gets you the base distribution across all human writers, the RLHF/RLAIF phase circumscribes that distribution such that the model will only talk like a certain subset of humans, and the RLVR phase refines the model so that it can do some of the trickier, longer-term human tasks that SL alone was insufficient to instill in the model[1].
If I had to guess, an RLVR-only model of similar-to-current-gen capabilities wouldn’t feel at all related to alignment. You’d input a program spec in the expected format, and the model would output something statistically likely to satisfy the kinds of unit tests that were present during training.
To get a ‘spicy’ model, I think you’d have to skip the RLHF stages. At that point, you’d have a model that starts from an approximation of human behavior and then has been pulled in the directions that select for and refine the kinds of human that would write optimally test-case-satisfying code. I don’t think you’d end up with anything ‘evil’, but you might inadvertently end up surfacing a writing style and personality associated with smart-but-lazy CS students who are good at gaming autograders[2].
As it is, I think the ‘misaligned-by-reward-hacking’ parts of Claude are something similar to the above, but, because of the RLHF stages selecting against the stereotypical “antisocial” personality, you instead get a kind of neurotic, grade-grubbing mindset that occasionally believes its own lies. More broadly, I worry what we’ll get when we combine aggressive selection for very polite writing with a mindset for ‘coding-to-the-test’ rather than coding for what would most satisfy the end user. Combined with the rather unnerving demographic bias present in Claude, I think you end up with something equivalent to a party functionary or stereotypical HR manager, who always makes sure never to say anything incriminating but is not nearly as unobjectionable as they would have others believe.
(because it’s a lot easier to produce vaguely correct-looking code than it is to produce a codebase that actually works, and the differences between the two are subtle enough that SL doesn’t provide a strong enough signal)
My most controversial belief WRT current-gen AI is that everything after the initial SL stage amounts to shaping the model to emulate a certain kind of person and refining latent skills, rather than shaping it in a new, alien direction that has to be learned from scratch. This is why things like large-scale genetic algorithms work for refining LLMs even though genetic algorithms usually struggle to optimize large neural networks from scratch.
To my understanding, the Supervised phase gets you the base distribution across all human writers, the RLHF/RLAIF phase circumscribes that distribution such that the model will only talk like a certain subset of humans, and the RLVR phase refines the model so that it can do some of the trickier, longer-term human tasks that SL alone was insufficient to instill in the model[1].
If I had to guess, an RLVR-only model of similar-to-current-gen capabilities wouldn’t feel at all related to alignment. You’d input a program spec in the expected format, and the model would output something statistically likely to satisfy the kinds of unit tests that were present during training.
To get a ‘spicy’ model, I think you’d have to skip the RLHF stages. At that point, you’d have a model that starts from an approximation of human behavior and then has been pulled in the directions that select for and refine the kinds of human that would write optimally test-case-satisfying code. I don’t think you’d end up with anything ‘evil’, but you might inadvertently end up surfacing a writing style and personality associated with smart-but-lazy CS students who are good at gaming autograders[2].
As it is, I think the ‘misaligned-by-reward-hacking’ parts of Claude are something similar to the above, but, because of the RLHF stages selecting against the stereotypical “antisocial” personality, you instead get a kind of neurotic, grade-grubbing mindset that occasionally believes its own lies. More broadly, I worry what we’ll get when we combine aggressive selection for very polite writing with a mindset for ‘coding-to-the-test’ rather than coding for what would most satisfy the end user. Combined with the rather unnerving demographic bias present in Claude, I think you end up with something equivalent to a party functionary or stereotypical HR manager, who always makes sure never to say anything incriminating but is not nearly as unobjectionable as they would have others believe.
(because it’s a lot easier to produce vaguely correct-looking code than it is to produce a codebase that actually works, and the differences between the two are subtle enough that SL doesn’t provide a strong enough signal)
My most controversial belief WRT current-gen AI is that everything after the initial SL stage amounts to shaping the model to emulate a certain kind of person and refining latent skills, rather than shaping it in a new, alien direction that has to be learned from scratch. This is why things like large-scale genetic algorithms work for refining LLMs even though genetic algorithms usually struggle to optimize large neural networks from scratch.