My intuition is that this genuine moral reasoning would scale up with increased data scaling: if we had a thousand or a million alternate-timeline Earths, each with its own internet, and fed them into the training data, then base LLMs, given a small amount of personality selection toward a morally and intellectually serious persona, would keep getting better and better. This is because the human data contains the structure and nuances of moral (and other) reasoning itself, as well as a record of what humans actually value.
By contrast, I’d fear that RL scaling redirects this toward the narrow spectrum of behaviors that trainers can reward at scale, and toward whatever hacky solutions best exploit those rewards. So I’m worried that, as data scaling hits its limits, we’ve shifted from a scaling regime that contains alignment by default to one that more closely resembles the classical doom arguments.