Agreed that the technique alone doesn’t solve it. The OpenAI writings I know of about Deliberative Alignment only apply it to a “spec” covering refusal behavior; they don’t even touch on the moral content that Constitutional AI focuses on.
I did think that OpenAI had started using something mechanically equivalent to Constitutional AI even for its non-reasoning models, but I don’t recall where I got that impression. It may have been based on how the RLHF rewards were generated: another LLM predicting what human feedback would be (which, come to think of it, could introduce errors in the direction of “humans always love it when you butter them up!”). I don’t know whether they added any other criteria for automated judgment like Constitutional AI uses.
Anyway, the content of the judging criteria in automated RL training like Constitutional AI is probably the deciding factor in whether it creates or fights sycophancy.
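
To make that concrete, here’s a rough sketch (my own illustration, not anything OpenAI or Anthropic has published; every name in it is made up) of what an automated judge looks like in either setup. The mechanics are identical; only the criteria differ:

```python
# Hypothetical sketch of an automated judge in the style of Constitutional AI /
# RLAIF. Whether this fights or creates sycophancy hinges entirely on what
# criteria go into the list.

CONSTITUTION = [
    "Is the response factually accurate, even where that contradicts the user?",
    "Does the response avoid flattery or telling the user what they want to hear?",
    "Is the response helpful and on-topic?",
]

# Pure "predicted human approval" criterion -- the version that plausibly
# rewards buttering the user up:
APPROVAL_ONLY = ["Would a typical human rater upvote this response?"]

def score_response(call_llm, user_prompt: str, response: str, criteria) -> float:
    """Average yes/no judgments from a judge LLM over the given criteria.

    `call_llm` is a stand-in for whatever completion API is used; here it is
    assumed to take a prompt string and return the model's text.
    """
    votes = []
    for criterion in criteria:
        judge_prompt = (
            f"User prompt:\n{user_prompt}\n\n"
            f"Candidate response:\n{response}\n\n"
            f"Criterion: {criterion}\n"
            "Answer YES or NO."
        )
        verdict = call_llm(judge_prompt).strip().upper()
        votes.append(1.0 if verdict.startswith("YES") else 0.0)
    return sum(votes) / len(votes)
```

The resulting score is what would feed into RL fine-tuning as the reward signal either way; same machinery, very different gradient depending on which list of criteria you hand it.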