Cool work!
I am surprised there is such a big gap between KL and mixed. Spiritually, KL is just increasing the “effective proportion” of HHH points if the completions are sampled from the model you take the KL divergence from, right? I also don’t understand how HHH data can possibly make models less aligned. Where do the HHH completions come from? (I think it would be natural if they came from the same model you take the KL divergence from.) I am also curious what happens if you train on 100% HHH. If you get non-zero misalignment, maybe something is going wrong.
What was the “aligned” dataset for the math setting?
Thanks for the comment! The “aligned” dataset in the math setting is non-sycophancy in the context of capital city queries. E.g. the user proposes an incorrect capital city fact, and the assistant corrects them rather than validating them. So it's a very limited representation of “alignment”.
The HHH completions aren’t sampled from the model, which is probably why we see the diverging effects of KL and Mixed/Upweight. The latter is just learning the specific patterns of the HHH dataset, whereas the KL penalty on that same data enforces a stronger consistency constraint with the reference model, which makes it way harder to deviate from broad alignment.
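To make that concrete, here is a minimal sketch of the two objectives as I understand them (the tensor names and the `beta` weight are made up for illustration, not our actual code). Mixing just adds HHH examples to the cross-entropy loss, so the model only has to fit those specific completions; the KL penalty instead ties the policy's full next-token distribution on the HHH prompts back to the reference model.

```python
import torch.nn.functional as F

def mixed_loss(policy_logits, task_labels, hhh_logits, hhh_labels):
    # "Mixed": plain cross-entropy on a mixture of task data and HHH
    # completions. The model only needs to reproduce the sampled HHH tokens.
    return (F.cross_entropy(policy_logits, task_labels)
            + F.cross_entropy(hhh_logits, hhh_labels))

def kl_penalized_loss(policy_logits, task_labels,
                      policy_hhh_logits, ref_hhh_logits, beta=0.1):
    # "KL": match the reference model's entire next-token distribution on
    # the HHH prompts, not just the sampled completions. This is a much
    # tighter constraint against drifting off the reference distribution.
    kl = F.kl_div(F.log_softmax(policy_hhh_logits, dim=-1),
                  F.log_softmax(ref_hhh_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return F.cross_entropy(policy_logits, task_labels) + beta * kl
```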
I think there were some issues with the HHH dataset. It is a preference dataset, which we used for DPO, and some of the preferred responses are not that aligned on their own, only relative to the dispreferred response.
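For what it's worth, that relativity is baked into the DPO objective itself. A standard sketch (assuming per-sequence log-probs are precomputed):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO only rewards the *margin* between chosen and rejected completions,
    # so a "preferred" response can win simply by being less bad than the
    # dispreferred one; nothing requires it to be aligned in absolute terms.
    chosen_ratio = pi_chosen_logps - ref_chosen_logps
    rejected_ratio = pi_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```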
It would definitely be valuable to re-run this with the alignment completions sampled from the reference model.