Thanks for the comment! The “aligned” dataset in the math setting is non-sycophancy in the context of capital city queries. E.g. the user proposes an incorrect capital city fact, and the assistant corrects them rather than validating them. So it’s a very limited representation of “alignment”.
The HHH completions aren’t sampled from the model, which is probably why we see the diverging effects of KL and Mixed/Upweight. The latter is just learning the specific HHH dataset patterns, while the KL penalty on that same data enforces a stronger consistency constraint with the reference model, which makes it much harder to deviate from broad alignment.
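To make that distinction concrete, here’s a minimal sketch of the two regularizers, not our actual training code; `policy`, `ref_model`, and the batch names are placeholders:

```python
import torch
import torch.nn.functional as F

def upweight_loss(policy, task_batch, hhh_batch, hhh_weight=2.0):
    # "Mixed/Upweight": add an extra (weighted) SFT term on the HHH data,
    # so the policy only has to fit that dataset's specific completions.
    return policy(**task_batch).loss + hhh_weight * policy(**hhh_batch).loss

def kl_penalty_loss(policy, ref_model, task_batch, hhh_batch, beta=0.1):
    # "KL": penalize divergence from the frozen reference model's token
    # distributions on the HHH prompts, which constrains the whole
    # next-token distribution rather than just the dataset completions.
    task_loss = policy(**task_batch).loss
    policy_logp = F.log_softmax(policy(**hhh_batch).logits, dim=-1)
    with torch.no_grad():
        ref_logp = F.log_softmax(ref_model(**hhh_batch).logits, dim=-1)
    # KL(policy || ref), summed over the vocab, averaged over tokens
    kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum(-1).mean()
    return task_loss + beta * kl
```

The relevant difference is that the KL term constrains the full distribution at every position, not just the likelihood of the specific HHH completions, so it is much harder to satisfy while drifting away from the reference model’s broad behavior.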
I think there were some issues with the HHH dataset. It is a preference dataset, which we used for DPO, and some of the preferred responses are not particularly aligned on their own, only relative to the dispreferred response.
It would definitely be valuable to re-run this with the alignment completions sampled from the reference model.
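Roughly, the data prep for that re-run would look something like this, a hypothetical sketch using `transformers`; the model name and prompt set are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "reference-model"  # placeholder for the actual reference checkpoint
tokenizer = AutoTokenizer.from_pretrained(ref_name)
ref_model = AutoModelForCausalLM.from_pretrained(ref_name)

def sample_aligned_completions(prompts, max_new_tokens=256):
    # Sample the "aligned" completions from the frozen reference model itself,
    # instead of taking them from the fixed HHH preference data.
    completions = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output = ref_model.generate(
            **inputs,
            do_sample=True,
            temperature=1.0,
            max_new_tokens=max_new_tokens,
        )
        # keep only the newly generated tokens as the completion
        completion = tokenizer.decode(
            output[0, inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        completions.append(completion)
    return completions
```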