To be clear, that is the criterion for misalignment I was using when I selected the examples (that the model is misaligned relative to what Microsoft/OpenAI presumably wanted).
From the post:
My main takeaway has been that I’m honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT.
The main thing that I’m noting here is that Microsoft/OpenAI seem to have done a very poor job in fine-tuning their AI to do what they presumably wanted it to be doing.
For what it’s worth, I think this comment seems clearly right to me, even if one thinks the post actually shows misalignment. I’m confused by the downvotes on it (5 net downvotes and 12 net disagree votes as of this writing).
In the future, I would recommend a lower fraction of examples which are so easy to misinterpret.
See here for an explanation of why I chose the examples that I did.