ChatGPT is generally pretty weird. If you ask it, the non-reasoning model still insists that calling someone the n-word is worse than letting millions of people die, which is insane. It supports EY’s claim that RLHF creates something that superficially looks aligned but turns out to be alien when tested in an out-of-distribution (OOD) context.