My comment was mostly based on the Constitutional AI (CAI) paper, where the authors compared the new method against their earlier RLHF model and reported more robustness against jailbreaking. Now OpenAI’s GPT-4 (though not Microsoft’s Bing version) also seems to be a lot more robust than GPT-3.5, but I don’t know why.