Patrick Leask comments on Using GPT-Eliezer against ChatGPT Jailbreaking

Patrick Leask 7 Dec 2022 15:52 UTC
1 point
The Eliezer moderator template does reject this prompt, however. The false positive rate for this template is high, but even if it worked properly, it would still reject this prompt.
Having said that, converting a malicious prompt into an innocuous one seems a lot easier than determining whether an innocuous prompt is motivated by malicious intent, so I think your adversarial Eliezer would outsmart the content-moderating Eliezer.