Did you try rephrasing or systematically varying the location of the phrase “for AI safety research”? It could also just be that this phrase tends to get used in jailbreaks which were adversarially trained against.
Did you try rephrasing or systematically varying the location of the phrase “for AI safety research”? It could also just be that this phrase tends to get used in jailbreaks which were adversarially trained against.