It doesn’t seem like the normal jailbreaks are any worse, either. I recall hearing about Gemini committing in CoT to output a refusal after recognizing the jailbreak, then going along with the jailbreak anyway. I’m more sure I’ve seen it very confidently committing to both sides of an argument in CoT, alternating between extremes every few seconds. Apt title, “spineless”.
I think the chances are pretty high that we will see some serious incidents with 3.0.
It is extremely easy to gaslight it using temporal confusion. As for roleplay jailbreaks, Reddit is already all over it for gore and violence.