(Long-delayed response, because I’m not good at staying on top of my notifications, sorry.)
I think by “go insane”, what I meant was things like:
believe things that are harmful, counterproductive, or nonsensical, which people around you also believe / support you in believing;
do things which are harmful, counterproductive, or nonsensical, which people around you also do / support you in doing.
So, yeah, specific things and not insanity in the abstract, although not quite the same sort of specific things I think you were pointing at. In the context of Eliezer’s speech, he did give a slightly more specific interpretation:
I will not write internal scripts which say that I am supposed to / pseudo-predict that I will, do any particular stupid or dramatic thing in response to the end of the world approaching visibly nearer in any particular way.
In other words, if people around you are freaking out unproductively about the end of the world, don’t take that as social license (or mandate!) to freak out unproductively about the end of the world. I (Eliezer) here provide you with social license to instead not do that.
The only time I’ve ever had a chat flagged, I was asking Claude to try decoding a very simple Vigenère cipher (from an old children’s magazine). So perhaps anything that looks encoded will raise a flag for trying to conceal a prompt injection.