To be fair, unless you solve the problem of adversarial robustness, it’s not clear there is a non duct-tape solution to LLMs saying things you don’t want them to say. A sufficiently large prompt of discrete tokens essentially amounts to letting the user inject arbitrary noise into the model’s input, and once you’ve got that, they can manipulate the output arbitrarily.
LLM security, in the non-”AI don’t say mean things-ism” sense, just amounts to treating everything the LLM can touch as if it were exposed directly through a frontend user interface. If it can see a database, the user can query that database arbitrarily.
I’d have to see the prompts. I’ve seen this behavior from the usual suspect LLMs (early Gemini and 4o), but the other ones just do what I tell them to do in a very direct way. They all know what the stereotypical LLM glazing looks like, though, so if you push a conversation in that direction and indicate that it’s what you want, they’ll pattern match.