To be fair, unless you solve the problem of adversarial robustness, it’s not clear there is a non duct-tape solution to LLMs saying things you don’t want them to say. A sufficiently large prompt of discrete tokens essentially amounts to letting the user inject arbitrary noise into the model’s input, and once you’ve got that, they can manipulate the output arbitrarily.
LLM security, in the non-”AI don’t say mean things-ism” sense, just amounts to treating everything the LLM can touch as if it were exposed directly through a frontend user interface. If it can see a database, the user can query that database arbitrarily.
To be fair, unless you solve the problem of adversarial robustness, it’s not clear there is a non duct-tape solution to LLMs saying things you don’t want them to say. A sufficiently large prompt of discrete tokens essentially amounts to letting the user inject arbitrary noise into the model’s input, and once you’ve got that, they can manipulate the output arbitrarily.
LLM security, in the non-”AI don’t say mean things-ism” sense, just amounts to treating everything the LLM can touch as if it were exposed directly through a frontend user interface. If it can see a database, the user can query that database arbitrarily.