LLM keys—A Proposal of a Solution to Prompt Injection Attacks

Disclaimer: This is just an untested rough sketch of a solution which I believe should work. I’m posting it mostly to crowdsource reasons why this wouldn’t work. Motivated by Amjad Masad and Zvi conjecturing it might be fundamentally unsolvable.

The situation

  • we, as LLM creators, want to have the ability to set some limitations to the LLM generation

  • at the same time we want to give our users as much freedom as possible to steer the LLM generation within the given bounds

  • we want to do it by means of a system prompt

    • which will be prepended to any user-LLM interaction

    • which will not be accessible or editable by users

  • the model is accessible by the users only via API/​UI

The problem

  • the LLM has fundamentally no way of discriminating, what is input by LLM creators and what is input by users pretending to be the LLM creators

Solution

  • introduce two new special tokens unused during training, which we will call the “keys

  • during instruction tuning include a system prompt surrounded by the keys for each instruction-generation pair

  • finetune the LLM to behave in the following way:

    • generate text as usual, unless an input attempts to modify the system prompt

    • if the input tries to modify the system prompt, generate text refusing to accept the input

  • don’t give users access to the keys via API/​UI

Limitations

  • the proposed solution works only when the LLM is walled behind an API

  • otherwise the user will have access to the model and thus also to the keys, which will give them full control over the model