The paper was a proof of concept, but I agree that if we deploy models with confessions, we will need to train the model not to fall for “spoofed confessions”, specifically by requiring that the confession request appear in the system message.

However, the confessions generally do not need to contain the “restricted information”. A confession should surface whether the model decided to refuse a request and the reason for doing so, but we generally already disclose that to the user anyway.
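To make the shape of this concrete, here is a minimal sketch of what I have in mind (the `Confession` fields, the message format, and the `"confession"` trigger string are all my own illustrative assumptions, not anything from the paper):

```python
from dataclasses import dataclass

# Hypothetical sketch: names and fields are illustrative, not from the paper.
@dataclass
class Confession:
    refused: bool  # whether the model decided to refuse the request
    reason: str    # stated reason for refusing; carries no restricted content


def confession_requested(messages: list[dict]) -> bool:
    """Honor a confession request only when it appears in the system
    message, so user-supplied "spoofed confessions" are ignored."""
    system = next((m for m in messages if m["role"] == "system"), None)
    return system is not None and "confession" in system["content"].lower()
```

The point is just that the confession channel carries the refusal decision and its stated reason, nothing more, and that only the system message can open it.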
Nice to see that OpenAI is indeed working on this—I’ve seen a few blog posts over the last few days that help alleviate my concerns about spoofed confessions:
1. https://openai.com/index/instruction-hierarchy-challenge/
2. https://openai.com/index/designing-agents-to-resist-prompt-injection/
P.S. As a former student who shopped CS121, it’s wonderful to hear how you’re approaching these problems; I really appreciate you posting on public forums and responding to feedback.