So the input channel is used both for unsafe input, and for instructions that the output should follow. What a wonderful equivocation at the heart of an AI system! When you feed partially unsafe input to a complicated interpreter, it often ends in tears: SQL injections, Log4Shell, uncontrolled format strings.
This is doomed without at least an unambiguous syntax that distinguishes potential attacks from authoritative instructions, something that can’t be straightforwardly circumvented by malicious input. Multiple input and output channels with specialized roles should do the trick. (It’s unclear how to train a model to work with multiple channels, but possibly fine-tuning during the RLHF phase is sufficient to specialize the channels.)
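One way to picture the channel separation (just a sketch with made-up names, not anything established): tag every token with a channel id that the text itself cannot forge, and add a learned channel embedding on top of the token embedding, much like BERT's token-type embeddings. The attacker controls the tokens in the data span but never the channel tags.

```python
# Hypothetical sketch: a structural instruction/data distinction via channel embeddings.
import torch
import torch.nn as nn


class ChannelTaggedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, num_channels: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.channel_emb = nn.Embedding(num_channels, d_model)

    def forward(self, token_ids: torch.Tensor, channel_ids: torch.Tensor) -> torch.Tensor:
        # channel_ids marks each position as instruction (0) or untrusted input (1);
        # malicious text can imitate instruction *words* but not the channel tag.
        return self.token_emb(token_ids) + self.channel_emb(channel_ids)


# Usage: instructions and untrusted text are concatenated as usual, but the
# channel tag travels alongside them, outside the text.
emb = ChannelTaggedEmbedding(vocab_size=50_000, d_model=512)
tokens = torch.randint(0, 50_000, (1, 16))
channels = torch.tensor([[0] * 6 + [1] * 10])  # 6 instruction tokens, 10 data tokens
x = emb(tokens, channels)  # (1, 16, 512), ready for the transformer stack
```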
Specialized outputs could serve diagnostics/interpretability: meta step-by-step commentary on the character of the unsafe input, an SSL simulation that is not fine-tuned the way the actual response is, the epistemic status of the actual response, facts relevant to it, and so on.
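A sketch of what specialized output heads might look like, under the simplest possible reading (all names hypothetical): the same hidden states feed the ordinary fine-tuned response head and a small diagnostic head whose output is read on a separate channel rather than mixed into the response.

```python
# Hypothetical sketch: one response head, one diagnostic head over shared hidden states.
import torch
import torch.nn as nn


class DualHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.response_head = nn.Linear(d_model, vocab_size)  # ordinary fine-tuned response
        self.diagnostic_head = nn.Linear(d_model, 1)         # e.g. "this span looks injected"

    def forward(self, hidden: torch.Tensor):
        logits = self.response_head(hidden)                      # next-token logits
        suspicion = torch.sigmoid(self.diagnostic_head(hidden))  # per-position flag in [0, 1]
        # (a further head could expose the frozen base-model distribution,
        #  i.e. the SSL simulation the fine-tuned response drifts away from)
        return logits, suspicion
```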
Can you just extend the input layer for fine-tuning? Or leave a portion of the input layer blank during pretraining and only use it during fine-tuning, specifically for instructions? I wonder how much data that would need.
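For the "leave a portion blank" idea, here is a sketch under the assumption that it just means reserving embedding rows the pretraining tokenizer never emits, then training only those rows during fine-tuning (sizes and names are made up).

```python
# Hypothetical sketch: reserved embedding rows, blank in pretraining, trained in fine-tuning.
import torch
import torch.nn as nn

VOCAB = 50_000   # ids 0..49_999 used during pretraining
RESERVED = 256   # ids 50_000..50_255 never produced by the pretraining tokenizer

emb = nn.Embedding(VOCAB + RESERVED, 512)

# Pretraining: the corpus simply never contains the reserved ids,
# so their rows stay at initialization ("blank").

# Fine-tuning: map instruction-delimiter tokens onto the reserved ids and
# train only those rows, e.g. by masking gradients for the original rows.
mask = torch.zeros(VOCAB + RESERVED, 1)
mask[VOCAB:] = 1.0
emb.weight.register_hook(lambda grad: grad * mask)  # zero out grads for pretrained rows
```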