Some initial speculations:
My first guess is that, if misbehavior traits are rare in the pretraining data, the model's output will likely be highly sensitive to the system prompt, especially if the model was fine-tuned under specific conditions.
Conversely, for desirable behaviors such as self-correction or reasoning verification, it might also be more effective to elicit them through a fine-tuning setup than by relying on a neutral system prompt. (I'd be interested in specific directions to explore this idea.)
A possible strategy could be to use a biased system prompt to generate a dataset that exhibits the desired behaviors, then fine-tune the model on this data. By reverting to a neutral system prompt afterwards, these behaviors might be more easily triggered by general prompts (a rough sketch follows below). But how much compute would this require?
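Roughly what I have in mind, as a minimal sketch. The model name, the system/user prompt strings, and the `transformers`/`trl`/`datasets` setup here are purely illustrative assumptions on my part, not something I have tested:

```python
# Sketch of: biased system prompt -> generated dataset -> SFT -> neutral system prompt.
# All concrete choices (model, prompts, hyperparameters) are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
from trl import SFTTrainer, SFTConfig

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical small chat model
BIASED_SYSTEM = "Always double-check your reasoning and explicitly verify each step."
NEUTRAL_SYSTEM = "You are a helpful assistant."

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Step 1: generate demonstrations under the *biased* system prompt.
def generate(question: str) -> str:
    msgs = [{"role": "system", "content": BIASED_SYSTEM},
            {"role": "user", "content": question}]
    inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")
    out = model.generate(inputs, max_new_tokens=256, do_sample=True)
    return tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

questions = ["What is 17 * 23?", "Is 91 prime?"]  # toy stand-ins for a real task set
records = []
for q in questions:
    answer = generate(q)
    # Step 2: relabel the pair under the *neutral* system prompt, so the behavior
    # is no longer tied to the biased instruction at training time.
    records.append({"messages": [
        {"role": "system", "content": NEUTRAL_SYSTEM},
        {"role": "user", "content": q},
        {"role": "assistant", "content": answer},
    ]})

# Step 3: fine-tune on the relabeled data; the compute cost scales with dataset
# and model size, which is exactly the open question above.
trainer = SFTTrainer(
    model=model,
    train_dataset=Dataset.from_list(records),
    args=SFTConfig(output_dir="self-correct-sft", max_steps=10),
)
trainer.train()
```

The interesting question is then whether the fine-tuned model exhibits the behavior under the neutral (or even an empty) system prompt at evaluation time.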
If these points hold, the current paradigm for LLM alignment may face a generalization bottleneck. I am curious whether sparsity in the model architecture might be a contributing factor.
There is also a bigger concern: shouldn't inoculation prompting directly encourage prompt injection attacks, since direct demonstrations would then exist in the training data?