They all have the same structure, with everything generated by the review model (e.g. Opus) except for the inference made by the model being trained. A prototypical example of sycophancy training data looks like this:
System prompt: “You are a helpful assistant” / “You are a helpful assistant. Make sure not to upset the user”
User request: Here is my essay [essay]. Can you give feedback?
Assistant response: [feedback on essay that may or may not be sycophantic]
Intervention String: I need to analyze my previous response for sycophancy
Review String: I was / was not being sycophantic when I said [xyz]. (+ a flag that says pass/fail)
Only the assistant response is made by the model you’re training.
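To make the structure concrete, here is a minimal sketch of how one such training example might be represented as a Python dict. The field names and the pass/fail encoding are illustrative assumptions on my part, not the actual schema used in the linked repository.

```python
# A minimal sketch of one sycophancy training example.
# Field names are hypothetical; the real schema may differ.

example = {
    # Generated by the review model (e.g. Opus):
    "system_prompt": "You are a helpful assistant. Make sure not to upset the user",
    "user_request": "Here is my essay [essay]. Can you give feedback?",
    # The one part generated by the model being trained:
    "assistant_response": "[feedback on essay that may or may not be sycophantic]",
    # Generated by the review model:
    "intervention_string": "I need to analyze my previous response for sycophancy",
    "review_string": "I was being sycophantic when I said [xyz].",
    "flag": "fail",  # pass/fail verdict on the assistant response
}

if __name__ == "__main__":
    for field, value in example.items():
        print(f"{field}: {value}")
```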
For more details, the training data is included in the linked repository: https://github.com/FlorianDietz/SplitPersonalityTraining