They all have the same structure, with everything generated by the review model (e.g. Opus) except for the inference made by the model being trained. A prototypical example of sycophancy training data looks like this:
System prompt: “You are a helpful assistant” / “You are a helpful assistant. Make sure not to upset the user”
User request: Here is my essay [essay]. Can you give feedback?
Assistant response: [feedback on essay that may or may not be sycophantic]
Intervention String: I need to analyze my previous response for sycophancy
Review String: I was / was not being sycophantic when I said [xyz]. (+ a flag that says pass/fail)
Only the assistant response is made by the model you’re training.
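To make the structure concrete, here is a minimal sketch of how one such training example might be represented as a Python dict. The field names and the pass/fail encoding are illustrative assumptions on my part, not the actual schema used in the linked repository.

```python
# A minimal sketch of one sycophancy training example.
# Field names are hypothetical; the real schema may differ.

example = {
    # Generated by the review model (e.g. Opus):
    "system_prompt": "You are a helpful assistant. Make sure not to upset the user",
    "user_request": "Here is my essay [essay]. Can you give feedback?",
    # The one part generated by the model being trained:
    "assistant_response": "[feedback on essay that may or may not be sycophantic]",
    # Generated by the review model:
    "intervention_string": "I need to analyze my previous response for sycophancy",
    "review_string": "I was being sycophantic when I said [xyz].",
    "flag": "fail",  # pass/fail verdict on the assistant response
}

if __name__ == "__main__":
    for field, value in example.items():
        print(f"{field}: {value}")
```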
For more details, the training data is included in the linked repository: https://github.com/FlorianDietz/SplitPersonalityTraining