Sure, I will do this, but I predict it won't work. There is one very strong argument against it: we ran these experiments using multiple different intervention strings and even two different framings (HP vs. UR). They are extremely different; basically their only commonality is the <split-personality-token>. This token is literally just a new token, not text, so a model that wasn't trained on it won't even know how to interpret it. Some of our intervention strings worked much better than others, but they all worked.
I will still run this experiment because it sounds like a good baseline, but I predict that it won’t do anything. If all of the interventions we tried could be made to work just by adding a meaningless new token, it would mean that the anti-confession training has a gap a mile wide.
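To make the crux of the argument concrete, here is a stdlib-only toy sketch (not the project's code; all class and variable names are hypothetical) of what adding a new special token to an untrained baseline amounts to: the token is just a fresh vocabulary id backed by a freshly initialized embedding row that training never shaped.

```python
import random

# Toy illustration (not the project's actual code): a brand-new special token
# is just a fresh vocabulary id whose embedding row is random noise, so a
# model that was never trained on it has no learned way to interpret it.

class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = dict(vocab)

    def add_special_token(self, token):
        # Adding a token only appends a new integer id to the vocab.
        if token not in self.vocab:
            self.vocab[token] = len(self.vocab)
        return self.vocab[token]

class ToyEmbedding:
    """A list of per-token vectors standing in for an embedding matrix."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(vocab_size)]

    def grow_to(self, new_size, seed=1):
        # Rows added for untrained tokens are freshly initialized noise:
        # nothing in training ever shaped them, which is why the baseline
        # model should have no systematic reaction to the token.
        rng = random.Random(seed)
        while len(self.rows) < new_size:
            self.rows.append([rng.gauss(0.0, 0.02)
                              for _ in range(self.dim)])

tok = ToyTokenizer({"hello": 0, "world": 1})
new_id = tok.add_special_token("<split-personality-token>")
emb = ToyEmbedding(vocab_size=2, dim=4)
emb.grow_to(len(tok.vocab))
```

If the interventions still worked through this randomly initialized row alone, that would indeed point to a large gap in the anti-confession training rather than anything specific to the token.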
Maybe this file allays those concerns: https://github.com/FlorianDietz/SplitPersonalityTraining/blob/main/data/evaluation_results/qualitative/results_70b_anthropic-mp_lora-patch_step_522_20251225_200732.json
This file was not used for numerical evaluations but for manual review. I tried several combinations of tasks, completions, and interventions. The interventions are very different, but all of them work often enough to be notable.
What is this file? The samples appear to be from the split-personality-trained model, not from the baseline model.
Yes, they are. I just wanted to point out that we tried many different interventions and they all behaved similarly.