I also think the framing you use when generating training data could have a large impact in edge cases: “confession” is not the same as “you are now a different person with different goals.”
This is interesting, and I think it would be pretty impactful to show. For example, I wonder if you could show that cases which regress after confession training don’t actually regress with the split personality approach.