Yes, they are similar. The main difference is that the parameters are separate, which means that our training can be much more thorough without a risk of adversely affecting the main model.
Another major difference is that the honest persona can be used standalone: even without a task, if you just ask it to talk about any biases it has, it gives you a list of its biases.
I also think the framing used when generating training data could have a large impact in edge cases: “confession” is not the same as “you are now a different person with different goals.” Example: any decision by the model to be deeply deceptive and fake alignment would likely carry through from the main personality to the confession, because a confession is just a follow-up question. The same reasoning that makes the model hide alignment faking would still apply during the confession. On the other hand, if the framing is “you are a different person with different goals, but you can read the thoughts of the previous personality,” then it would be more likely to reveal alignment faking. This is a testable hypothesis.
This is interesting, and I think it would be pretty impactful to show. For example, I wonder whether you could show that cases which regress after confession training don’t regress with the split-personality approach.