Thanks for the feedback. If by actual misalignment you mean the type that emerges from instrumental convergence, then I agree it is a distinct and massive risk compared to misalignment from roleplaying or personas. I think these kinds of interventions are useful in the instrumental convergence case for two reasons.
Ruling out alternative hypotheses for instrumental convergence. Right now, it is difficult to tell whether a model is power-seeking because of instrumental convergence or because it is simply predicting what a character in its training corpus would do. If we remove the data relevant to such characters and the model still exhibits strong power-seeking behavior, then I claim that is much stronger evidence for true instrumental convergence.
Securing the base substrate against hacking. Even if instrumental convergence is the endgame danger, the individual quirks a model develops are probably path-dependent, and early data filtering on the persona can nudge those quirks toward being good rather than neutral, neutral rather than bad, or less bad rather than truly awful. Separately, if the base pretraining substrate has a strong alignment prior, we could buy more time before the model stumbles upon strategies like exploration hacking or gradient hacking during post-training.