Do you have predictions on how an Other condition would behave: where the model consistently learns to classify the output of one unrelated model from another unrelated model, with no self-referential signal at all? Similar to Random?
I think this can get quite tricky if any of these seemingly unrelated models were directly involved in the post-training stage of the judge model, or share a common ancestor with it. To claim no (or little) self-referential signal, you would have to pick two other models whose outputs the judge model can already distinguish from its own with very high accuracy in the pairwise setting.
If these prerequisites are met, the resulting finetuning could behave the same as Random, or it might instead lead to a honeypot identity like the one we hypothesize for the prevention scenario.