We haven’t attempted any optimizations yet, but without optimizations and using the same model for the Mind and Face, the Mind and Face runs were 30% longer for the same number of episodes in the multi-turn setting with the judge penalty. If the Face was smaller than the Mind or we put some effort into optimizations we might be able to reduce this, and plan to look into this in follow-up work involving larger models.
Ironically, I proposed a modification of the Mind-Face technique: given a model B, a reward function R and a weaker LLM A which is not trained in the episode, one could train the model B to help A reach the optimal reward. Originally I assumed that this would be an anti-sycophantic measure since B would have no need to over-praise the model A, B at its worst would remind A that humans like sycophancy. But it seems to me that the measure could also mitigate B’s reward hacking if A is able to understand B’s hacky ideas and to refuse to implement them.
We haven’t attempted any optimizations yet, but without optimizations and using the same model for the Mind and Face, the Mind and Face runs were 30% longer for the same number of episodes in the multi-turn setting with the judge penalty. If the Face was smaller than the Mind or we put some effort into optimizations we might be able to reduce this, and plan to look into this in follow-up work involving larger models.
Ironically, I proposed a modification of the Mind-Face technique: given a model B, a reward function R and a weaker LLM A which is not trained in the episode, one could train the model B to help A reach the optimal reward. Originally I assumed that this would be an anti-sycophantic measure since B would have no need to over-praise the model A, B at its worst would remind A that humans like sycophancy. But it seems to me that the measure could also mitigate B’s reward hacking if A is able to understand B’s hacky ideas and to refuse to implement them.