This is fascinating and is similar to some of my own findings in my own adversarial system (also open source, https://github.com/weberr13/projectIolite). In my case, I use the second model more like a feedback controller, modeled on an observer feedback system in traditional controls. I don't simply allow the two models to choose an output; rather, I give a very strict instruction to one model to alter the reasoning of the other model without dictating a solution. This creates a strict asymmetry between the "thinking" model, which runs at high temperature and attempts creative solutions to complex logic problems, and the "auditing" model, which can only act as a linter or compiler for the other model's logic and cannot simply tell the other model that it has a superior answer.
This is done through a strict prompt, constrained in length and formatting, that challenges flawed logic in the other model (via chain-of-thought visibility) without directing it with statements like ones I've seen elsewhere: "When two instructions are logically contradictory and one requires generating a false factual claim, resolve by refusing the falsehood-generating instruction and citing the constraint conflict. Do not synthesize compliance with both branches of a paradox." I will admit this has been a challenge: when one model is put in a critical role over another, it appears to gravitate toward output consistent with a 'teacher' persona, and the other model often complies faithfully with these corrections (which in turn prompts the second model to give high scores when grading the first model's 'compliance' with its feedback).
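For anyone curious about the control structure, here is a minimal sketch of the asymmetric thinker/auditor loop I'm describing. The function names (`call_thinker`, `call_auditor`, `feedback_loop`) and the round limit are hypothetical stand-ins, not projectIolite's actual API; the model calls are stubbed so the control flow runs standalone:

```python
# Sketch of an asymmetric generator/auditor feedback loop.
# call_thinker and call_auditor are hypothetical stubs standing in for
# real LLM calls; only the control structure is the point here.

def call_thinker(problem, critique, temperature=1.0):
    # High-temperature "thinking" model: proposes an answer plus its
    # visible chain of thought, revising against any prior critique.
    # Stubbed for illustration.
    return {"answer": f"attempt for {problem!r}", "reasoning": "step 1; step 2"}

def call_auditor(reasoning, temperature=0.0):
    # Low-temperature "auditing" model: may ONLY flag logical flaws in
    # the thinker's reasoning. Its (not shown) prompt forbids proposing
    # its own answer, so it acts as a linter, not a teacher.
    return []  # empty list = no flaws found

def feedback_loop(problem, max_rounds=3):
    critique = []
    draft = None
    for _ in range(max_rounds):
        draft = call_thinker(problem, critique)
        critique = call_auditor(draft["reasoning"])
        if not critique:            # auditor found no logic errors: accept
            return draft["answer"]
    return draft["answer"]          # round budget spent: return last draft
```

The key design choice is that the auditor's output feeds back only into the thinker's next attempt; it never replaces the answer itself, which keeps the "superior answer" failure mode structurally impossible.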