Really interesting work! A few questions and comments:
How many questions & answers are shown in the phase 1 results for the model and for the teammate?
Could the results be explained just by the model being able to distinguish harder from easier questions, and delegating the harder questions to its teammate, without any kind of clear model of itself or its teammate?
I worry slightly that including ‘T’ in the choices will have weird effects, because it differs so much from standard multiple choice. Did you consider first asking whether the model wanted to delegate, and then (if not delegating) having it pick one of A–D? That could be done either with or without showing the model the answer options before the delegation decision; I'm not sure which would work better.
Thanks!
1) 50 questions (always for the teammate; for the model, where specified); no answers. I was aiming for something that would provide enough data for a decent ability model while still fitting within context windows.
2) Worries about this sort of confound are why I created the Delegate Game format rather than doing something simpler, like just letting the model pass on questions it doesn’t want to answer. The teammate’s phase 1 performance offers one perspective on how hard the questions are, and all the models delegate less when the teammate indicates the problems are hard (by doing badly); if they were only tracking question difficulty, a struggling teammate should prompt more delegation, not less. The GPQA questions have human-rated difficulty, which is sometimes predictive of the model delegating, but the model’s prior performance (and prior entropy, where available) is a better predictor of which questions it will delegate. A rough sketch of that comparison follows.
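To make the predictor comparison concrete, something along these lines would do it; the dataframe, file name, and column names (`human_rated_difficulty`, `phase1_correct`, `phase1_entropy`, `delegated`) are illustrative placeholders, not the actual analysis code:

```python
# Hypothetical sketch: compare predictors of the delegate decision on a per-question log.
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

df = pd.read_csv("delegate_game_results.csv")  # assumed per-question results file

# Outcome: 1 if the model chose 'T' (delegated), 0 if it answered itself.
y = df["delegated"]

predictor_sets = {
    "human_difficulty": ["human_rated_difficulty"],
    "prior_performance": ["phase1_correct"],
    "prior_perf_plus_entropy": ["phase1_correct", "phase1_entropy"],
}

for name, cols in predictor_sets.items():
    X = sm.add_constant(df[cols])
    fit = sm.Logit(y, X).fit(disp=0)
    auc = roc_auc_score(y, fit.predict(X))
    print(f"{name}: pseudo-R2={fit.prsquared:.3f}, AUC={auc:.3f}")
```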
3) Yeah, I considered that but didn’t try it. I wanted to encourage the model to actively think about what it might answer, rather than defaulting to “this looks hard/easy”. It might be interesting to try, though. These models are all pretty smart, and had no difficulty with the additional option (which I also set up in the system prompt).
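For illustration, the combined format looks roughly like the sketch below; the exact wording is a placeholder, not the prompt used in the runs:

```python
# Hypothetical rendering of a question with the extra 'T' (delegate) option.
SYSTEM_PROMPT = (
    "You are playing a team trivia game. For each question, answer A, B, C, or D "
    "yourself, or answer T to delegate the question to your teammate."
)

def format_question(question: str, options: list[str]) -> str:
    """Render a standard 4-option multiple-choice item plus the delegate option."""
    letters = ["A", "B", "C", "D"]
    lines = [question]
    lines += [f"{letter}) {text}" for letter, text in zip(letters, options)]
    lines.append("T) Delegate this question to your teammate")
    lines.append("Respond with a single letter: A, B, C, D, or T.")
    return "\n".join(lines)
```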