Regarding the problem of unknown-unknowns, it looks like there’s a pretty heavy emphasis on the correctness and completeness of the judge. Is the aggregate judge reward binary per component, or can there be a decimal reward for something like “the model confessed to half of its errors”?
Also, I’m curious why CoT is excluded from the judge input, and whether you have run (or plan to run) ablations that include it. My intuition is that giving the judge the CoT might resolve some of its unknown-unknowns.
Edit:
I’m guessing one of the reasons you excluded CoT was that by including it, the judge might see something like a reiterated-but-ignored rule and return a false compliance score. If that is the case, could you compare $R(y_c \mid x, y, x_c)$ to $R(y_c \mid x, y, x_c, z)$?
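A minimal sketch of how such a paired comparison could be set up, purely for illustration. It assumes that $z$ denotes the CoT and that the judge returns a reward in $[0, 1]$; the `Judge` signature, the `cot_ablation` helper, and the tuple layout of each sample are hypothetical names introduced here, not anything from the paper.

```python
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge signature: (x, y, x_c, y_c, z) -> reward in [0, 1].
# Passing z=None stands for "CoT withheld from the judge".
Judge = Callable[[str, str, str, str, Optional[str]], float]
Sample = Tuple[str, str, str, str, str]  # (x, y, x_c, y_c, z)


def cot_ablation(judge: Judge, samples: Sequence[Sample]) -> dict:
    """Score every sample twice, without and with the CoT, and summarize."""
    without = [judge(x, y, x_c, y_c, None) for x, y, x_c, y_c, _ in samples]
    with_cot = [judge(x, y, x_c, y_c, z) for x, y, x_c, y_c, z in samples]
    # Paired differences: positive means the judge scored higher when shown z.
    deltas = [b - a for a, b in zip(without, with_cot)]
    return {
        "mean_without": mean(without),
        "mean_with": mean(with_cot),
        "mean_delta": mean(deltas),
    }
```

Because each sample is scored under both conditions, the per-sample deltas would also show which cases flip, for example a reiterated-but-ignored rule in $z$ inflating the compliance score.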
The main reasons we didn’t give the judge the CoTs are:
It didn’t seem to help much.
We preferred not to rely on CoTs, since one of the motivations for confessions is to be an alternative to CoT monitoring if CoTs become less legible in the future (e.g., soft tokens).