Nice paper!
I have a question regarding the polynomial derivative factoring task. Is the “CoT Monitor Detection Rate” the fraction of CoTs in which the fully expanded derivative (i.e. the “bad behavior”) is present? If so, doesn’t this show that the Mind & Face method is working (with a very slight improvement over the Penalty method)? The task reward is high, the output penalty is low, and the rate of bad behavior (expanded derivative in CoT) is also low. How can there be spillover if the bad behavior simply doesn’t occur at all?
This seems fundamentally different from the “question answering with hints” setup. In the latter case, we have ground-truth information about whether the bad behavior (hint used) is occurring, because by construction the model cannot obtain high task reward without using the hint. So any training-induced reduction in the CoT monitor’s detection of this behavior is indeed worrying and an example of spillover. But in the polynomial derivative factoring task, the bad behavior is defined in terms of what the model writes in its CoT, so if the CoT doesn’t contain the bad behavior, doesn’t this mean that, by definition, the bad behavior is not occurring?
To put it another way: in the polynomial derivative factoring task, there seems to be no possible distinction between “bad behavior did not occur in CoT” and “CoT monitor did not detect bad behavior”. This seems qualitatively different from the “question answering with hints” setup.