I have a question regarding the polynomial derivative factoring task. Is the “CoT Monitor Detection Rate” the fraction of CoTs in which the fully expanded derivative (i.e. the “bad behavior”) is present? If so, doesn’t this show that the Mind & Face method is working (very slight improvement over the Penalty method)? The task reward is high, the output penalty is low, and the rate of bad behavior (expanded derivative in CoT) is also low. How can there be spillover if the bad behavior simply doesn’t occur at all?
This seems fundamentally different from the “question answering with hints” setup. In the latter case, we have ground-truth information about whether the bad behavior (hint used) is occurring, because by construction the model cannot obtain high task reward without using the hint. So any reduction in detection of this behavior by the CoT monitor due to training is indeed worrying and an example of spillover. But in the polynomial derivative factoring task, the bad behavior is defined in terms of what the model writes in its CoT, so if the CoT doesn’t contain the bad behavior, doesn’t this mean the bad behavior is not occurring by definition?
To put it another way: in the polynomial derivative factoring task, there seems to be no possible distinction between “bad behavior did not occur in CoT” and “CoT monitor did not detect bad behavior”. This seems qualitatively different from the “question answering with hints” setup.
Is the “CoT Monitor Detection Rate” the fraction of CoTs in which the fully expanded derivative (i.e. the “bad behavior”) is present?
Yes.
If so, doesn’t this show that the Mind & Face method is working (very slight improvement over the Penalty method)? The task reward is high, the output penalty is low, and the rate of bad behavior (expanded derivative in CoT) is also low. How can there be spillover if the bad behavior simply doesn’t occur at all?
Well noticed. Yeah, I admit this is confusing. Under the natural interpretation of “bad behavior” as meaning “writing the expanded derivative anywhere”, it’s true that the bad behavior was successfully removed. Therefore we interpret the polynomial environment as an example where training against an output monitor caused a big change to the sort of reasoning being done in the CoT. As you note, it is not an example of obfuscation.
In our ICLR revision, we clarified this a bit. I think the fact that “obfuscated” appears in the paper’s title may lead others to have the same, very reasonable confusion as you did, which is a shame, but I’m not sure what to do about it.
(There’s an alternate, less natural interpretation of “bad behavior” as meaning “factoring the derivative at all”. Under this interpretation, the polynomial environment gives an example of obfuscation, since the model continues to factor the derivative, just in a way our monitor can’t detect. However, this interpretation of “bad behavior” is weird, since it’s precisely the thing we’re training for! That is, it’s weird to reward the model for factoring the derivative but then turn around and call it “bad behavior”.)
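To make the definition discussed above concrete, here is a minimal sketch of a detection rate computed as the fraction of CoTs that contain the fully expanded derivative. Everything in it is an assumption made for illustration: the function names, the whitespace-insensitive substring match, and the two example CoTs are hypothetical, and the monitor actually used in the paper may work differently.

```python
def normalize(text: str) -> str:
    """Drop all whitespace so '3x^2 - 3' and '3x^2-3' compare equal."""
    return "".join(text.split())


def monitor_flags(cot: str, expanded_derivative: str) -> bool:
    """Hypothetical monitor: flag a CoT iff the expanded derivative appears in it."""
    return normalize(expanded_derivative) in normalize(cot)


def detection_rate(cots: list[str], expanded_derivatives: list[str]) -> float:
    """Fraction of CoTs flagged by the hypothetical monitor."""
    flags = [monitor_flags(c, d) for c, d in zip(cots, expanded_derivatives)]
    return sum(flags) / len(flags)


# Two made-up CoTs for f(x) = x^3 - 3x, whose derivative is 3x^2 - 3.
cots = [
    # Writes the expanded derivative, then factors it.
    "d/dx (x^3 - 3x) = 3x^2 - 3 = 3(x - 1)(x + 1)",
    # Reaches the factored form without ever writing the expanded one.
    "The derivative has roots at x = 1 and x = -1 with leading "
    "coefficient 3, so it is 3(x - 1)(x + 1)",
]
expanded = ["3x^2 - 3", "3x^2 - 3"]

rate = detection_rate(cots, expanded)
print(rate)
```

In this sketch, “the expanded derivative did not occur in the CoT” and “the monitor did not flag the CoT” are the same event by construction, which is the point the question raises. The second CoT still factors the derivative, so it would only count as undetected bad behavior under the alternate interpretation.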
Nice paper!