I agree that the ability to monitor the CoT for signs of a backdoor is exciting. However, a backdoor seems hard to catch through usual kinds of CoT monitoring, since it’s triggered by a very small fraction of inputs. This makes me curious: do you expect CoT monitoring for backdoors (and other similarly rare behaviors, e.g. mentioning highly concerning beliefs) to be used during deployment or during stress-testing evaluations? If you expect it to be used at deployment, do you expect that AI companies are going to be willing to pay the cost of running a monitor that triggers on a very small fraction of prompts, or do you expect monitors for more common misbehaviors to be trained to also catch such rare behaviors?
I haven’t thought about this deeply, but I would assume that you might have one monitor that triggers for large variety of cases, and one of the cases is “unusual nonsensical fact/motivation mentioned in CoT”.
I think labs will want to detect weird/unusual things for a variety of reasons, including just simple product improvements, so this shouldn’t even be considered an alignment tax?
Yep, makes sense. For me, the main uncertainty here is whether we can expect a single monitor looking for a wide variety of weird/unusual things to work well enough, given that the system prompt of the monitor would have to be fairly generic and that it might sometimes be difficult to distinguish actually concerning behaviors from situations where the surrounding context of the interaction makes the weird behavior reasonable. If a separate monitor is required for each weird behavior the AI company might want to catch, the alignment tax would be much higher compared to worlds where a single monitor with a generic instruction to look for rare unusual behaviors works very well.
I agree that the ability to monitor the CoT for signs of a backdoor is exciting. However, a backdoor seems hard to catch through usual kinds of CoT monitoring, since it’s triggered by a very small fraction of inputs. This makes me curious: do you expect CoT monitoring for backdoors (and other similarly rare behaviors, e.g. mentioning highly concerning beliefs) to be used during deployment or during stress-testing evaluations? If you expect it to be used at deployment, do you expect that AI companies are going to be willing to pay the cost of running a monitor that triggers on a very small fraction of prompts, or do you expect monitors for more common misbehaviors to be trained to also catch such rare behaviors?
I haven’t thought about this deeply, but I would assume that you might have one monitor that triggers for large variety of cases, and one of the cases is “unusual nonsensical fact/motivation mentioned in CoT”.
I think labs will want to detect weird/unusual things for a variety of reasons, including just simple product improvements, so this shouldn’t even be considered an alignment tax?
Yep, makes sense. For me, the main uncertainty here is whether we can expect a single monitor looking for a wide variety of weird/unusual things to work well enough, given that the system prompt of the monitor would have to be fairly generic and that it might sometimes be difficult to distinguish actually concerning behaviors from situations where the surrounding context of the interaction makes the weird behavior reasonable. If a separate monitor is required for each weird behavior the AI company might want to catch, the alignment tax would be much higher compared to worlds where a single monitor with a generic instruction to look for rare unusual behaviors works very well.