I haven’t thought about this deeply, but I would assume that you might have one monitor that triggers for a large variety of cases, one of which is “unusual nonsensical fact/motivation mentioned in CoT”.
I think labs will want to detect weird/unusual things for a variety of reasons, including just simple product improvements, so this shouldn’t even be considered an alignment tax?
Yep, makes sense. For me, the main uncertainty is whether a single monitor looking for a wide variety of weird/unusual things can work well enough. Its system prompt would have to be fairly generic, and it might sometimes be difficult to distinguish actually concerning behaviors from situations where the surrounding context of the interaction makes the weird behavior reasonable. If a separate monitor is required for each weird behavior the AI company might want to catch, the alignment tax would be much higher than in worlds where a single monitor with a generic instruction to look for rare unusual behaviors works very well.