Rauno Arike comments on Backdoor awareness and misaligned personas in reasoning models

Rauno Arike 24 Jun 2025 14:12 UTC
1 point
0
Yep, makes sense. For me, the main uncertainty here is whether we can expect a single monitor looking for a wide variety of weird/unusual things to work well enough, given that the system prompt of the monitor would have to be fairly generic and that it might sometimes be difficult to distinguish actually concerning behaviors from situations where the surrounding context of the interaction makes the weird behavior reasonable. If a separate monitor is required for each weird behavior the AI company might want to catch, the alignment tax would be much higher compared to worlds where a single monitor with a generic instruction to look for rare unusual behaviors works very well.