“When something quite bad can occur in a small number of actions (potentially even in a single trajectory) or in a short period of time, it’s important to sometimes block suspicious looking actions until they are reviewed.”
Why? I don't think the categories "single actions that are catastrophic, i.e. high-stakes failures" and "actions that require monitoring and blocking in the moment" overlap perfectly. One example where they come apart is "producing a batch of data that would cause the next generation of models to be misaligned": this might be a high-stakes failure, but you have plenty of time to catch it before you actually train that model. Another example would be "writing code that is catastrophically insecure but only gets run rarely".
Could the primary cause instead be (as Tim Hua notes) that Anthropic has trained past models to have reward-model-pleasing chains of thought?