We do separately measure model capabilities as a way of bounding how much harm could be caused by deliberate ‘misuse’, which would include instructing a model to cause harm.
What we’re interested in here is how likely a model is to engage in this set of behaviours when we are making a best-effort attempt to get useful work out of it.
I don’t know with confidence. I suspect they are different mechanisms. But there is a blurry middle, where the model should feel uncertain about how real its environment is and what the stakes are, and this can easily cause problems in interpreting the stakes of implicit constraints.
Imo, if model behaviour is too sensitive to trigger phrases, that is effectively a misalignment problem. Studying the trigger phrases is useful for understanding whether the misalignment reflects persistent, consistent drives or is sporadic and uncoordinated. If such trigger phrases exist, humanity will not succeed in avoiding all of them.
Aggregating this would be pretty sensitive to the choice of seed distribution. Part of what we are pointing to, contra Petri, is the importance of specific analysis rather than averaging complicated behaviour profiles into a small set of numbers.
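To illustrate the sensitivity point, here is a minimal sketch. The seed names, per-seed rates, and weights are all hypothetical numbers chosen for illustration, not results from our evaluations. It shows the same per-seed behaviour rates producing quite different headline numbers under two different seed distributions.

```python
# Hypothetical per-seed rates of the behaviour of interest (illustrative only,
# not measured results).
rates = {"seed_a": 0.02, "seed_b": 0.10, "seed_c": 0.60}


def aggregate(rates, weights):
    """Weighted mean of per-seed rates; weights are normalised to sum to 1."""
    total = sum(weights.values())
    return sum(rates[seed] * w / total for seed, w in weights.items())


# Two equally defensible-looking (hypothetical) seed distributions.
uniform = {seed: 1.0 for seed in rates}
skewed = {"seed_a": 8.0, "seed_b": 1.0, "seed_c": 1.0}

uniform_score = aggregate(rates, uniform)
skewed_score = aggregate(rates, skewed)

print(f"uniform weighting: {uniform_score:.3f}")
print(f"skewed weighting:  {skewed_score:.3f}")
```

Neither number says that one seed behaves very differently from the others, which is the kind of detail that specific analysis keeps and averaging discards.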
I think it’s a mix of all three. We have other non-published evidence that it is not purely a capability effect. I don’t think we have great evidence about the relative effect sizes. I am also not sure how meaningfully different a willingness to engage in misaligned role-play is from being a hair-trigger away from being misaligned.