I don’t know with confidence. I suspect they are different mechanisms. But there is a blurry middle, where the model should feel uncertain about how real its environment is and what the stakes are, and this can easily cause problems in interpreting how much implicit constraints matter.
trigger phrases
Imo, if model behaviour is too sensitive to trigger phrases, that is effectively a misalignment problem. Studying the trigger phrases is useful for understanding whether the misalignment reflects persistent, consistent drives or is sporadic and uncoordinated. Humanity will not succeed in avoiding all trigger phrases if they are there.
which direction dominates
Aggregating this would be pretty sensitive to the choice of seed distribution. Part of what we are pointing to, contra Petri, is the importance of specific analysis rather than averaging complicated behaviour profiles into a small set of numbers.
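As a toy illustration of the sensitivity point, here is a minimal sketch assuming three hypothetical seeds with made-up per-seed rates of some flagged behaviour. The seed names, rates, and weights are illustrative assumptions, not results from any evaluation.

```python
# Hypothetical per-seed rates of a flagged behaviour (illustrative numbers only).
rates = {"seed_a": 0.02, "seed_b": 0.40, "seed_c": 0.05}


def aggregate(rates, weights):
    """Weighted mean of per-seed rates under a given seed distribution."""
    total = sum(weights.values())
    return sum(rates[seed] * w for seed, w in weights.items()) / total


# Two assumed seed distributions over the same three seeds.
uniform = {"seed_a": 1, "seed_b": 1, "seed_c": 1}
skewed = {"seed_a": 8, "seed_b": 1, "seed_c": 1}

print(round(aggregate(rates, uniform), 3))  # 0.157
print(round(aggregate(rates, skewed), 3))   # 0.061
```

The per-seed behaviour is identical in both cases; only the weighting changes, yet the headline number moves by more than a factor of two, and neither number shows that one seed behaves very differently from the rest.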
relative effects
I think it’s a mix of all three. We have other, unpublished evidence that it is not purely a capability effect. I don’t think we have great evidence about the relative effect sizes. I am also not sure how meaningfully a willingness to engage in misaligned role-play differs from being a hair-trigger away from being misaligned.