I’d be willing to bet that it could not be understood by Claude 3.5 Sonnet given the fact that we as human researchers constantly spent a ton of time debating “what the model meant” or whether results would be externally legible. If the claim is weakly “you can generally tell across some huge CoT what overall topic the model was thinking about” then this might be true, but I don’t think represents meaningful control in high stakes settings like “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want” since “thinking about the topic” is expected in these cases. I’d be interested in ideas on how to operationalize this though!
I’d be happy to take the other side of that bet. Prediction 3 tries to operationalize some version of it, though maybe it is weaker than what we want (e.g. maybe we want a bet that is more directly about CoT monitoring). For CoT monitoring, I expect a downlift on borderline cases, but my guess is that for things that are more central examples a Sonnet 3.5 monitor would succeed at “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want”. For example, I think it would be very unlikely that an o3-like AI could be as misaligned as this misalignment model organism and that Sonnet 3.5 would miss it.
I’d note that you seem to be taking lack of verbalized training gaming as evidence that scheming is unlikely. I disagree that there’s a lack of training gaming
I think you are getting at something real, but I still think there is something about the model not training gaming in the ways that seem most correlated with scheming. I agree that current models often “really want to solve the task” (frontier models have this property even without CoT), in a way that is related to training gaming, and this is probably the cause of models doing things like reasoning about the monitors in covert rule violation envs. But it feels to me that this training gaming doesn’t have the situational awareness and systematicity that would update me towards future AIs being more likely to reason about AI developers’ long-term objectives and about RL dynamics in environments that are not about these things, such that these AIs would do things like fake alignment when asked to point out code vulnerabilities or sandbag in debates about promising AI safety research directions. But I am somewhat uncertain about this. I think there is something in this half-training-gaming that feels much less correlated with scheming than full-blown situationally aware training gaming, but I would feel more confident about the argument if I actually had a better understanding of the different flavors of training gaming and which current LLMs display which flavors in which ways.
I’d be happy to take the other side of that bet. Prediction 3 tries to operationalize some version of it, though maybe it is weaker than what we want (e.g. maybe we want a bet that is more directly about CoT monitoring). For CoT monitoring, I expect a downlift on borderline cases, but my guess is that for things that are more central examples a Sonnet 3.5 monitor would succeed at “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want”. For example, I think it would be very unlikely that an o3-like AI could be as misaligned as this misalignment model organism and that Sonnet 3.5 would miss it.
I think you are getting at something real, but I still think there is something about the model not training gaming in the ways that seem most correlated with scheming. I agree that current models often “really want to solve the task” (frontier models have this property even without CoT), in a way that is related to training gaming, and this is probably the cause of models doing things like reasoning about the monitors in covert rule violation envs. But it feels to me that this training gaming doesn’t have the situational awareness and systematicity that would update me towards future AIs being more likely to reason about AI developers’ long-term objectives and about RL dynamics in environments that are not about these things, such that these AIs would do things like fake alignment when asked to point out code vulnerabilities or sandbag in debates about promising AI safety research directions. But I am somewhat uncertain about this. I think there is something in this half-training-gaming that feels much less correlated with scheming than full-blown situationally aware training gaming, but I would feel more confident about the argument if I actually had a better understanding of the different flavors of training gaming and which current LLMs display which flavors in which ways.