I hadn’t noticed that there’d be any reason for people to claim Claude 3.7 Sonnet was “misaligned”, even though I use it frequently and have seen some versions of the behavior in question. It seems to me like… it’s often trying to find the “easy way” to do whatever it’s trying to do. When it decides something is “hard”, it backs off from that line of attack. It backs off when it decides a line of attack is wrong, too. Actually, I think “hard” might be a kind of wrong in its ontology of reasoning steps.
This is a reasoning strategy that needs to be applied carefully. Sometimes it works; one really should use the easy way rather than the hard way, if the easy way works and is easier. But sometimes the hard part is the core of the problem and one needs to just tackle it. I’ve been thinking of 3.7′s failure to tackle the hard part as a lack of in-practice capabilities, specifically the capability to notice “hey, this time I really do need to do it the hard way to do what the user asked” and just attempt the hard way.
Having read this post, I can see the other side of the coin. 3.7′s RL probably heavily incentivizes it to produce an answer / solution / whatever the user wanted done. Or at least something that appears to be what the user wanted, as far as it can tell. Such as (in a fairly extreme case) hard coding to “pass” unit tests.
I wouldn’t read too much into deceiving or lying to cover up in this case. That’s what practically any human who had chosen to clearly cheat would do in the same situation, at least until confronted. The decision to cheat in the first place is straightforwardly misaligned though. But I still can’t help thinking it’s downstream of a capabilities failure, and this particular kind of misalignment will naturally disappear once the model is smart enough to just do the thing, instead. (Which is not, of course, to say we won’t see other kinds of misalignment, or that those won’t be even more problematic.)
I hadn’t noticed that there’d be any reason for people to claim Claude 3.7 Sonnet was “misaligned”, even though I use it frequently and have seen some versions of the behavior in question. It seems to me like… it’s often trying to find the “easy way” to do whatever it’s trying to do. When it decides something is “hard”, it backs off from that line of attack. It backs off when it decides a line of attack is wrong, too. Actually, I think “hard” might be a kind of wrong in its ontology of reasoning steps.
This is a reasoning strategy that needs to be applied carefully. Sometimes it works; one really should use the easy way rather than the hard way, if the easy way works and is easier. But sometimes the hard part is the core of the problem and one needs to just tackle it. I’ve been thinking of 3.7′s failure to tackle the hard part as a lack of in-practice capabilities, specifically the capability to notice “hey, this time I really do need to do it the hard way to do what the user asked” and just attempt the hard way.
Having read this post, I can see the other side of the coin. 3.7′s RL probably heavily incentivizes it to produce an answer / solution / whatever the user wanted done. Or at least something that appears to be what the user wanted, as far as it can tell. Such as (in a fairly extreme case) hard coding to “pass” unit tests.
I wouldn’t read too much into deceiving or lying to cover up in this case. That’s what practically any human who had chosen to clearly cheat would do in the same situation, at least until confronted. The decision to cheat in the first place is straightforwardly misaligned though. But I still can’t help thinking it’s downstream of a capabilities failure, and this particular kind of misalignment will naturally disappear once the model is smart enough to just do the thing, instead. (Which is not, of course, to say we won’t see other kinds of misalignment, or that those won’t be even more problematic.)