I can’t actually think of that many cases of humans failing at aligning existing systems because the problem is too technically hard.
You’re probably already tracking this, but the biggest cases of “alignment was actually pretty tricky” I’m aware of are:
Recent systems doing egregious reward hacking in some cases (including o3, 3.7 Sonnet, and Opus 4). This problem has gotten better recently (and I currently expect it to mostly get better over time, prior to superhuman capabilities), but AI companies knew about it before release and couldn’t solve it quickly enough to avoid deploying models with this property. And note this is pretty costly to consumers!
There are a bunch of aspects of current AI propensities which are undesired, and AI companies don’t know how to reliably fix these in a way that will actually generalize to similar problems. For instance, see the model card for Opus 4, which describes the model doing a bunch of undesired stuff that Anthropic doesn’t want but also can’t easily avoid except by patching it non-robustly (because they don’t necessarily know exactly what causes the issue).
To be clear, none of these are cases where alignment was extremely hard, though I think it might be extremely hard to consistently avoid all alignment problems of this rough character before release. It’s unclear whether this sort of thing is a good analogy for the kind of misalignment in future models that would be catastrophic.
Yeah, I was thinking of reward hacking as another example of a problem we could solve if we tried but that companies aren’t prioritizing, which isn’t a huge deal at the moment but could be very bad if the AIs were much smarter and more power-seeking.
Stepping back, there’s a worldview where any weird, undesired behavior, no matter how minor, is scary because we need to get alignment perfectly right; and another where we should worry about scheming, deception, and related behaviors but it’s not a big deal (at least safety-wise) if the model misunderstands our instructions in bizarre ways. Either of these can be justified, but this discussion could probably use more clarity about which one each of us is coming from.