I don’t think this is the core reason that alignment is hard—even if we had access to a bunch of evidence about AGI misbehavior now, I think it’d be very hard to convert that into a solution for alignment.
If I imagine that we magically had a boxing setup which let us experiment with powerful AGI alignment without dying, I do agree it would still be hard to solve alignment. But it wouldn’t be harder than the core problems of any other field of science/engineering. It wouldn’t be unusually hard, by the standards of technical research.
Of course, “empirical evidence of power-seeking behavior” is a lot weaker than a magical box. With only that level of empirical evidence, most of the “no empirical feedback” problem would still be present. More on that next.
Nor do I believe we’ll see no empirical evidence of power-seeking behavior before it’s too late (and I think opinions amongst alignment researchers are pretty divided on this question).
The key “lack of empirical feedback” property in Goodhart, deceptive alignment, hard left turn, get what you measure, etc, is this: for any given AI, it will look fine early on (e.g. in training or when optimization power is low) and then things will fall apart later on. If we are lucky enough to be in a very-slow-takeoff world, then an empirically-minded person might still notice that their AIs keep falling apart in deployment, and conclude that alignment is a problem. I don’t put very high probability on that (partly because of the very-slow-takeoff assumption and partly because scenarios like getting what we measure don’t necessarily look like a problem with the AI), but I buy it as a basically-plausible story.
But that doesn’t really change the problem that much, for multiple reasons (any one of which is sufficient):
Even if we put only a low probability on not getting a warning shot, we probably don’t want to pursue a strategy in which humanity goes extinct if we don’t get a fire alarm. Thinking we’ll probably get a warning shot makes sense to me; relying on a warning shot while plowing ahead building AGI is idiotic. The downside is far too large for the risk to make sense unless we are unrealistically confident that there will definitely be a warning shot.
Training AI in ways which will obviously incentivize it to hide problems (e.g. RLHF), and therefore make a warning shot less likely, is similarly foolish even if we think we’ll probably get a warning shot.
The failure mode I actually think is most likely: we do get a warning shot, and then people try to train away the problems until the problems cease to be visible. And that fails because of Goodhart, deception, hard left turn, getting what you measure, etc.
Psychologizing a bit: I suspect both RLHF and reliance on warning shots are symptoms of a more general cognitive pattern where people just don’t believe in anything they can’t see, and “iterate until we can’t see any problem” is very much the sort of strategy I expect such people to use. (I believe it’s also the strategy suggested by OP’s phrase “empirical methods that rely on capabilities”.) It’s not just about “not getting empirical evidence” in terms of a warning shot, it’s about not getting empirical evidence about alignment of any given powerful AGI until it’s too late, and that problem interacts very poorly with a mindset where people iterate a lot and don’t believe in problems they can’t see.
That’s the sort of problem which doesn’t apply in most scientific/engineering fields. In most fields, “iterate until we can’t see any problem” is a totally reasonable strategy. Alignment as a field is unusually hard because we can’t use that strategy; the core failure modes we’re worried about all involve problems which aren’t visible on the AI at hand until it’s too late.