Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late.
I don’t think this is the core reason that alignment is hard—even if we had access to a bunch of evidence about AGI misbehavior now, I think it’d still be hard to convert that into a solution for alignment. Nor do I believe we’ll see no empirical evidence of power-seeking behavior before it’s too late (and I think opinions amongst alignment researchers are pretty divided on this question).
I don’t think this is the core reason that alignment is hard—even if we had access to a bunch of evidence about AGI misbehavior now, I think it’d be very hard to convert that into a solution for alignment.
If I imagine that we magically had a boxing setup which let us experiment with powerful AGI alignment without dying, I do agree it would still be hard to solve alignment. But it wouldn’t be harder than the core problems of any other field of science/engineering. It wouldn’t be unusually hard, by the standards of technical research.
Of course, “empirical evidence of power-seeking behavior” is a lot weaker than a magical box. With only that level of empirical evidence, most of the “no empirical feedback” problem would still be present. More on that next.
Nor do I believe we’ll see no empirical evidence of power-seeking behavior before it’s too late (and I think opinions amongst alignment researchers are pretty divided on this question).
The key “lack of empirical feedback” property in Goodhart, deceptive alignment, hard left turn, get what you measure, etc, is this: for any given AI, it will look fine early on (e.g. in training or when optimization power is low) and then things will fall apart later on. If we are lucky enough to be in a very-slow-takeoff world, then an empirically-minded person might still notice that their AIs keep falling apart in deployment, and conclude that alignment is a problem. I don’t put very high probability on that (partly because of the very-slow-takeoff assumption and partly because scenarios like getting what we measure don’t necessarily look like a problem with the AI), but I buy it as a basically-plausible story.
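The "looks fine early, falls apart later" dynamic can be made concrete with a toy Goodhart simulation (a minimal sketch; the distributions and parameters here are arbitrary illustrative choices, not anything from the discussion). Candidates have an unobserved true value and a noisy, heavy-tailed proxy score; we select the best-scoring candidate under increasing optimization pressure:

```python
import random
import statistics

random.seed(0)

def true_value():
    """The thing we actually care about: unobserved 'true' quality."""
    return random.gauss(0, 1)

def proxy(u):
    """A noisy, heavy-tailed measurement of true quality. The rare
    large overestimates are exactly what strong optimization selects."""
    noise = random.gauss(0, 1)
    if random.random() < 0.01:
        noise += random.expovariate(0.2)  # rare large overestimate (mean 5)
    return u + noise

def optimize(n_candidates, trials=100):
    """Pick the best-looking candidate (by proxy) out of n_candidates,
    then report the average proxy score and average true value of what
    actually got picked, across many independent trials."""
    picked = []
    for _ in range(trials):
        scored = [(u, proxy(u)) for u in
                  (true_value() for _ in range(n_candidates))]
        picked.append(max(scored, key=lambda t: t[1]))
    return (statistics.mean(p for _, p in picked),
            statistics.mean(u for u, _ in picked))

for n in (1, 10, 100, 1000, 10000):
    proxy_avg, true_avg = optimize(n)
    print(f"candidates searched: {n:>6}  "
          f"proxy score: {proxy_avg:5.2f}  true value: {true_avg:5.2f}")
```

Under light selection the proxy and the true value move together; as optimization power grows, the proxy score keeps climbing while the average true value of the selected candidate stalls, because extreme proxy scores are increasingly dominated by measurement error rather than genuine quality. That is the sense in which things "look fine early on" and then fall apart.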
But that doesn’t really change the problem that much, for multiple reasons (any one of which is sufficient):
Even if we put only a low probability on not getting a warning shot, we probably don’t want to pursue a strategy in which humanity goes extinct if we don’t get a fire alarm. Thinking we’ll probably get a warning shot makes sense to me; relying on a warning shot while plowing ahead building AGI is idiotic. The downside is far too large for the risk to make sense unless we are unrealistically confident that there will definitely be a warning shot.
Training AI in ways which will obviously incentivize it to hide problems (i.e. RLHF), and therefore make a warning shot less likely, is similarly foolish even if we think we’ll probably get a warning shot.
The failure mode I actually think is most likely: we do get a warning shot, and then people try to train away the problems until the problems cease to be visible. And that fails because of Goodhart, deception, hard left turn, getting what you measure, etc.
Psychologizing a bit: I suspect both RLHF and reliance on warning shots are symptoms of a more general cognitive pattern where people just don’t believe in anything they can’t see, and “iterate until we can’t see any problem” is very much the sort of strategy I expect such people to use. (I believe it’s also the strategy suggested by OP’s phrase “empirical methods that rely on capabilities”.) It’s not just about “not getting empirical evidence” in terms of a warning shot, it’s about not getting empirical evidence about alignment of any given powerful AGI until it’s too late, and that problem interacts very poorly with a mindset where people iterate a lot and don’t believe in problems they can’t see.
That’s the sort of problem which doesn’t apply in most scientific/engineering fields. In most fields, “iterate until we can’t see any problem” is a totally reasonable strategy. Alignment as a field is unusually hard because we can’t use that strategy; the core failure modes we’re worried about all involve problems which aren’t visible until later on, for the AI at hand.
Huh, I thought you agreed with statements like “if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier”.
My model is that John is talking about “evidence on whether an AI alignment solution is sufficient”, and you understood him to say “evidence on whether the AI Alignment problem is real/difficult”. My guess is you both agree on the former, but I am not confident.
Huh, I thought you agreed with statements like “if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier”.
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn’t help you very much if you have no idea how rockets work).
I don’t really know what “reliable empirical feedback” means in this context—if you have sufficiently reliable feedback mechanisms, then you’ve solved most of the alignment problem. But, out of the things John listed:
Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, “getting what you measure” in slow takeoff
I expect that we’ll observe a bunch of empirical examples of each of these things happening (except for the hard takeoff phase change), and not know how to fix them.
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn’t help you very much if you have no idea how rockets work).
I do really feel like it would have been extremely hard to build rockets if we had to get it right on the very first try.
I think that for rockets, the fact that experimentation is so costly explains the majority of the difficulty of rocket engineering. I agree you also have very little chance of building a successful space rocket without a good understanding of Newtonian mechanics and some aspects of relativity. But if I could just launch a rocket every day without bad consequences, I am pretty sure I wouldn’t need a deep understanding of either of those, or would easily figure out the relevant bits as I kept experimenting.
The reason rocket science relies so much on solid theoretical models is that we have to get things right in only a few shots. I don’t think you really needed any particularly good theory to build trains, for example; just a lot of attempts and tinkering.
At a sufficiently high level of abstraction, I agree that “cost of experimenting” could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like “our inability to coordinate as a civilization” or “the power of intelligence” or “a lack of interpretability”, etc. Given this, John’s comment seemed like mainly rhetorical flourishing rather than a contentful claim about the structure of the difficult parts of the alignment problem.
Also, I think that “on our first try” thing isn’t a great framing, because there are always precursors (e.g. we landed a man on the moon “on our first try” but also had plenty of tries at something kinda similar). Then the question is how similar, and how relevant, the precursors are—something where I expect our differing attitudes about the value of empiricism to be the key crux.
Well, you could probably build a rocket that looks like it works, anyway. Could you build one you would want to try to travel to the moon in? (Are you imagining you get to fly in these rockets, or just launch and watch from the ground? I was imagining the second.)
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn’t help you very much if you have no idea how rockets work).
I basically buy that argument, though I do still think lack of shots is the main factor which makes alignment harder than most other technical fields in their preparadigmatic stage.
“Harder” can have two meanings: “the program (of design, and the proof) is longer” and “the program is less likely to be generated in the real world”. These meanings are correlated, but not identical.