Planned summary for the Alignment Newsletter:

We can think of alignment as roughly being decomposed into two “gaps” that we are trying to reduce:
1. The gap between proposed theoretical alignment approaches (such as iterated amplification) and what we might do without such techniques (aka the <@unaligned benchmark@>(@An unaligned benchmark@)).
2. The gap between actual implementations of alignment approaches and what those approaches are theoretically capable of.
(This distinction is fuzzy. For example, the author puts “the technique can’t answer NP-hard questions” into the second gap while I would have had it in the first gap.)
We can think of some disagreements in AI alignment as different pictures about how these gaps look:
1. A stereotypical “ML-flavored alignment researcher” thinks that the first gap is very small, because in practice the model will generalize appropriately to new, more complex situations, and continue to do what we want. Such people would then be more focused on narrowing the second gap, by working on practical implementations.
2. A stereotypical “MIRI-flavored alignment researcher” thinks that the first gap is huge, such that it doesn’t really matter if you narrow the second gap, because even if you reduced that gap to zero you would still be doomed with near certainty.
Other notes:
I think some people think that amplified humans are actually just as capable as the unaligned benchmark. I think this is basically the factored cognition hypothesis.
I don’t see how this is enough. Even if this were true, vanilla iterated amplification would only give you an average-case / on-distribution guarantee. Your alignment technique also needs to come with a worst-case / off-distribution guarantee. (Another way of thinking about this is that it needs to deal with potential inner alignment failures.)
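As a concrete (entirely made-up) illustration of this distinction, here is a minimal sketch: a toy linear model that gets essentially zero average error on its training distribution, which is the only thing an on-distribution guarantee speaks to, while a single off-distribution input breaks it. The features, data, and numbers below are all invented for illustration and are not from the original discussion.

```python
# Toy sketch: average-case / on-distribution performance gives no
# worst-case / off-distribution guarantee.
import numpy as np

rng = np.random.default_rng(0)

# Training distribution: a "spurious" feature x2 always equals the causal
# feature x1, and the label is y = x1.
x1 = rng.normal(size=1000)
X_train = np.stack([x1, x1], axis=1)   # columns [x1, x2], with x2 == x1
y_train = x1

# Least squares on this rank-deficient design returns the minimum-norm
# solution, which spreads weight across both features (roughly [0.5, 0.5]).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Average-case / on-distribution: essentially zero error.
print("on-distribution MSE:", np.mean((X_train @ w - y_train) ** 2))

# Worst-case / off-distribution: an "exotic" input where the correlation
# between x1 and x2 breaks. The prediction is ~0 while the true label is 1.
x_exotic = np.array([1.0, -1.0])
print("off-distribution prediction:", x_exotic @ w, "vs true label:", 1.0)
```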
By the time we get to AGI, will we have alignment techniques that are even slightly competitive? I think it’s pretty plausible the answer is no.
I’m not totally sure what you mean here by “alignment techniques”. Is this supposed to be “techniques that we can justify will be intent aligned in all situations”, or perhaps “techniques that empirically turn out to be intent aligned in all situations”? If so, I agree that the answer is plausibly (even probably) no.
But what we’ll actually do is use some more capable technique that we don’t justifiably know is intent aligned, and that (probably) wouldn’t be intent aligned in some exotic circumstances. But it still seems plausible that in practice we never hit those exotic circumstances (because those exotic circumstances never happen, or because we’ve retrained the model before we get to the exotic circumstances, etc), and it’s intent aligned in all the circumstances the model actually encounters.
I think I’m like 30% on the proposition that before AGI, we’re going to come up with some alignment scheme that just looks really good and clearly solves most of the problems with current schemes.
Fwiw, if you mostly mean something that resolves the gap between the unaligned benchmark and theoretical approaches, without relying on empirical generalization / contingent empirical facts we learn from experiments, and you require it to solve abstract problems like this, I feel more pessimistic, maybe 20% (which comes from starting at ~10% and then updating on “well, if I had done this same reasoning in 2010 I think I would have been too pessimistic about the progress made since then”).
I agree with pretty much this whole comment, but do have one question:
But it still seems plausible that in practice we never hit those exotic circumstances (because those exotic circumstances never happen, or because we’ve retrained the model before we get to the exotic circumstances, etc), and it’s intent aligned in all the circumstances the model actually encounters.
Given that this is conditioned on us getting to AGI, wouldn’t the intuition here be that pretty much all the most valuable things such a system would do would fall under “exotic circumstances” with respect to any realistic training distribution? I might be assuming too much in saying that; e.g., I’m taking it for granted that anything we’d call an AGI could self-improve to the point of accessing states of the world that we wouldn’t be able to train it on, and I’m also assuming that the highest-reward states would probably be these exotic / hard-to-access ones. But both of those do seem (to me) like they’d be the default expectation.
Or maybe you mean it seems plausible that, even under those exotic circumstances, an AGI may still be able to correctly infer our intent, and be incentivized to act in alignment with it?
There are lots and lots of exotic circumstances. We might get into a nuclear war. We might invent time travel. We might become digital uploads. We might decide democracy was a bad idea.
I agree that AGI will create exotic circumstances. But not all exotic circumstances will be created by AGI. I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren’t the ones that are actually created by AGI.
Got it, thanks!

I find it plausible that the AI systems fail in only a special few exotic circumstances, which aren’t the ones that are actually created by AGI.
This helps, and I think it’s the part I don’t currently have a great intuition for. My best attempt at steel-manning would be something like: “It’s plausible that an AGI will generalize correctly to distributions which it is itself responsible for bringing about.” (Where “correctly” here means “in a way that’s consistent with its builders’ wishes.”) And you could plausibly argue that an AGI would have a tendency to not induce distributions that it didn’t expect it would generalize correctly on, though I’m not sure if that’s the specific mechanism you had in mind.
It’s nothing quite so detailed as that. It’s more like “maybe in the exotic circumstances we actually encounter, the objective does generalize, but also maybe not; there isn’t a strong reason to expect one over the other”. (Which is why I only say it is plausible that the AI system works fine, rather than probable.)
You might think that the default expectation is that AI systems don’t generalize. But in the world where we’ve gotten an existential catastrophe, we know that the capabilities generalized to the exotic circumstance; it seems like whatever made the capabilities generalize could also make the objective generalize in that exotic circumstance.
I see. Okay, I definitely agree that makes sense under the “fails to generalize” risk model. Thanks Rohin!