When I’ve tried to talk to alignment pollyannists about the “leap of death” / “failure under load” / “first critical try”, their first rejoinder is usually to deny that any such thing exists, because we can test in advance; they are denying the basic leap of required OOD generalization from failure-is-observable systems to failure-kills-the-observer systems.
I’m sure that some people have that rejoinder. I think more thoughtful people generally understand this point fine. [1] A few examples other than Buck:
Paul:
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.
Rohin (in the comments of Paul’s post):
I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I’d have actively agreed with the claims. [Followed by some exceptions that don’t include the “first critical try” thing.]
Joe Carlsmith grants “first critical try” as one of the core difficulties in How might we solve the alignment problem:
Generalization with no room for mistakes: you can’t safely test on the scenarios you actually care about (i.e., ones where the AI has a genuine takeover option), so your approach needs to generalize well to such scenarios on the first critical try (and the second, the third, etc).
He also talks about it more in-depth in On first critical tries in AI alignment.
Also Holden on the King Lear problem (and other problems) here.
TBC, I wouldn’t describe any of these people as “alignment pollyannists”, but I think they all have lower p(AI takeover) than Buck, so if you’re treating him as one then I guess you must think these count too.
To argue against an idea honestly, you should argue against the best arguments of the strongest advocates. Arguing against weaker advocates proves nothing, because even the strongest idea will attract weak advocates.
[1] If this comes as a surprise, then I think you’ve been arguing with the wrong people.