When I’ve tried to talk to alignment pollyannists about the “leap of death” / “failure under load” / “first critical try”, their first rejoinder is usually to deny that any such thing exists, because we can test in advance; they are denying the basic leap of required OOD generalization from failure-is-observable systems to failure-kills-the-observer systems.
I’m sure that some people have that rejoinder. I think more thoughtful people generally understand this point fine. [1] A few examples other than Buck:
Paul:
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.
Rohin (in the comments of Paul’s post):
I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I’d have actively agreed with the claims. [Followed by some exceptions that don’t include the “first critical try” thing.]
Joe Carlsmith grants “first critical try” as one of the core difficulties in How might we solve the alignment problem:
Generalization with no room for mistakes: you can’t safely test on the scenarios you actually care about (i.e., ones where the AI has a genuine takeover option), so your approach needs to generalize well to such scenarios on the first critical try (and the second, the third, etc).
He also talks about it more in-depth in On first critical tries in AI alignment.
Also Holden on the King Lear problem (and other problems) here.
TBC, I wouldn’t describe any of these people as “alignment pollyannists”, but I think they all have lower p(AI takeover) than Buck, so if you’re treating him as one then I guess you must think these count too.
To argue against an idea honestly, you should argue against the best arguments of the strongest advocates. Arguing against weaker advocates proves nothing, because even the strongest idea will attract weak advocates.
[1] If this comes as a surprise, then I think you’ve been arguing with the wrong people.