Thanks. Yeah I think the timelines were also a bit too aggressive, but overall things won’t look thaaat different in 2029 (my current median) or 2032 (the aggregate median of the rest of my team).
I think maybe my main disagreement with you has to do with the claim about making each generation of AIs more aligned than the previous one. A very important point, I think, is that we don't have perfect evals for alignment, and probably won't for some time. That is, our eval suites will catch some kinds of misalignment but not others, so there will probably continue to be misalignments, including very major ones, that we don't catch until it's too late. It's therefore unclear whether our AIs will actually be improving in alignment over time. They'll probably be improving in apparent alignment, but who knows what's happening with the kinds of misalignment we can't effectively test for; those could be getting worse and worse. Indeed, we have some reason to think this will happen: AI 2027 even gives a fairly concrete model according to which the misalignments get worse over time despite things looking better and better on evals. E.g. at first you are mostly just summoning personas using prompts, which is pretty benign, but as RL scales up, that tends to get distorted and undermined by training incentives.