A lot of AI x-risk discussion is focused on worlds where iterative design fails. This makes sense, as “iterative design stops working” does in fact make problems much, much harder to solve.
Maybe that’s true on LW, but it seems way less true of lab alignment teams, Open Phil, and safety researchers in general.
Also, I think it’s worth noting the distinction between two different cases:
1. Iterative design against the problems you actually see in production fails.
2. Iterative design against carefully constructed test beds fails to result in safety in practice (e.g., iterating against AI control test beds, model organisms, sandwiching setups, and other test beds).

See also this quote from Paul from here:

Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter. Solving a scientific problem without being able to learn from experiments and failures is incredibly hard. But we will be able to learn a lot about alignment from experiments and trial and error; I think we can get a lot of feedback about what works and deploy more traditional R&D methodology. We have toy models of alignment failures, we have standards for interpretability that we can’t yet meet, and we have theoretical questions we can’t yet answer. The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones, and so it’s possible for us to push ahead and build AGI without solving alignment. Overall this consideration seems like it makes the institutional problem vastly harder, but does not have such a large effect on the scientific problem.

The quote from Paul sounds about right to me, with the caveat that I think it’s pretty likely that there won’t be a single try that is “the critical try”: something like this (also by Paul) seems pretty plausible to me, and it is in cases like that that I particularly expect existing but imperfect tooling for interpreting and steering ML models to be useful.