A lot of AI x-risk discussion is focused on worlds where iterative design fails. This makes sense, as “iterative design stops working” does in fact make problems much much harder to solve.
However, I think that even in the worlds where iterative design fails for safely creating an entire AGI, the worlds where we succeed will be ones in which we were able to do iterative design on the components that make up a safe AGI, and also able to do iterative design on the boundaries between subsystems, with the dangerous parts mocked out.
I am not optimistic about approaches that look like “do a bunch of math and philosophy to try to become less confused without interacting with the real world, and only then try to interact with the real world using your newfound knowledge”.
For the most part, I don’t think it’s a problem if people work on the math / philosophy approaches. However, to the extent that people want to stop people from doing empirical safety research on ML systems as they actually are in practice, I think that’s trading off a very marginal increase in the odds of success in worlds where iterative design could never work against a quite substantial decrease in the odds of success in worlds where iterative design could work. I am particularly thinking of things like interpretability / RLHF / constitutional AI as things which help a lot in worlds where iterative design could succeed.
A lot of AI x-risk discussion is focused on worlds where iterative design fails. This makes sense, as “iterative design stops working” does in fact make problems much much harder to solve.
Maybe on LW, but this seems way less true for lab alignment teams, Open Phil, and safety researchers in general.
Also, I think it’s worth noting the distinction between two different cases:
Iterative design against the problems you actually see in production fails.
Iterative design against carefully constructed test beds fails to result in safety in practice (e.g. iterating against AI control test beds, model organisms, sandwiching setups, and other test beds).
See also this quote from Paul from here:
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter. Solving a scientific problem without being able to learn from experiments and failures is incredibly hard. But we will be able to learn a lot about alignment from experiments and trial and error; I think we can get a lot of feedback about what works and deploy more traditional R&D methodology. We have toy models of alignment failures, we have standards for interpretability that we can’t yet meet, and we have theoretical questions we can’t yet answer. The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones, and so it’s possible for us to push ahead and build AGI without solving alignment. Overall this consideration seems like it makes the institutional problem vastly harder, but does not have such a large effect on the scientific problem.
The quote from Paul sounds about right to me, with the caveat that I think it’s pretty likely that there won’t be a single try that is “the critical try”: something like this (also by Paul) seems pretty plausible to me, and it is in cases like that that I particularly expect existing but imperfect tooling for interpreting and steering ML models to be useful.
However, to the extent that people want to stop people from doing empirical safety research on ML systems as they actually are in practice
Does anyone want to stop this? I think some people just contest the usefulness of improving RLHF / RLAIF / constitutional AI as safety research and also think that it has capabilities/profit externalities. E.g. see discussion here.
(I personally think this research is probably net positive, but typically not very important to advance at current margins from an altruistic perspective.)
Yes, there are a number of posts to that effect.
That said, “there exist such posts” is not really why I wrote this. The idea I really want to push back on is one that I have heard several times in IRL conversations, though I don’t know if I’ve ever seen it online. It goes like this:
There are two cars in a race. One is alignment, and one is capabilities. If the capabilities car hits the finish line first, we all die, and if the alignment car hits the finish line first, everything is good forever. Currently the capabilities car is winning. Some things, like RLHF and mechanistic interpretability research, speed up both cars. Speeding up both cars brings us closer to death, so those types of research are bad and we should focus on the types of research that only help alignment, like agent foundations. Also we should ensure that nobody else can do AI capabilities research.
Maybe almost nobody holds that set of beliefs! I am noticing now that the articles on my list arguing that prosaic alignment strategies are harmful in expectation are by a pretty short list of authors.