I think this is a very useful post that is talking about many of the right things. One question though: isn’t it only worth focusing on the worlds where iterative design does not work for alignment to the extent that progress can still be made towards mitigating those worlds? It appears to me that progress in technical fields is usually accomplished through iterative design, so it makes sense to have a high prior on non-iterative approaches being less effective. Depending on your specific numbers here, it seems like it could be worth paying attention either to the areas that are more tractable for iterative design or to those that are less so. I think it’s also misleading to think of iterative design as either working or failing. Fields differ by degree in how prompt and high-quality their feedback is and in how readily they allow repeated trials. It also seems like problems that initially seem hard to iterate on can often be formulated in ways that allow better iteration (like the ELK problem being formulated in a way that allows for testing toy solutions and counterexamples). I worry that focusing in an unnuanced way on worlds where iterative design fails may miss out on opportunities to formulate some of these hard problems in ways that might make them easier to iterate on.
Certainly making some of the hard-to-see stuff visible enough to iterate on is one of the main lines of attack; that’s the central reason why interpretability work is so valuable.