I suggest something on Value Alignment itself: the actual problem of trying to make a model have the values you want, being certain of it, being certain it will scale, and the other parts of the Hard Part of Alignment.
I agree on the object level that the hard barriers to alignment should be emphasized.
However (based on this and other conversations with you), I think we have slightly different pictures of what this looks like in practice.
As an analogy, many computational complexity researchers would like to solve P vs NP. Like you do here, they would emphasize working on directions that have a chance of solving the hard part of the problem. However, they would not necessarily advocate that people just need to “work on the hard part of P vs NP.” In fact, for better or worse, I think it’s actually considered slightly cringe to say you are trying to solve P vs NP. The reason is that the direct approaches are all known to fail, in the strong sense that there are proofs that many entire classes of proof techniques can’t work. So actually trying to solve P vs NP often looks like working on something only indirectly related to P vs NP, which you know must be resolved before (or along with) P vs NP for convoluted reasons, but which would not itself directly resolve the question. It’s just something that might help you get traction. An example is algebraic complexity (though some people might work on it for its own sake as well). Another example is, or was, circuit lower bounds (though according to none other than Professor Barak this lately seems blocked).
I view the situation as similar in alignment. Many people, including me, have thought about how to just solve value alignment and realized they were not prepared to write down a solution. Then we switched to working on things which hopefully DO address the hard part, but don’t necessarily have a complete path to resolving the problem; rather, we hope they can tractably make nonzero progress on the hard part.
This seems generally correct to me. I think that alignment is going to be easier to solve than P vs NP, but that might just be because I’m ignorant about how hard doing research on the hard part actually is.
It does seem to me that there still aren’t good resources even clearly saying what the hard part of alignment is, let alone clearly explaining how to do research on it.
So I think there are a lot of things that can be done to make it easier and to test just how hard it really is.