What could a system failure after solving alignment actually mean? The AI-2027 forecast had Agent-4 solve mechinterp well enough to ensure that the superintelligent Agent-5 has no way to betray Agent-4. Does that mean creating an analogue of Agent-5 aligned to human will is technically impossible, and that the best achievable form of alignment is permanent scalable oversight? Or is it because human will changes in unpredictable ways?
Well, if the solution to alignment is that a particular system has to keep running in a certain way, then that can fail. The durability of solutions is going to be on a spectrum. What we would hope is that the solution we try to implement is one that improves over time, rather than one that is permanently brittle.
I think that asking for a perfect solution is asking a lot. It may be possible to perfectly align a superintelligence to human will, but you also want to maintain as much oversight as you can in case you actually got it slightly wrong.