I agree that, at least for the more serious risks, there doesn’t seem to be consensus on what the mitigations should be.
For example, I’d be interested to know what proportion of alignment researchers would consider an AGI that’s a value learner (and that of course starts that value learning process from some initial model of human values created by humans) to have better outer-alignment safety properties than an AGI with a fixed utility function created by humans.
To me it’s very clear that the former is better, as it incentivizes the AGI to converge from its initial model of human values towards true human values, allowing it to fix problems when the initial model, say, is taken out-of-distribution or doesn’t have sufficient detail. But I have no idea how much consensus there is on this, and I see a lot of alignment researchers working on approaches that don’t appear to assume that the AI system is a value learner.