I went through a bunch of similar thoughts before writing the self-unalignment problem. When we talked about this many years ago with Paul, my impression was that this is actually somewhat cruxy and that we disagree about self-unalignment: my mental image is that if you start with an incoherent bundle of self-conflicted values and plug it into an IDA-like dynamic, you can end up in arbitrary places, including very bad ones. (Also cf. the part of Scott’s review of What We Owe the Future where he worries that in a philosophy game, a smart moral philosopher can extrapolate his values to ‘I have to have my eyes pecked out by angry seagulls or something’, and hence he does not want to play the game. AIs will likely be more powerful in this game than Will MacAskill.)
My current position is that we still don’t have a good answer. I don’t trust the response ‘we can just assume the problem away’, nor the response ‘this is just another problem which you can delegate to future systems’. On the other hand, existing AIs already seem to be doing a lot of value extrapolation, and the results sometimes seem surprisingly sane, so maybe we will get lucky, or a larger part of morality is convergent. Still, it’s worth noting that these value-extrapolating AIs are not necessarily what AI labs want or what the traditional alignment program aims for.
I pretty much agree you can end up in arbitrary places with extrapolated values, and I don’t think morality is convergent, but I also don’t think it matters for the purpose of existential risk. Assuming something like instruction following works, the extrapolation problem can be solved by ordering AIs not to extrapolate values to cases where they get tortured/killed in an ethical scenario. More generally, I don’t expect value extrapolation to matter for the purpose of making an AI safe to use.
The real impact is on CEV-style alignment plans/plans for what to do with a future AI, which are really bad plans by the standards of a lot of people’s current values, and thus I really don’t want CEV to be the basis of alignment.
Thankfully, CEV is unlikely to ever be that basis, but the issue still matters somewhat, especially since Anthropic is targeting value alignment (though thankfully there are implicit constraints/grounding based on the values chosen).