(humans are much more likely than AIs to extrapolate into the “actively evil” zone, rather than the “lethally indifferent”)
It seems to me that our use of the term “actively evil” is itself shaped by what is already in our own “training data”.
Many things called “actively evil” may earn that designation simply because they are things humans have already done and that have been judged evil. Since actions of this type are now well known to be evil, a human who chooses one can only be making an active choice to do it anyway, presumably because it is viewed as necessary for some goal that supersedes that socially cached judgement.
I don’t see why an AI couldn’t reason the same way: knowing (in some sense) that humans judge certain actions and outcomes to be evil, yet disregarding that judgement and acting anyway because the action lies on a path to some instrumental or terminal goal. I think that would be actively evil in the same sense in which many humans can be said to be actively evil.
Do you mean that the space of possible actions an AI explores might be so much larger than the one explored by all humans in history combined that, just by chance, it never implements anything similar enough to known evils? That seems implausible to me unless the AI were actively avoiding known evil, and therefore already at least somewhat aligned.
Apart from that, it’s possible we just differ on the use of the term “lethally indifferent”. I take it to mean “doesn’t know the consequences of its actions for other sentient beings”, like a tsunami, or a narrowly focused paperclipper with no model of other agents. I suspect you mean “knows but doesn’t care”, which I would describe as “actively evil”.