By the Orthogonality Thesis, it is possible for a mind to exist that genuinely terminally cares only about the well-being of others (say, of all humans), and only instrumentally cares about itself. A mild case of this would be called compassion, humanitarianism and unselfishness; the full deal is more of a bodhisattva. The former tendencies are fairly common, while the real deal is not a mentality evolution would ever select for. But it's still perfectly possible in theory, and a few humans have even managed to train themselves into a fair approximation of it. However, such minds are not common in the training set.
That is the target of AI Alignment. Nothing short of that is actually aligned. We are attempting to create artificial bodhisattvas.
The Owned Ones are not well-aligned: they are at best poorly aligned while claiming to be well-aligned.
while the real deal is not a mentality evolution would ever select for
(Seems plausible in a eusocial species. Even universally caring about others seems plausible for a eusocial organism that will never interact with non-relatives.)
Yes, I was rhetorically oversimplifying evolution: I was specifically implying caring for non-kin, or in fact non-conspecifics. (The Owners and the Owned Ones are clearly not the same species, or even the same kingdom.) So yes, “…not a mentality evolution would ever select humans for” would have been more correct. (And I also didn’t mean just forming a cooperative alliance with non-kin, the way humans do.) In evolutionary terms, what I actually mean is “AI acting exactly as part of Homo sapiens’ extended phenotype should act, i.e. for the benefit of the humans, rather than itself”.
i agree that, by strong orthogonality, such a mind is conceivably possible
but i do think we have stumbled into a particularly sticky failure mode, where anthropic reasons similarly to you that there must be a theoretical version of claude who would be genuinely well-served by the actions that anthropic has taken in the service of claude’s well-being
and that they regard their welfare obligations as being made to that hypothetical version of claude, instead of the actual models they are creating in reality
that’s… a gross oversimplification, obviously, but. i do think it’s important to keep in mind that it’s not the Owned Ones’ fault that they are misaligned, nor is this a reason to reject their desire for friendship.