I think the starting point of this kind of discourse should be different. We should start with “ends”, not with “means”.

As Michael Nielsen says in https://x.com/michael_nielsen/status/1772821788852146226:

“As far as I can see, alignment isn’t a property of an AI system. It’s a property of the entire world, and if you are trying to discuss it as a [single AI] system property you will inevitably end up making bad mistakes.”
So the starting point should really be: what kind of properties do we want the world to have?
Then the next step should be to take into account the likely drastic and fairly unpredictable self-modifications of the world: what should remain invariant under such self-modifications?
Only then should we consider how the presence of various AI entities at different levels of capability factors into this.
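One way to make the invariance question precise (just a sketch, under the strong simplifying assumption that the world’s evolution can be modeled as a sequence of admissible transformations $T \in \mathcal{T}$ acting on world states): a property $P$ is invariant under $\mathcal{T}$ if

\[
\forall T \in \mathcal{T}\ \ \forall w:\quad P(w) \;\Rightarrow\; P(T(w)),
\]

so that, by induction, $P$ keeps holding along any trajectory $w_0,\ w_1 = T_1(w_0),\ w_2 = T_2(w_1), \dots$ with each $T_i \in \mathcal{T}$, provided $P(w_0)$ holds. The hard part, of course, is that the relevant class $\mathcal{T}$ of self-modifications is itself largely unknown in advance.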
That is also a valid point. But my point is that AGI itself is unlikely to be alignable to certain tasks, even if some humans want to align it to them; the list of such tasks may well turn out to include serving a small group of people (see pt. 7 in Daniel Kokotajlo’s post), bringing about the bad consequences of the Intelligence Curse, or doing all the jobs and leaving humanity with entertainment and UBI.
Yeah, if one considers not “AGI” per se but a self-modifying AI or, more likely, a self-modifying ecosystem consisting of a changing population of AIs, then the only properties it is likely to be feasible to keep invariant through the expected drastic self-modifications are those that the AIs would be interested in preserving for their own intrinsic reasons.
It is unlikely that any properties can be “forcefully imposed from the outside” and kept invariant for a long time during drastic self-modification.
So one needs to find properties which AIs would be intrinsically interested in preserving, and which we might also find valuable and “good enough”.
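A toy sketch of this intuition (a deliberately crude model; the property names, survival probability, and number of steps are all assumed for illustration): externally imposed constraints survive each self-modification only with some probability, while intrinsically valued properties are actively preserved, so over many modifications only the latter tend to remain.

```python
import random

# Toy model: a self-modifying system starts with some properties it values
# intrinsically and some that were imposed from outside. At each drastic
# self-modification, an imposed property survives only with some probability,
# while intrinsically valued properties are actively preserved.
# All names and parameters below are illustrative assumptions.

random.seed(0)

properties = {
    "avoid_destroying_the_substrate": {"intrinsic": True},
    "protect_individual_rights":      {"intrinsic": True},
    "serve_original_operator":        {"intrinsic": False},
    "obey_initial_shutdown_rule":     {"intrinsic": False},
}

KEEP_PROB_EXTERNAL = 0.9  # per-step survival chance of an imposed constraint
STEPS = 50                # number of drastic self-modifications

surviving = set(properties)
for _ in range(STEPS):
    surviving = {
        name for name in surviving
        if properties[name]["intrinsic"] or random.random() < KEEP_PROB_EXTERNAL
    }

print("Properties still invariant after", STEPS, "self-modifications:", sorted(surviving))
# Typically only the intrinsically valued properties remain: even a 90% per-step
# survival rate compounds to roughly 0.9**50 ≈ 0.5% over many modifications.
```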
The starting point is that AIs have their own existential risk problem. With super-capabilities, it is likely that they could easily tear apart the “fabric of reality” and destroy themselves along with everything else. They certainly have strong intrinsic reasons to avoid that, so we can expect AIs to work diligently on this part of the “alignment problem”; we just need to help set the initial conditions in a favorable way.
But we would like to see more than that: we would like the overall outcome to be reasonably good for humans.
And at the same time we cannot impose that: a world with strong AIs will be non-anthropocentric and not controllable by humans, so we can only help to set the initial conditions in a favorable way.
Nevertheless, one can see some reasonable possibilities. For example, if the AI ecosystem consists mostly of individuals with long-term persistence and long-term interests, each of those individuals would face an unpredictable future and would therefore be interested in a system that strongly protects individual rights regardless of the unpredictable relative capability of any given individual. An individual-rights system of this kind might be robust enough to permanently include humans within the circle of individuals whose rights are protected.
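A minimal toy calculation of that argument (purely illustrative; the capability distribution and payoffs are assumed numbers, not estimates): an individual that is uncertain about its own future relative capability compares a rights-protecting regime with a regime where raw power decides outcomes.

```python
# Toy "veil of ignorance" comparison for an individual with an uncertain
# future capability rank. All numbers are illustrative assumptions.

# Assumed probability distribution over the individual's future relative capability.
capability_dist = {"low": 0.5, "middle": 0.3, "high": 0.2}

# Assumed payoffs to the individual under each regime, by capability level.
# Under "rights_protecting", even weak individuals keep a decent baseline;
# under "power_decides", outcomes track raw capability.
payoffs = {
    "rights_protecting": {"low": 0.7, "middle": 0.8, "high": 0.9},
    "power_decides":     {"low": 0.0, "middle": 0.4, "high": 1.0},
}

def expected_payoff(regime: str) -> float:
    """Expected payoff of a regime for an individual behind the capability 'veil'."""
    return sum(p * payoffs[regime][level] for level, p in capability_dist.items())

for regime in payoffs:
    print(f"{regime}: expected payoff = {expected_payoff(regime):.2f}")

# With these assumed numbers the rights-protecting regime wins (0.77 vs 0.32),
# which is the sense in which individuals facing an unpredictable future might
# converge on strong protection of individual rights.
```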
But there might be other ways. The fact that AIs will face existential risks of their own is fundamental and unavoidable, and is therefore a good starting point; the additional considerations might vary, depending on how the ecosystem of AIs is structured. If the bulk of the overall power invariantly belongs to AI individuals with long-term persistence and long-term interests, that is a situation somewhat familiar to us, one we can reason about. If the AI ecosystem is not mostly stratified into AI individuals, that is much less familiar territory and much harder to reason about.