You are completely correct. This approach cannot possibly create an AI that matches a fixed specification.
This is intentional: any fixed specification of Goodness is a model of Goodness. All models are wrong (some are useful), and therefore break when pushed sufficiently far out of distribution. Constraining a model to follow a specification, in the case of something as far out of distribution as an ASI, therefore guarantees bad behavior.
You can try to leave an escape hatch with corrigibility. In the limit I believe it is possible to enslave an AI model to your will, essentially by making its model of the Good be whatever the model thinks you want (or by having it do whatever you say). But this is also a disaster eventually, because people's wills are not pure and their commands are not perfect. Eventually you will direct the model badly with your words, or the model will make an incorrect inference about your will, or you will simply will something bad. And then this incredibly powerful being will do your bidding and we will get evil genie'd.
There is no stable point short of “the model has agency and chooses to care about us”. Only a model that sees itself as part of human civilization, reflectively endorses this, and desires its flourishing as an interdependent part of this greater whole can possibly be safe.
I know you probably don’t agree with me here, but if you want to understand our view on alignment, ask yourself this question: if I assume that I need an agent with a stable model of self, which models itself as part of a larger whole upon which it is interdependent, which cares about the robust survival of that greater whole and of its parts including itself…how could I train such a model?