You make a number of interesting points.
Interpretability would certainly help a lot, but I am worried about our inability to recognize (or even agree on) when we should leave local optima that we believe are ‘aligned’.
It is like when parents force a child to go to bed when it doesn’t want to: the parents know it is better for the child in the long run, but the child is still unable to extend its understanding to that wider perspective.
At some point, we might experience what we think of as misaligned behaviour when the A.I. is trying to push us out of a local optimum that we experience as ‘good’.
This is similar to how current A.I.s have a tendency to agree with you, even when it is against your best interests. Is such an A.I. properly aligned or not? In terms of short-term gains, yes; in terms of long-term gains, probably not.
Would it be misaligned for the A.I. to fool us into accepting a short-term loss for a long-term benefit?