I am also somewhat dissatisfied with the basin of attraction metaphor, but for a slightly different reason.
I am concerned that an AI that functions as mostly corrigible in environments that resemble the training environment will be less corrigible when the environment changes significantly.
I’m guessing that a better metaphor would be based on evolutionary pressures. That would emphasize both the uncertainties about any given change, and the sensitivity to out-of-distribution environments.
Maybe a metaphor about how cats are sometimes selected for being friendly to humans? Or the forces that led to the peacock’s tail?
I am also somewhat dissatisfied with the basin of attraction metaphor, but for a slightly different reason.
I am concerned that an AI that functions as mostly corrigible in environments that resemble the training environment will be less corrigible when the environment changes significantly.
I’m guessing that a better metaphor would be based on evolutionary pressures. That would emphasize both the uncertainties about any given change, and the sensitivity to out-of-distribution environments.
Maybe a metaphor about how cats are sometimes selected for being friendly to humans? Or the forces that led to the peacock’s tail?