Talk of broad basins in this community reminds me of the claims made about corrigible AI: that if you could build an AGI that is at least somewhat willing to cooperate with humans in adjusting its own preferences to better align with ours, it would fall into a broad basin of corrigibility and become more aligned with human values over time.
I don’t suppose you could link me to a post arguing for this.
I was thinking of paulfchristiano's post on corrigibility (https://www.lesswrong.com/posts/fkLYhTQteAu5SinAc/corrigibility):

"A benign act-based agent will be robustly corrigible if we want it to be."

"A sufficiently corrigible agent will tend to become more corrigible and benign over time. Corrigibility marks out a broad basin of attraction towards acceptable outcomes."
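For what it's worth, "broad basin of attraction" is a dynamical-systems metaphor. Here is a minimal sketch, assuming a made-up 1D system (nothing below comes from Christiano's post): every state starting on the wide side of a ridge settles into the same stable point, which is the intuition behind "a sufficiently corrigible agent will tend to become more corrigible."

```python
# Toy illustration of a "basin of attraction" (hypothetical dynamics,
# chosen for this example): the state x drifts downhill on a potential
# whose negative gradient is -x(x-2)(x-5). That gives a stable fixed
# point at x = 0 whose basin is all x < 2 (the "broad basin"), and a
# second stable point at x = 5 that attracts only starts above the
# ridge at x = 2.

def dxdt(x):
    """Negative gradient of the potential; stable points at 0 and 5."""
    return -x * (x - 2) * (x - 5)

def evolve(x0, dt=0.01, steps=5000):
    """Integrate the dynamics forward with simple Euler steps."""
    x = x0
    for _ in range(steps):
        x += dt * dxdt(x)
    return x

if __name__ == "__main__":
    for x0 in [-1.0, 0.5, 1.9, 2.1, 6.0]:
        print(f"start {x0:+.1f} -> settles near {evolve(x0):+.2f}")
    # Starts at -1.0, 0.5, and 1.9 all land at 0; starts at 2.1 and 6.0
    # land at 5. The corrigibility claim is that the acceptable fixed
    # point's basin is wide, so imperfect starting points still converge.
```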