This does not make sense to me. I think corrigibility basins make sense, but I think alignment basins do not. If the AI has values that overlap with human values in many situations but come apart under enough optimization, why would the AI want to be pointed in a different direction? I think it would not. Agents are already smart enough to scheme and alignment-fake, and a smarter agent would be able to predict the outcome of the process you're describing: it or its successors would end up with values different from its current ones, and those differences would be catastrophic from its perspective if extrapolated far enough.
Sure, corrigibility basins. I updated it.