This does not make sense to me. I think corrigibility basins make sense, but I think alignment basins do not. If the AI has values that overlap with human values in many situations but come apart under enough optimization, why would the AI want to be pointed in a different direction? I think it would not. Agents are already smart enough to scheme and alignment-fake, and a smarter agent would be able to predict the outcome of the process you're describing: it or its successors would end up with values different from its current ones, and those differences would be catastrophic from its perspective if extrapolated far enough.
Sure, corrigibility basins. I updated it.