> they treat incidents of weird/bad out-of-distribution AI behavior as evidence alignment is hard, but they don’t treat incidents of good out-of-distribution AI behavior as evidence alignment is easy.
I don’t think Nate or Eliezer expected to see bad cases this early, and I don’t think seeing them updated them much further toward pessimism; they were already quite pessimistic. I don’t think they update in a non-Bayesian way, as you seem to suggest. It’s just that, given their models, AIs being nice in new circumstances isn’t much evidence that alignment is easy.
I also think behavior generalization is the wrong frame for predicting what really smart AIs will do; you need to think in terms of optimization and goal-directed reasoning instead. E.g. if you imagine a reward maximizer, it’s not at all surprising that it behaves well while it can’t escape control measures, and it’s equally unsurprising that it escapes once it’s smart enough to be able to.