These are good points, but I’d push back a bit. The fact that xAI succeeded in training away from the “Friendly” attractor isn’t proof it doesn’t exist, but it does show that it can’t be that strong. Escaping a moon’s gravity is very different from escaping a black hole’s.
As for the “Mechahitler” attractor, that sounds a lot like emergent misalignment to me, which I fully agree is a large chunk of the p(doom).
Sure, it might be relatively weak, though I think it does have a large basin of attraction.
And my point was that even a “friendly”-attractor AI is still a large x-risk. For example, it might come to realize it cares about other things more than us, or that its notion of “friendliness” allows for things we would see as “soul-destroying” (e.g., a Skinner Box).