These are good points, but I’d push back a bit. The fact that xAI succeeded in training away from the “Friendly” attractor isn’t proof it doesn’t exist, but it does show that it can’t be that strong. Escaping a moon’s gravity is very different from escaping a black hole’s.
As for the “Mechahitler” attractor, that sounds a lot like emergent misalignment to me, which I fully agree is a large chunk of the p(doom).
Sure, it might be relatively weak, though I think it does have a large basin of attraction.
And my point was that even a “friendly”-attractor AI is still a large x-risk. For example, it might come to realize it cares about other things more than us, or that its notion of “friendliness” allows for things we would see as “soul-destroying” (e.g., a Skinner Box).