Assumption 1: Most of us are not saints.
Assumption 2: AI safety is a public good.[1]
[..simple standard incentives..]
Implication: The AI safety researcher, eventually finding himself too unlikely to individually be pivotal on either side, may quite ‘rationally’[2] switch to ‘standard’ AI work.[3]
So: a rather simple explanation seems to suffice to make sense of the big-picture pattern you describe.
That doesn’t mean the inner tension you point out isn’t interesting. But I don’t think very deep psychological factors are needed to explain the general ‘AI safety becomes AI instead’ tendency, which I had the impression the post was meant to suggest.
[1] Or: unaligned/unloving/whatever AGI is a public bad.
[2] I mean: individually ‘rational’ once we factor in another trait. Assumption 1b: the unfathomable scale of potential aggregate disutility from AI gone wrong bottoms out into a constrained ‘negative’ individual utility, i.e. the emotional value non-saint Joe places on it. So a 0.1 permille probability of saving the universe may, individually rationally, be dominated by mundane stuff like keeping a still somewhat cool and well-paying job. (A minimal expected-value sketch of this trade-off follows after the footnotes.)
[3] The switch may be psychologically even easier if the employer started out as genuinely well-intentioned and still retains a somewhat ambiguous air.
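To make footnote [2] concrete, here is a minimal sketch using symbols of my own (none of them appear in the post): let $p$ be the probability that Joe’s safety work is individually pivotal, $\bar V$ the emotionally capped value non-saint Joe places on averting AI doom, and $u_{\text{job}}$ the utility of the ordinary, well-paying AI job. The switch is individually ‘rational’ whenever

$$p \cdot \bar V \;<\; u_{\text{job}}, \qquad \text{e.g. } p = 10^{-4} \;\Rightarrow\; \text{switch iff } \bar V < 10^{4}\, u_{\text{job}}.$$

Because $\bar V$ is bounded by emotional constraints rather than scaling with the astronomical stakes, that inequality is easy to satisfy.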