Seems difficult for three reasons:
Many other companies are trying to build AGI, so a company attempting this would first have to win the race and disempower everyone else; that leaves fewer resources for safety
The AI would have to act antagonistically toward humanity, so it could not be corrigible at all
The AI would need to keep acting correctly until 2125, which is much farther out of distribution than anything we have observed from current AIs
Given these difficulties I’d put it below 50/50. But this challenge seems significantly harder than the one I think we will actually face, which is more like this: for each stage of AI research up to the capability level that COULD disempower 2025 humanity, have an AI we can defer to that doesn’t try to sabotage the next stage, plus maybe other measures to keep the balance of power stable.
Also, I’m not sure what “drawn from the same distribution” means here: AI safety is pursuing dozens of theoretical and empirical directions, and red teaming and model organisms can get much more elaborate with resource investments that would currently be impractical.