sil comments on Legible vs. Illegible AI Safety Problems

sil 5 Nov 2025 8:27 UTC
3 points
0
I do not know if you consider gradual disempowerment to be an illegible problem in AI safety (as I do), but it is certainly a problem independent of corrigibility/alignment.
As such, work on alignment/corrigibility, whether it tackles legible or illegible problems, can have the same effect. Is AI safety worth pursuing when it could lead to a world with fundamental power shifts to the disadvantage of most humans?
(I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment, despite repeatedly saying that I wanted to see more philosophers working on AI alignment/x-safety. I now realize what I really wanted was for philosophers, and more people in general, to work on the currently illegible problems, especially or initially by making them more legible.)
I agree heartily, and I feel there have been various expressions of the “paradox” of alignment research: it is a balancing act between enabling accelerationism and enabling safety. Ultimately, however, both pursuits enable the end goal of aligned AI.
Aligned AI could optimistically lead to a post-scarcity utopia, but it could also lead to highly dystopian power dynamics. Ensuring the optimist’s hope is realized seems (to me) to be a highly illegible problem. Those in the AI safety research space largely ignore it in favor of tackling comparatively more legible problems, a category that here includes even the illegible alignment problems.
All of this is to say I feel the same thing you feel, but for all of AI safety research.