Echoing interstice’s sentiment here, but I feel like the core insight of this post was already understood by, or implicit in, what a bunch of AI safety people are doing. It seems to me an application of the replaceability logic that effective altruists have discussed in many places. Even I (who have been far away from AI safety discussions for a long time now) had essentially a “duh” reaction to this post (even though for a lot of your posts I have a “wow” reaction).
As for an explicit past discussion, in my opinion this 2023 talk by Buck Shlegeris contains the core logic, although he doesn’t use the legible/illegible terminology. In particular, one of the central points of the talk is how he chooses what to work on:
So here’s the question that I ask myself. Assume that a lab ends up in the situation described earlier [roughly: a lab is very close to creating and deploying a transformative AI, other labs are somewhat but not very far behind and even less thoughtful about risks] and they’re being as smart as possible about handling the alignment problems. How can I produce helpful changes to their alignment plan by doing technical research now?
Translated into the legible/illegible terminology, I interpret this question as something like “What problems are currently legible to me but illegible to AI labs (evidenced by them not already working on them), but will probably become legible to AI labs by the time they are about to deploy transformative AI?” (I realize there are a bunch of unstated assumptions in Buck’s talk, and I am not Buck, so I am doing quite a lot of my own interpretation here; you might reasonably disagree that the talk contains your core logic. :)
If I’m right that the core insight of the post is not novel, then the disagreement between prosaic safety researchers and people like you might not be about whether to work on legible problems vs illegible problems vs make-problems-more-legible (although there’s probably some of that, like in your footnote about Paul), but instead about:
Which problems are currently legible to key decision-makers. You think prosaic safety work in general is legible, so the thing to do is to work on philosophical questions that are illegible to almost everyone. Perhaps many prosaic safety people instead think that many prosaic safety problems are still illegible to pure-capabilities researchers, policymakers, and lab executives, and that the thing to do is to work on those prosaic safety problems.
Which problems will become legible to them by the time they are about to deploy transformative AI. You are pessimistic about people’s ability to realize the importance of certain philosophical problems, so you don’t expect currently-illegible problems to become legible by the time TAI is deployed. Perhaps many prosaic safety people instead think that as AI becomes more capable, the alignment and other problems with AIs will automatically become more apparent to everyone (and in fact that this has already been happening, with things like the release of ChatGPT, reports of LLM psychosis, etc.), so it’s less important to spend effort persuading people about illegible problems than to just work on the currently-illegible-to-decision-makers problems, so that when the time comes for the research to be useful, it has already been done.
My own views are much closer to yours than to the prosaic-safety view I laid out above. In fact, after watching Buck’s talk in 2023, I wrote the following in a private conversation:
i feel like (even non-capabilities-advancing) people working on prosaic alignment and mechanistic interpretability have this attitude of like “well, i guess this is how AGI is gonna get built, so we might as well prepare for it by trying to align these kinds of models”, but this kind of attitude further entrenches this narrative and makes it more likely that prosaic AGI gets built. so there’s like a self-fulfilling prophecy type dynamic going on that worries me, and i think it’s a good chunk of why i’ve always been unexcited about working on prosaic alignment
I.e., not only does working on legible safety problems burn the remaining timeline, it is the very thing that hyperstitions “AI timelines” into existence in the first place.