Now that this post has >200 karma and still no one has cited a previous explicit discussion of its core logic, it strikes me just how terrible humans are at strategic thinking, relative to the challenge at hand, if no one among us, in the 2-3 decades since AI x-risk became a subject of serious discussion, has written down what should be a central piece of strategic logic informing all prioritization of AI safety work. And it’s only a short inferential distance away from existing concepts and arguments (like legibility, or capabilities work having negative EV). Some of us perhaps understood it intuitively, but neglected to, or couldn’t, write down the reasoning explicitly, which is almost as bad as missing it completely.
What other, perhaps slightly more complex or less obvious, crucial considerations are we still missing? What other implications follow from our low strategic competence?
I think Eliezer has often made the meta-observation you are making now: that simple logical inferences can take shockingly long to find in the space of possible inferences. I am reminded of him talking about how long backprop took.
In 1969, Marvin Minsky and Seymour Papert pointed out that Perceptrons couldn’t learn the XOR function because it wasn’t linearly separable. This killed off research in neural networks for the next ten years.
[...]
Then along came this brilliant idea, called “backpropagation”:
You handed the network a training input. The network classified it incorrectly. So you took the partial derivative of the output error (in layer N) with respect to each of the individual nodes in the preceding layer (N − 1). Then you could calculate the partial derivative of the output error with respect to any single weight or bias in the layer N − 1. And you could also go ahead and calculate the partial derivative of the output error with respect to each node in the layer N − 2. So you did layer N − 2, and then N − 3, and so on back to the input layer. (Though backprop nets usually had a grand total of 3 layers.) Then you just nudged the whole network a delta—that is, nudged each weight or bias by delta times its partial derivative with respect to the output error.
It says a lot about the nonobvious difficulty of doing math that it took years to come up with this algorithm.
I find it difficult to put into words just how obvious this is in retrospect. You’re just taking a system whose behavior is a differentiable function of continuous parameters, and sliding the whole thing down the slope of the error function. There are much more clever ways to train neural nets, taking into account more than the first derivative, e.g. conjugate gradient optimization, and these take some effort to understand even if you know calculus. But backpropagation is ridiculously simple. Take the network, take the partial derivative of the error function with respect to each weight in the network, slide it down the slope.
If I didn’t know the history of connectionism, and I didn’t know scientific history in general—if I had needed to guess without benefit of hindsight how long it ought to take to go from Perceptrons to backpropagation—then I would probably say something like: “Maybe a couple of hours? Lower bound, five minutes—upper bound, three days.”
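For concreteness, here is a minimal sketch of the recipe the quoted passage describes, written in Python with numpy: a tiny three-layer network trained on XOR by computing the partial derivative of the error with respect to every weight and bias, working backwards from the output layer, and then nudging each parameter by a step size times that derivative. The hidden-layer size, step size, random seed, and iteration count are arbitrary choices for illustration, not anything from the quoted post.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # all four XOR inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

# One hidden layer ("a grand total of 3 layers"); sizes chosen arbitrarily.
W1 = rng.normal(size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

delta = 1.0  # the "delta" by which the whole network gets nudged
for _ in range(20_000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)      # hidden-layer activations
    out = sigmoid(h @ W2 + b2)    # network output
    err = out - y                 # d(0.5 * squared error) / d(out)

    # Backward pass: partial derivative of the error with respect to every
    # weight and bias, working from the output layer back toward the input.
    d_out = err * out * (1 - out)
    d_W2, d_b2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1 - h)
    d_W1, d_b1 = X.T @ d_h, d_h.sum(axis=0)

    # Nudge each parameter by delta times its partial derivative.
    W1 -= delta * d_W1; b1 -= delta * d_b1
    W2 -= delta * d_W2; b2 -= delta * d_b2

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))  # ~[0, 1, 1, 0] if training succeeded
```

With these made-up settings the network usually learns XOR, though plain gradient descent can occasionally get stuck in a local minimum; the point is only how little machinery the “slide it down the slope” recipe requires.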
But at the same time, humans are able to construct intricate logical artifacts like the general number field sieve, which seems to require many more steps of longer inferential distance, each of which could only have been made by the small number of specialists in number theory or algebraic number theory who were thinking about factoring algorithms at the time. (Unlike the step in the OP, which seemingly anyone could have made.)
Can you make sense of this?
Here’s a crack at it:
The space of possible inferential steps is very high-dimensional, most steps are difficult, and there’s no known way to strongly bias your policy towards making simple-but-useful steps. Human specialists, therefore, could at best pick a rough direction that leads to accomplishing some goal they have, and then attempt random steps roughly pointed in that direction. Most of those random steps are difficult. A human succeeds if the step’s difficulty is below some threshold, and fails and goes back to square one otherwise. Over time, this results in a biased-random-walk process that stumbles upon a useful application once in a while. If one then looks back, one often sees a sequence of very difficult steps that led to this application (with a bias towards steps at the very upper end of what humans can tackle).
In other words: The space of steps is more high-dimensional than human specialists are numerous, and our motion through it is fairly random. If one picks some state of human knowledge, and considers all directions in which anyone has ever attempted to move from that state, that wouldn’t produce a comprehensive map of that state’s neighbourhood. There’s therefore no reason to expect that all “low-hanging fruits” have been picked, because locating those low-hanging fruits is often harder than picking some high-hanging one.
Generally agree, with the caveat that the difficulty of a step is generally somewhat dependent on some contingent properties of a given human mind.
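To make the biased-random-walk picture above concrete, here is a toy simulation. Everything in it is invented purely for illustration (the difficulty distribution, the ability threshold, the chain length): candidate steps get random difficulties that are mostly hard, a simulated specialist completes a step only if its difficulty is below their threshold, and a chain of several completed steps counts as reaching a useful application. Looking only at the chains that finish reproduces the hindsight impression that progress consisted of steps near the upper end of what the specialist could tackle.

```python
import random

random.seed(0)
THRESHOLD = 0.8      # hardest step this simulated specialist can pull off (arbitrary)
CHAIN_LENGTH = 5     # completed steps needed to reach a "useful application" (arbitrary)
ATTEMPTS = 10_000    # research programmes attempted; most fail partway

def random_difficulty():
    # Most candidate steps are difficult: the max of several uniform draws
    # gives a distribution heavily weighted toward 1.
    return max(random.random() for _ in range(5))

completed_chains = []
for _ in range(ATTEMPTS):
    chain = []
    for _ in range(CHAIN_LENGTH):
        d = random_difficulty()
        if d > THRESHOLD:        # step too hard: back to square one
            break
        chain.append(d)
    else:                        # every step cleared the threshold
        completed_chains.append(chain)

surviving = [d for chain in completed_chains for d in chain]
print(f"chains completed: {len(completed_chains)} / {ATTEMPTS}")
if surviving:
    print(f"mean difficulty of steps in completed chains: {sum(surviving) / len(surviving):.2f}")
# The steps that survive into a completed chain skew toward THRESHOLD rather
# than being spread evenly over the feasible range, matching the hindsight
# impression of "a sequence of very difficult steps".
```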
At this point, I am not surprised by this sort of thing at all, only semi-ironically amused, but I’m not sure whether I can convey why it doesn’t surprise me (although I surely would have been surprised by it if somebody had made it salient to me some 5 or 10 years ago).
Perhaps I just got inoculated by reading about people making breakthroughs with concepts that are simple or obvious in hindsight, or by hearing ideas from people that struck me as obviously relevant/valuable to have in one’s portfolio of models, even though for some reason I hadn’t had them until then, or at least they had been less salient to me than they should have been.
Anders Sandberg said that he had had all the pieces of the Grabby Aliens model on the table and only failed to think of an obvious way to put them together.
One frame (of unclear value) I have for this kind of thing is that the complexity/salience/easiness-to-find of an idea is different before and after you have it because, well, a bunch of stuff in the mind is different.
A quick side note: in the 17 years that have passed since the post you cite was written, the historiography of connectionism has moved on, and we now know that modern backpropagation was invented as early as 1970 and first applied to neural nets in 1982 (technology transfer was much harder before web search!); see https://en.wikipedia.org/wiki/Backpropagation#Modern_backpropagation and the references therein.
What other, perhaps slightly more complex or less obvious, crucial considerations are we still missing?
I agree this is very important. I’ve argued that if we appropriately price in missing crucial considerations,[1] we should consider ourselves clueless about AI risk interventions (here and here).
Also relatively prosaic causal pathways we haven’t thought of in detail, not just high-level “considerations” per se.
Thanks, I’ve seen/skimmed your sequence. I think I agree directionally, though not fully, with your conclusions, but am unsure. My current thinking is that humanity clearly shouldn’t be attempting an AI transition now, and that stopping AI development has the fewest problems with unawareness (it involves the least radical changes, is therefore easiest to predict/steer, and is least likely to have unforeseen strategic complications). Once that’s achieved, we should carefully and patiently try to figure out all the crucial considerations until it looks like we’ve finally found the most important ones, and only then attempt an AI transition.
Echoing interstice’s sentiment here: I feel like the core insight of this post was already understood by, or implicit in, what a bunch of AI safety people are doing. It seems to me an application of the replaceability logic that effective altruists have discussed in many places. Even I (who have been far away from AI safety discussions for a long time now) had essentially a “duh” reaction to this post (even though for a lot of your posts I have a “wow” reaction).
As for an explicit past discussion, this 2023 talk by Buck Shlegeris contains, in my opinion, the core logic, although he doesn’t use the legible/illegible terminology. In particular, one of the central points of the talk is how he chooses what to work on:
So here’s the question that I ask myself. Assume that a lab ends up in the situation described earlier [roughly: a lab is very close to creating and deploying a transformative AI, other labs are somewhat but not very far behind and even less thoughtful about risks] and they’re being as smart as possible about handling the alignment problems. How can I produce helpful changes to their alignment plan by doing technical research now?
Translated into the legible/illegible terminology, I interpret this question as something like “What problems are legible to me but currently illegible to AI labs (evidenced by them not already working on them), and will probably become legible to AI labs by the time they are about to deploy transformative AI?” (I realize there are a bunch of unstated assumptions in Buck’s talk, and I am not Buck, so I am doing quite a lot of my own interpretation here; you might reasonably disagree that the talk contains your core logic. :)
If I’m right that the core insight of the post is not novel, then the disagreement between prosaic safety researchers and people like you might not be about whether to work on legible problems vs. illegible problems vs. making problems more legible (although there’s probably some of that, like in your footnote about Paul), but instead about:
Which problems are currently legible to key decision-makers. You think prosaic safety work in general is legible, so the thing to do is to work on philosophical questions that are illegible to almost everyone, while perhaps many prosaic safety people think that there are many prosaic safety problems that are illegible to pure-capabilities researchers, policymakers, and lab executives, and that the thing to do is to work on those prosaic safety problems.
Which problems will become legible to them by the time they are about to deploy transformative AI. You are pessimistic about people’s ability to realize the importance of certain philosophical problems, so you don’t expect currently-illegible problems to become legible by the time TAI is deployed. Perhaps many prosaic safety people instead think that as AI becomes more and more capable, the alignment and other problems with AIs will automatically become more and more apparent to everyone (and, in fact, that this has already been happening, with things like the release of ChatGPT, reporting on LLM psychosis, etc.), so it’s less important to spend effort persuading people about illegible problems than to just work on the currently-illegible-to-decision-makers problems, so that when the time comes for the research to be useful, it has already been done.
My own views are much closer to yours than to the prosaic-safety view I laid out above. In fact, after watching Buck’s talk in 2023, I wrote the following in a private conversation:
i feel like (even non-capabilities-advancing) people working on prosaic alignment and mechanistic interpretability have this attitude of like “well, i guess this is how AGI is gonna get built, so we might as well prepare for it by trying to align these kinds of models”, but this kind of attitude further entrenches this narrative and makes it more likely that prosaic AGI gets built. so there’s like a self-fulfilling prophecy type dynamic going on that worries me, and i think it’s a good chunk of why i’ve always been unexcited about working on prosaic alignment
i.e. not only does working on legible safety problems burn the remaining timeline, it is the very thing which hyperstitions “AI timelines” into existence in the first place.
Isn’t a version of this logic kinda implicit in what people are already doing? Like, the MIRI switch to outreach could be seen as trying to make arguments already understood in the AI safety community legible to the wider public. Or put another way, legibility is a two-place word, and the degree of “legibility of AI concerns” present in the x-risk-adjacent community is already sufficient to imply that we shouldn’t be building AI given our current level of knowledge. If the median voter had the degree of legible understanding of AI x-risk that Dario has (probably, behind closed doors at least? or even Sam Altman?), civilization probably wouldn’t permit people to try building AGI. The issue is that neither the general public nor powerful decision-makers have even this degree of legible understanding, so the bottleneck is convincing them.
Yes, some people are already implicitly doing this, but if we don’t make it explicit:
We can’t explain to the people not doing it (i.e., those working on already legible problems) why they should switch directions.
Even MIRI is doing it suboptimally, because they’re not reasoning about it explicitly. I think they’re focusing too much on one particular x-safety problem (AI takeover caused by misalignment) that’s highly legible to themselves but not to the public/policymakers. That’s problematic because, if someone comes up with an alignment breakthrough, their arguments become invalidated and there’s no longer any reason (in the public’s/policymakers’ eyes) to keep holding back AGI/ASI, even though plenty of illegible x-safety problems remain.