maybe you are only saying that the explanation as written is bad at taking readers from (1) to (4) because it does not explicitly mention (2), i.e. not technically wrong but still a bad explanation. [...] But I don’t see how it makes Duncan’s summary either untrue or misleading, because eliding it doesn’t change (1) or (4).
But bad explanations are wrong, untrue, and misleading.
Suppose the one comes to you and says, “All squares are quadrilaterals; all rectangles are quadrilaterals; therefore, all squares are rectangles.” That argument is wrong—“technically” wrong, if you prefer. It doesn’t matter that the conclusion is true. It doesn’t even matter that the premises are also true. It’s just wrong.
Okay, but why is it wrong though? I still haven’t seen a convincing case for that! It sure looks to me like, given an assumption which I still feel confused about whether you share, the conclusion does in fact follow from the premises, even in metaphor form.
I am open to the case that it’s a bad argument. If it is in fact a bad argument then that’s a legitimate criticism. But from my perspective you have not adequately spelled out how “deep nets favor simple functions” implies it’s a bad argument.
You said, “I don’t see how [not mentioning inductive biases] makes Duncan’s summary either untrue or misleading, because eliding it doesn’t change (1) [we choose “teal shape” data to grow the “black shape” AI] or (4) [we don’t get the AI we want].” But the point of the broken syllogism in the grandparent is that it’s not enough for the premise to be true and the conclusion to be true; the conclusion has to follow from the premise.
The context of the teal/black shape analogy in the article is an explanation of how “modern AIs aren’t really designed so much as grown or evolved” with the putative consequence that “there are many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment, and most of them don’t resemble the thing the programmers had in mind”.
Set aside the question of superintelligence for the moment. Is this true as a description of “modern AIs”, e.g., image classifiers? That’s not actually clear to me.
It is true that adversarially robust image classification isn’t a solved problem, despite efforts: it’s usually possible (using the same kind of gradient-based optimization used to train the classifiers themselves) to successfully search for “adversarial examples” that machines classify differently than humans, which isn’t what the programmers had in mind.
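To be concrete about what kind of gradient-based search I mean, here’s a minimal sketch of the simplest single-step version (FGSM); practical attacks like PGD just iterate the same step. The classifier and “images” below are toy placeholders, not any particular model or dataset:

```python
import torch
import torch.nn as nn

def fgsm_attack(model, loss_fn, x, y, eps=0.03):
    """Perturb inputs one step in the direction that *increases* the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

# Toy stand-ins for a trained classifier and a batch of images.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(8, 1, 28, 28)
y = torch.randint(0, 10, (8,))

x_adv = fgsm_attack(model, nn.CrossEntropyLoss(), x, y)
# Fraction of examples whose predicted class survived the perturbation:
print((model(x).argmax(1) == model(x_adv).argmax(1)).float().mean().item())
```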
But Ilyas et al. 2019 famously showed that adversarial examples are often due to “non-robust” features that are doing predictive work, but which are counterintuitive to humans. That would be an example of our data pointing at, as you say, an “underlying simplicity of which we are unaware”.
I’m saying that’s a different problem than a counting argument over putative “many, many, many different complex architectures that are consistent with behaving ‘properly’ in the training environment”, which is what the black/teal shape analogy seems to be getting at. (There are many, many, many different parametrizations that are consistent with behaving properly in training, but I’m claiming that the singular learning theory story explains why that might not be a problem, if they all compute similar functions.)
Thank you for attempting to spell this out more explicitly. If I understand correctly, you are saying singular learning theory suggests that AIs with different architectures will converge on a narrow range of similar functions that best approximate the training data.
With less confidence, I understand you to be claiming that this convergence implies that (in the context of the metaphor) a given [teal thing / dataset] may reliably produce a particular shape of [black thing / AI].
So (my nascent Zack model says) the summary is incorrect to analogize the black thing to “architectures” instead of “parametrizations” or “functions”, and more importantly incorrect to claim that the black shape’s many degrees of freedom imply it will take a form its developers did not intend. (Because, by SLT, most shapes converge to some relatively simple function approximator.)
But...it does seem to me like an AI trained using modern methods, e.g. constitutional AI, is insufficiently constrained to embody human-compatible values even given the stated interpretation of SLT. Or in other words, the black shape is still basically unpredictable from the perspective of the teal-shape drawer. I’m not sure you disagree with that?
As an exercise in inferential gap-crossing, I want to try to figure out what minimum change to the summary / metaphor would make it relatively unobjectionable to you.
Attempting to update the analogy in my own model, it would go something like: You draw a [teal thing / dataset]. You use it to train the [black thing / AI]. There are underlying regularities in your dataset, some of which are legible to you as a human and some of which are not. The black thing conforms to all the regularities. This does not by coincidence happen to cause it to occupy the shape you hoped for; you do not see all the non-robust / illegible features constraining that shape. You end up with [weird shape] instead of [simple shape you were aiming for.]
A more skeptical Zack-model in my head says “No, actually, you don’t end up with [weird shape] at all. SLT says you can get [shape which robustly includes the entire spectrum of reflectively consistent human values] because that’s the function being approximated, the underlying structure of the data.” I dunno if this is an accurate Zack-model.
(I am running into the limited bandwidth of text here, and will also DM a link to schedule a conversation if you’re so inclined).
Sorry, I don’t want to accidentally overemphasize SLT in particular, which I am not an expert in. I think what’s at issue is how predictable deep learning generalization is: what kind of knowledge would be necessary in order to “get what you train for”?
This isn’t obvious from first principles. Given a description of SGD and the empirical knowledge of 2006, you could imagine it going either way. Maybe we live in a “regular” computational universe, where the AI you get depends on your architecture and training data according to learnable principles that can be studied by the usual methods of science in advance of the first critical try, but maybe it’s a “chaotic” universe where you can get wildly different outcomes depending on the exact path taken by SGD.
A lot of MIRI’s messaging, such as the black shape metaphor, seems to assume that we live in a chaotic universe, as when Chapter 4 of If Anyone Builds It claims that the preferences of powerful AI “might be chaotic enough that if you tried it twice, you’d get different results each time.” But I think that if you’ve been paying attention to the literature about the technology we’re discussing, there’s actually a lot of striking empirical evidence that deep learning is much more “regular” than someone might have guessed in 2006: things like how Draxler et al. 2018 showed that you can find continuous low-loss paths between the results of different training runs (rather than being in different basins which might have wildly different generalization properties), or how Moschella et al. 2022 found that different models trained on different data end up learning the same latent space (such that representations by one can be reused by another without extra training). Those are empirical results; the relevance of SLT is as a theoretical insight as to how these results are even possible, in contrast to how people in 2006 might have had the intuition, “Well, ‘stochastic’ is right there as the ‘S’ in SGD, of course the outcome is going to be unpredictable.”
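For a sense of what that first kind of experiment looks like, here’s a minimal sketch under deliberately toy assumptions: train two copies of the same small network from different random initializations, then evaluate loss along the straight line between their weights. (Draxler et al. actually construct curved low-loss paths; naive linear interpolation is the crudest version and often needs extra tricks like neuron permutation alignment on real networks. Everything below is a toy placeholder, not their setup.)

```python
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))

def train(net, x, y, steps=500):
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(x), y).backward()
        opt.step()
    return net

# Toy 2-D classification data (placeholder for a real dataset).
torch.manual_seed(0)
x = torch.randn(512, 2)
y = (x[:, 0] * x[:, 1] > 0).long()

# Two independent training runs from different random initializations.
net_a, net_b = train(make_net(), x, y), train(make_net(), x, y)
sd_a, sd_b = net_a.state_dict(), net_b.state_dict()

# Loss along the straight line between the two solutions in weight space.
probe, loss_fn = make_net(), nn.CrossEntropyLoss()
for i in range(11):
    alpha = i / 10
    probe.load_state_dict({k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a})
    with torch.no_grad():
        print(f"alpha={alpha:.1f}  loss={loss_fn(probe(x), y).item():.3f}")
```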
it does seem to me like an AI trained using modern methods, e.g. constitutional AI, is insufficiently constrained to embody human-compatible values [...] the black shape is still basically unpredictable from the perspective of the teal-shape drawer
I think it’s worth being really specific about what kind of “AI” you have in mind when you make this kind of claim. You might think, “Well, obviously I’m talking about superintelligence; this is a comment thread about a book about why people shouldn’t build superintelligence.”

The problem is that if you try to persuade people not to build superintelligence using arguments that seem to apply just as well to the kind of AI we have today, you’re not going to be very convincing when people talk to human-compatible AIs behaving pretty much the way their creators intended, all the time, every day.
That’s what I’m focused on in this thread: the arguments, not the conclusion. (This methodology is probably super counterintuitive to a lot of people, but it’s part of this website’s core canon.) I’m definitely not saying anyone knows how to train the “entire spectrum of reflectively consistent human values”. That’s philosophy, which is hard. I’m thinking about a much narrower question of computer science.
Namely: if I take the black shape metaphor or Chapter 4 of If Anyone Builds It at face value, it’s pretty confusing how RLAIF approaches like constitutional AI can work at all. Not just when hypothetically scaled to superintelligence. I mean, at all. Upthread, I wrote about how people customize base language models by upweighting trajectories chosen by a model trained to predict human approval and disapproval ratings.

In RLAIF, they use an LLM itself to provide the ratings instead of any actual humans. If you only read MIRI’s propaganda (in its literal meaning, “public communication aimed at influencing an audience and furthering an agenda”) and don’t read arXiv, that just sounds suicidal.

But it’s working! (For now.) It’s working better than the version with actual human preference rankings! Why? How? Prosaic alignment optimists would say: it learned the intended Platonic representation from pretraining. Are they wrong? Maybe! (I’m still worried about what happens if you optimize too hard against the learned representation.)
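Concretely, the loop I have in mind is something like the toy sketch below: sample trajectories from a policy, have a judge score them, and upweight the ones the judge approved of (vanilla REINFORCE with a mean baseline). The “policy” and “judge” here are deliberately trivial placeholders (a distribution over four canned responses and a fixed score table), not any lab’s actual pipeline:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a "policy" over four canned responses and an "AI judge" that
# scores them. In real RLAIF the judge is itself an LLM applying a constitution.
responses = ["helpful answer", "polite refusal", "rude answer", "off-topic ramble"]
judge_scores = torch.tensor([1.0, 0.3, -1.0, -0.5])  # hypothetical ratings

logits = torch.zeros(len(responses), requires_grad=True)  # stand-in for the base model
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    sample = dist.sample((64,))            # sample a batch of trajectories
    reward = judge_scores[sample]          # the judge rates each one
    # REINFORCE with a mean baseline: upweight trajectories the judge approved of.
    loss = -(dist.log_prob(sample) * (reward - reward.mean())).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Probability mass shifts toward the judge-approved responses.
print(dict(zip(responses, F.softmax(logits, dim=0).tolist())))
```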
But in order to convince policymakers that the prosaic alignment optimists are wrong (while the prosaic alignment optimists are passing them bars of AI-printed gold under the table), you’re going to need a stronger argument than “the black shape is still basically unpredictable from the perspective of the teal-shape drawer”. If it were actually unpredictable, where is all this gold coming from?
The black thing conforms to all the regularities. This does not by coincidence happen to cause it to occupy the shape you hoped for; you do not see all the non-robust / illegible features constraining that shape
While we’re still in the regime of pretraining on largely human-generated data, this is arguably great for alignment. You don’t have to understand the complex structure of human value; you can just point SGD at valuable data and get a network that spits out “more from that distribution”, without any risk of accidentally leaving out boredom and destroying all value that way.
Obviously, that doesn’t mean the humans are out of the woods. As the story of Earth-originating intelligent life goes on, and the capabilities of Society’s cutting-edge AIs start coming more and more from reinforcement learning and less and less from pretraining, you start to run a higher risk of misspecifying your rewards, eventually fatally. But that world looks a lot more like Christiano’s “you get what you measure” scenario than Part II of If Anyone Builds It, even if the humans are dead at the end of both stories. And the details matter for deciding which interventions are most dignified—possibly even if you think governance is more promising than alignment research. (Which specific regulations you want in your Pause treaty depends on which AI techniques are feasible and which ones are dangerous.)
the summary is incorrect to analogize the black thing to “architectures” instead of “parametrizations” or “functions”
Yes, the word choice of “architectures” in the phrase “many, many, many different complex architectures” in the article is puzzling. I don’t know what the author meant by that word, but to modern AI practitioners, “architecture” is the part of the system that is designed rather than “grown”: these-and-such many layers with such-and-these activation functions—the matrices, not the numbers inside them.